Commit Graph

873650 Commits

Author SHA1 Message Date
Kairui Song b7e9b568c2 libfdt: include fdt_addresses.c
Upstream: c273a2bd8a
Link: 887436bdb7

commit c273a2bd8a
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date:   Mon Dec 9 12:03:44 2019 +0900

    libfdt: include fdt_addresses.c

    In the implementation of kexec_file_load-based kdump for arm64,
    fdt_appendprop_addrrange() will be needed.

    So include fdt_addresses.c in making libfdt.

    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Rob Herring <robh+dt@kernel.org>
    Cc: Frank Rowand <frowand.list@gmail.com>
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song 6fb78c4cc6 arm64: kdump: remove dependency on arm64_dma32_phys_limit
From: Yi Li <adamliyi@msn.com>
Link: 696027f109

The patch b2da6ad294
(arm64: kdump: reimplement crashkernel=X) depends on commit 1a8e1cef76
("arm64: use both ZONE_DMA and ZONE_DMA32").

Commit 1a8e1cef76 has not been ported to the 5.4 kernel, so use
arm64_dma_phys_limit instead.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song fd02a1b5bc kdump: update Documentation about crashkernel
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 023deaec32

For arm64, the behavior of crashkernel=X has been changed: it now
tries a low allocation in the DMA zone (or the DMA32 zone if
CONFIG_ZONE_DMA is disabled) and falls back to a high allocation if
that fails.

We can also use "crashkernel=X,high" to select a high region above the
DMA zone, which also tries to allocate at least 256M of low memory in the
DMA zone automatically (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).

"crashkernel=Y,low" can be used to allocate low memory of a specified size.

So update the Documentation.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song 3fd41ff677 arm64: kdump: add memory for devices by DT property linux,usable-memory-range
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 2012a3b392

When reserving crashkernel in high memory, some low memory is reserved
for crash dump kernel devices and never mapped by the first kernel.
This memory range is advertised to crash dump kernel via DT property
under /chosen,
	linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>

We reuse the DT property linux,usable-memory-range and make the low
memory region the second range "BASE2 SIZE2", which keeps compatibility
with existing user-space and older kdump kernels.

The crash dump kernel reads this property at boot time and calls
memblock_add() to add the low memory region after
memblock_cap_memory_range() has been called.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
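The linux,usable-memory-range layout described in the commit above can be
illustrated with a small decoder. This is a minimal sketch, not the kernel's
parser: it assumes #address-cells = 2 and #size-cells = 2, the function name
parse_usable_memory_range is hypothetical, and the example addresses are made
up for illustration.

```python
import struct

def parse_usable_memory_range(prop, addr_cells=2, size_cells=2):
    """Decode <BASE1 SIZE1 [BASE2 SIZE2]> into a list of (base, size)."""
    cell = 4  # a devicetree cell is a 32-bit big-endian integer
    entry = (addr_cells + size_cells) * cell
    if len(prop) == 0 or len(prop) % entry != 0:
        raise ValueError("property length is not a whole number of ranges")
    ranges = []
    for off in range(0, len(prop), entry):
        cells = struct.unpack_from(f">{addr_cells + size_cells}I", prop, off)
        base = 0
        for c in cells[:addr_cells]:   # join the address cells
            base = (base << 32) | c
        size = 0
        for c in cells[addr_cells:]:   # join the size cells
            size = (size << 32) | c
        ranges.append((base, size))
    return ranges

# Hypothetical property: high crash kernel region first, low region second.
example = struct.pack(">QQQQ",
                      0x2_0000_0000, 0x2000_0000,   # BASE1 SIZE1
                      0x6000_0000, 0x1000_0000)     # BASE2 SIZE2
ranges = parse_usable_memory_range(example)
```

A property holding only BASE1 SIZE1 decodes to a single range, which matches
the compatibility note in the commit message: older consumers that read only
the first pair still see the main crash kernel region.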
Kairui Song d001dccf2b x86, arm64: Add ARCH_WANT_RESERVE_CRASH_KERNEL config
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: c8013ee6cd

We make the functions reserve_crashkernel[_low]() generic for
x86 and arm64. Since the reserve_crashkernel[_low]() implementations
are quite similar on other architectures as well, more users can
adopt this later.

So add CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL to arch/Kconfig and
select it from X86 and ARM64.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song bd482067c3 arm64: kdump: reimplement crashkernel=X
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 70e586365f

There are the following issues in arm64 kdump:
1. We use crashkernel=X to reserve the crashkernel below 4G, which
fails when there is not enough low memory.
2. If the crashkernel is reserved above 4G, the crash dump kernel
fails to boot because there is no low memory available for
allocation.
3. Since commit 1a8e1cef76 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
devices in the crash dump kernel that need ZONE_DMA fail to allocate
memory.

To solve these issues, change the behavior of crashkernel=X and
introduce crashkernel=X,[high,low]. crashkernel=X tries a low allocation
in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is disabled) and
falls back to a high allocation if that fails.
We can also use "crashkernel=X,high" to select a region above the DMA zone,
which also tries to allocate at least 256M in the DMA zone automatically
(or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
"crashkernel=Y,low" can be used to allocate low memory of a specified size.

Another minor change: there may be two regions reserved for the crash
dump kernel. To distinguish the low region from the high region without
affecting existing kexec-tools, rename the low region to
"Crash kernel (low)".

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
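The crashkernel=X and crashkernel=X,high policy described in the commit above
can be sketched as a toy model. This is not the kernel implementation: the
helper names, the lowest-address-first search, the 2M alignment, and the
example memory map are all assumptions made for illustration. The model also
assumes that a fallback to high memory reserves the same 256M low region as
crashkernel=X,high does.

```python
MiB = 1 << 20
GiB = 1 << 30

def find_free(free, size, lo, hi, align):
    """Return the lowest aligned base in [lo, hi) that fits size, or None."""
    for start, end in sorted(free):
        base = max(start, lo)
        base = (base + align - 1) // align * align  # round up to alignment
        if base + size <= min(end, hi):
            return base
    return None

def reserve_crashkernel(free, size, low_max, high_max, align=2 * MiB,
                        high=False, low_size=256 * MiB):
    """Model crashkernel=X (high=False) and crashkernel=X,high (high=True)."""
    regions = {}
    # crashkernel=X tries low memory first; crashkernel=X,high skips this.
    base = None if high else find_free(free, size, 0, low_max, align)
    if base is not None:
        regions["Crash kernel"] = (base, size)
        return regions
    # Fall back to (or directly select) a region above the low limit.
    base = find_free(free, size, low_max, high_max, align)
    if base is None:
        return regions  # nothing could be reserved
    regions["Crash kernel"] = (base, size)
    # A high region still needs some low memory for DMA-capable devices.
    low_base = find_free(free, low_size, 0, low_max, align)
    if low_base is not None:
        regions["Crash kernel (low)"] = (low_base, low_size)
    return regions

# Hypothetical memory map: 512M free below 4G, 4G free above it.
free = [(1 * GiB, 1 * GiB + 512 * MiB), (4 * GiB, 8 * GiB)]
small = reserve_crashkernel(free, 128 * MiB, 4 * GiB, 1 << 48)
large = reserve_crashkernel(free, 1 * GiB, 4 * GiB, 1 << 48)
```

Here the 128M request fits in low memory and yields a single region, while the
1G request falls back to high memory and gains a second "Crash kernel (low)"
region.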
Kairui Song f30355b620 arm64: kdump: introduce some macroes for crash kernel reservation
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 667118f8c1

Introduce macro CRASH_ALIGN for alignment, macro CRASH_ADDR_LOW_MAX
for the upper bound of low crash memory, and macro CRASH_ADDR_HIGH_MAX
for the upper bound of high crash memory, and use these macros instead.

Besides, to stay consistent with x86, use CRASH_ALIGN as the lower bound
of crash kernel reservation.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song 21ff8ff8f3 x86/elf: Move vmcore_elf_check_arch_cross to arch/x86/include/asm/elf.h
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: b332ab8970

Move macro vmcore_elf_check_arch_cross from arch/x86/include/asm/kexec.h
to arch/x86/include/asm/elf.h to fix the following compiling warning:

In file included from arch/x86/kernel/setup.c:39:0:
./arch/x86/include/asm/kexec.h:77:0: warning: "vmcore_elf_check_arch_cross" redefined
 # define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)

In file included from arch/x86/kernel/setup.c:9:0:
./include/linux/crash_dump.h:39:0: note: this is the location of the previous definition
 #define vmcore_elf_check_arch_cross(x) 0

The root cause is that vmcore_elf_check_arch_cross under CONFIG_CRASH_CORE
depends on CONFIG_KEXEC_CORE. Commit 532b66d2279d ("x86: kdump: move
reserve_crashkernel[_low]() into crash_core.c") triggered the issue.

As suggested by Mike, simply move vmcore_elf_check_arch_cross from
arch/x86/include/asm/kexec.h to arch/x86/include/asm/elf.h to fix
the warning.

Fixes: 532b66d2279d ("x86: kdump: move reserve_crashkernel[_low]() into crash_core.c")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:59 +08:00
Kairui Song 3177fa46ec x86: kdump: move reserve_crashkernel[_low]() into crash_core.c
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8cb8686864

Make the functions reserve_crashkernel[_low]() generic.
Arm64 will use these to reimplement crashkernel=X.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:59 +08:00
Kairui Song e5eac006f6 x86: kdump: move xen_pv_domain() check and insert_resource() to setup_arch()
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8ec4a816f2

We will make reserve_crashkernel() generic. The xen_pv_domain() check
in reserve_crashkernel() is relevant only to x86, as is
insert_resource() in reserve_crashkernel[_low]().
So move the xen_pv_domain() check and insert_resource() to setup_arch()
to keep them in x86.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:59 +08:00
Kairui Song 2fbb10e99b x86: kdump: use macro CRASH_ADDR_LOW_MAX in functions reserve_crashkernel()
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: a2e0b4351d

To make reserve_crashkernel() generic, replace some hard-coded
numbers with macro CRASH_ADDR_LOW_MAX.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:58 +08:00
Kairui Song cc6803d7d8 x86: kdump: make the lower bound of crash kernel reservation consistent
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8882ba540e

The lower bounds of crash kernel reservation and crash kernel low
reservation differ; use the consistent value CRASH_ALIGN for both.

Suggested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:58 +08:00
Kairui Song b3ab0276fe x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 873384fe79

Move CRASH_ALIGN to header asm/kexec.h for later use. Besides, the
alignment of crash kernel regions in x86 is 16M (CRASH_ALIGN), but
function reserve_crashkernel() also used 1M alignment. So just
replace the hard-coded 1M alignment with macro CRASH_ALIGN.

Suggested-by: Dave Young <dyoung@redhat.com>
Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:57 +08:00
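The difference between the 1M and 16M alignment mentioned in the commit above
can be shown with a short worked example. This is an illustrative sketch only:
align_up is a hypothetical helper and the candidate address is made up; the
16M value is the one quoted in the commit message.

```python
MiB = 1 << 20
CRASH_ALIGN = 16 * MiB  # x86 value quoted in the commit message

def align_up(addr, align):
    """Round addr up to the next multiple of align (a power of two)."""
    return (addr + align - 1) & ~(align - 1)

candidate = 0x0123_4567                       # hypothetical unaligned base
base_1m = align_up(candidate, 1 * MiB)        # old hard-coded alignment
base_16m = align_up(candidate, CRASH_ALIGN)   # alignment after the change
```

With these inputs the 1M alignment yields 0x01300000, which is not a multiple
of 16M, while aligning with CRASH_ALIGN yields 0x02000000.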
Kairui Song 47c1cc9217 arm64: remove the hard coded crashkernel address limit
This conflicts with upstream's kdump high reservation support, and we
already have CONFIG_ZONE_DMA32 set, so we have:

ARCH_LOW_ADDRESS_LIMIT = min(offset + (1ULL << 32), memblock_end_of_DRAM());

This already limits the address to below 4G, so the hard-coded limit is
redundant.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:57 +08:00
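The ARCH_LOW_ADDRESS_LIMIT expression quoted in the commit above can be
evaluated for a couple of memory layouts. This is a worked example under
assumed inputs: the function name is hypothetical, offset is taken as 0, and
the DRAM end addresses are made up.

```python
GiB = 1 << 30

def arch_low_address_limit(offset, end_of_dram):
    """Mirror of min(offset + (1ULL << 32), memblock_end_of_DRAM())."""
    return min(offset + (1 << 32), end_of_dram)

limit_big = arch_low_address_limit(0, 16 * GiB)   # DRAM extends past 4G
limit_small = arch_low_address_limit(0, 2 * GiB)  # DRAM ends below 4G
```

With offset 0 the result never exceeds 4G, which is why a separate hard-coded
4G cap adds nothing in that case.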
Alex Shi 6bc9581ddd Revert "gup: document and work around "COW can break either way" issue"
This reverts commit 918f50807eccd63d482ef4cf778b1d2b416770a9.
That commit forces COW into write mode, which forces COW breaking and
increases page usage considerably. Upstream, commit 376a34efa ("mm/gup:
refactor and de-duplicate gup_fast() code") provides another way to fix the
fork security issue with COW, and the buggy commit was then reverted by
commit a308c71bf1 ("mm/gup: Remove enfornced COW mechanism").

Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:50:57 +08:00
Peter Xu 2453865ed4 mm/ksm: Remove reuse_ksm_page()
Remove the function as the last reference has gone away with the do_wp_page()
changes.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1a0cf26323)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:50:56 +08:00
Linus Torvalds 0fb4d8fd75 mm: do_wp_page() simplification
commit 09854ba94c upstream
How about we just make sure we're the only possible valid user of the
page before we bother to reuse it?

Simplify, simplify, simplify.

And get rid of the nasty serialization on the page lock at the same time.

[peterx: add subject prefix]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 09854ba94c)
Signed-off-by: Alex Shi <alexsshi@tencent.com>

Conflicts:
	mm/memory.c
2024-06-11 20:50:56 +08:00
Yuehong Wu 85ba10e6ef config: enable BFQ io scheduler
Enable CONFIG_IOSCHED_BFQ and CONFIG_BFQ_GROUP_IOSCHED for ARM to
support the BFQ I/O scheduler.

Signed-off-by: Yuehong Wu <yuehongwu@tencent.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:53 +08:00
Ni Xun 3c048b0f89 config: change CONFIG_CONFIGFS_FS to Y for default conf
Change CONFIG_CONFIGFS_FS from M to Y in the default ARM config.

Signed-off-by: Ni Xun <richardni@tencent.com>
2024-06-11 20:50:40 +08:00
KP Singh 5e0977fd08 security: Fix hook iteration for secid_to_secctx
[upstream commit 0550cfe8c2]

secid_to_secctx is not stackable, and since the BPF LSM registers this
hook by default, the call_int_hook logic, which "bails-on-fail", is not
suitable: it causes issues when other LSMs register this hook and
eventually breaks Audit.

In order to fix this, directly iterate over the security hooks instead
of using call_int_hook as suggested in:

https://lore.kernel.org/bpf/9d0eb6c6-803a-ff3a-5603-9ad6d9edfc00@schaufler-ca.com/#t

Fixes: 98e828a065 ("security: Refactor declaration of LSM hooks")
Fixes: 625236ba38 ("security: Fix the default value of secid_to_secctx hook")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200520125616.193765-1-kpsingh@chromium.org
Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:50:15 +08:00
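The difference between the "bails-on-fail" call_int_hook logic and direct hook
iteration, as described in the commit above, can be modelled in a few lines.
This is a simplified sketch, not the kernel code: hooks are plain functions
returning an int, both hook functions are hypothetical stand-ins, and the
-EOPNOTSUPP default is taken from the related "Fix the default value of
secid_to_secctx hook" commit in this log.

```python
import errno

LSM_RET_DEFAULT = -errno.EOPNOTSUPP  # default return value of the hook

def call_int_hook_style(hooks, secid):
    """Bail-on-fail: stop at the first hook that returns non-zero."""
    rc = 0
    for hook in hooks:
        rc = hook(secid)
        if rc != 0:
            break
    return rc

def iterate_style(hooks, secid):
    """Return the first result that differs from the default value."""
    for hook in hooks:
        rc = hook(secid)
        if rc != LSM_RET_DEFAULT:
            return rc
    return LSM_RET_DEFAULT

def bpf_default_hook(secid):
    return LSM_RET_DEFAULT  # hypothetical stub: BPF LSM has no answer

def labelling_hook(secid):
    return 0  # hypothetical LSM that can translate the secid

hooks = [bpf_default_hook, labelling_hook]
bail_rc = call_int_hook_style(hooks, 1)
iter_rc = iterate_style(hooks, 1)
```

In this model the bail-on-fail loop stops at the stub and reports
-EOPNOTSUPP even though a later hook could have handled the request, whereas
the iterating version skips the stub and returns the later hook's result.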
soonflywang f1f1da34d4 arm64: fix NEON/VFP reentrant in fast_copy_page
Add a fixup in fast_copy_page. This feature is disabled by default;
set vm.fast_copy_page_enabled to enable it.

Signed-off-by: soonflywang <soonflywang@tencent.com>
Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
2024-06-11 20:50:14 +08:00
soonflywang ccb2a062d5 arm64: implemented a fast copy_page version while NEON/VFP is met
Arm server CPUs usually provide the NEON/VFP extension. This patch
leverages SIMD instructions to speed up the current copy_page().

Signed-off-by: soonflywang <soonflywang@tencent.com>
Signed-off-by: Chengdong Li <chengdongli@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
2024-06-11 20:50:14 +08:00
Anders Roxell c7ff1ae2e7 security: Fix the default value of secid_to_secctx hook
Upstream commit 625236ba38

security_secid_to_secctx is called by the bpf_lsm hook and a successful
return value (i.e. 0) implies that the parameter will be consumed by the
LSM framework. The current behaviour returns success when the pointer
isn't initialized when CONFIG_BPF_LSM is enabled, with the default
return from kernel/bpf/bpf_lsm.c.

This is the internal error:

[ 1229.341488][ T2659] usercopy: Kernel memory exposure attempt detected from null address (offset 0, size 280)!
[ 1229.374977][ T2659] ------------[ cut here ]------------
[ 1229.376813][ T2659] kernel BUG at mm/usercopy.c:99!
[ 1229.378398][ T2659] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1229.380348][ T2659] Modules linked in:
[ 1229.381654][ T2659] CPU: 0 PID: 2659 Comm: systemd-journal Tainted: G    B   W         5.7.0-rc5-next-20200511-00019-g864e0c6319b8-dirty #13
[ 1229.385429][ T2659] Hardware name: linux,dummy-virt (DT)
[ 1229.387143][ T2659] pstate: 80400005 (Nzcv daif +PAN -UAO BTYPE=--)
[ 1229.389165][ T2659] pc : usercopy_abort+0xc8/0xcc
[ 1229.390705][ T2659] lr : usercopy_abort+0xc8/0xcc
[ 1229.392225][ T2659] sp : ffff000064247450
[ 1229.393533][ T2659] x29: ffff000064247460 x28: 0000000000000000
[ 1229.395449][ T2659] x27: 0000000000000118 x26: 0000000000000000
[ 1229.397384][ T2659] x25: ffffa000127049e0 x24: ffffa000127049e0
[ 1229.399306][ T2659] x23: ffffa000127048e0 x22: ffffa000127048a0
[ 1229.401241][ T2659] x21: ffffa00012704b80 x20: ffffa000127049e0
[ 1229.403163][ T2659] x19: ffffa00012704820 x18: 0000000000000000
[ 1229.405094][ T2659] x17: 0000000000000000 x16: 0000000000000000
[ 1229.407008][ T2659] x15: 0000000000000000 x14: 003d090000000000
[ 1229.408942][ T2659] x13: ffff80000d5b25b2 x12: 1fffe0000d5b25b1
[ 1229.410859][ T2659] x11: 1fffe0000d5b25b1 x10: ffff80000d5b25b1
[ 1229.412791][ T2659] x9 : ffffa0001034bee0 x8 : ffff00006ad92d8f
[ 1229.414707][ T2659] x7 : 0000000000000000 x6 : ffffa00015eacb20
[ 1229.416642][ T2659] x5 : ffff0000693c8040 x4 : 0000000000000000
[ 1229.418558][ T2659] x3 : ffffa0001034befc x2 : d57a7483a01c6300
[ 1229.420610][ T2659] x1 : 0000000000000000 x0 : 0000000000000059
[ 1229.422526][ T2659] Call trace:
[ 1229.423631][ T2659]  usercopy_abort+0xc8/0xcc
[ 1229.425091][ T2659]  __check_object_size+0xdc/0x7d4
[ 1229.426729][ T2659]  put_cmsg+0xa30/0xa90
[ 1229.428132][ T2659]  unix_dgram_recvmsg+0x80c/0x930
[ 1229.429731][ T2659]  sock_recvmsg+0x9c/0xc0
[ 1229.431123][ T2659]  ____sys_recvmsg+0x1cc/0x5f8
[ 1229.432663][ T2659]  ___sys_recvmsg+0x100/0x160
[ 1229.434151][ T2659]  __sys_recvmsg+0x110/0x1a8
[ 1229.435623][ T2659]  __arm64_sys_recvmsg+0x58/0x70
[ 1229.437218][ T2659]  el0_svc_common.constprop.1+0x29c/0x340
[ 1229.438994][ T2659]  do_el0_svc+0xe8/0x108
[ 1229.440587][ T2659]  el0_svc+0x74/0x88
[ 1229.441917][ T2659]  el0_sync_handler+0xe4/0x8b4
[ 1229.443464][ T2659]  el0_sync+0x17c/0x180
[ 1229.444920][ T2659] Code: aa1703e2 aa1603e1 910a8260 97ecc860 (d4210000)
[ 1229.447070][ T2659] ---[ end trace 400497d91baeaf51 ]---
[ 1229.448791][ T2659] Kernel panic - not syncing: Fatal exception
[ 1229.450692][ T2659] Kernel Offset: disabled
[ 1229.452061][ T2659] CPU features: 0x240002,20002004
[ 1229.453647][ T2659] Memory Limit: none
[ 1229.455015][ T2659] ---[ end Kernel panic - not syncing: Fatal exception ]---

Rework this so that the default return value is -EOPNOTSUPP.

There are likely other callbacks, such as security_inode_getsecctx(), that
may have the same problem; someone who understands the code better needs
to audit them.

Thank you Arnd for helping me figure out what went wrong.

Fixes: 98e828a065 ("security: Refactor declaration of LSM hooks")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Chun Liu <kaicliu@tencent.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/bpf/20200512174607.9630-1-anders.roxell@linaro.org
2024-06-11 20:49:57 +08:00
Xinghui Li 5c899e5403 firmware: fix one UAF issue
There could be a use-after-free in dmi_sysfs_register_handle.
While handling specializations, entry->child may be freed if an
error occurs, yet kobject_put is still called on it afterwards.
So set entry->child to NULL to avoid this case.

Reported-by: loydlv <loydlv@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:49:56 +08:00
Xinghui Li 84daaf3511 media:cec:fix double free and uaf issue when cancel data during noblocking
Data could be freed before it is completed during transmit if
nonblocking mode is used. In this case, the regular free could lead
to a double free. So add the return value '-EPERM' to mark this
case.

Reported-by: loydlv <loydlv@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:49:56 +08:00
Lv Yunlong ca589c18a1 gpu/xen: Fix a use after free in xen_drm_drv_init
commit 52762efa2b upstream.

The function displback_changed has the call chain
displback_connect(front_info)->xen_drm_drv_init(front_info).
There, drm_info is assigned to front_info->drm_info
and is freed in the fail branch of xen_drm_drv_init().

Later, displback_disconnect(front_info) is called and it calls
xen_drm_drv_fini(front_info), which causes a use after free through
the drm_info = front_info->drm_info statement.

This patch does two things. First, it fixes the fail label, which
still freed drm_info when drm_info = kzalloc() failed.
Second, it sets front_info->drm_info to NULL to avoid the UAF.

Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210323014656.10068-1-lyl2019@mail.ustc.edu.cn
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reviewed-by: Robinlai <robinlai@tencent.com>
2024-06-11 20:49:56 +08:00
Vinicius Costa Gomes 627ec74d6a igc: Fix use-after-free error during reset
upstream commit: 56ea7ed103

Cleans the next descriptor to watch (next_to_watch) when cleaning the
TX ring.

Failure to do so can cause invalid memory accesses. If igc_poll() runs
while the controller is being reset, this can lead to the driver trying
to free an skb that was already freed.

Log message:

 [  101.525242] refcount_t: underflow; use-after-free.
 [  101.525251] WARNING: CPU: 1 PID: 646 at lib/refcount.c:28 refcount_warn_saturate+0xab/0xf0
 [  101.525259] Modules linked in: sch_etf(E) sch_mqprio(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E)
 x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) binfmt_misc(E) kvm_intel(E) kvm(E) irqbypass(E) crc32_pclmul(E)
 ghash_clmulni_intel(E) aesni_intel(E) mei_wdt(E) libaes(E) crypto_simd(E) cryptd(E) glue_helper(E) snd_hda_codec_hdmi(E)
 rapl(E) intel_cstate(E) snd_hda_intel(E) snd_intel_dspcfg(E) sg(E) soundwire_intel(E) intel_uncore(E) at24(E)
 soundwire_generic_allocation(E) iTCO_wdt(E) soundwire_cadence(E) intel_pmc_bxt(E) serio_raw(E) snd_hda_codec(E)
 iTCO_vendor_support(E) watchdog(E) snd_hda_core(E) snd_hwdep(E) snd_soc_core(E) snd_compress(E) snd_pcsp(E)
 soundwire_bus(E) snd_pcm(E) evdev(E) snd_timer(E) mei_me(E) snd(E) soundcore(E) mei(E) configfs(E) ip_tables(E) x_tables(E)
 autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E)
 i915(E) ahci(E) libahci(E) ehci_pci(E) igb(E) xhci_pci(E) ehci_hcd(E)
 [  101.525303]  drm_kms_helper(E) dca(E) xhci_hcd(E) libata(E) crct10dif_pclmul(E) cec(E) crct10dif_common(E) tsn(E) igc(E)
 e1000e(E) ptp(E) i2c_i801(E) crc32c_intel(E) psmouse(E) i2c_algo_bit(E) i2c_smbus(E) scsi_mod(E) lpc_ich(E) pps_core(E)
 usbcore(E) drm(E) button(E) video(E)
 [  101.525318] CPU: 1 PID: 646 Comm: irq/37-enp7s0-T Tainted: G            E     5.10.30-rt37-tsn1-rt-ipipe #ipipe
 [  101.525320] Hardware name: SIEMENS AG SIMATIC IPC427D/A5E31233588, BIOS V17.02.09 03/31/2017
 [  101.525322] RIP: 0010:refcount_warn_saturate+0xab/0xf0
 [  101.525325] Code: 05 31 48 44 01 01 e8 f0 c6 42 00 0f 0b c3 80 3d 1f 48 44 01 00 75 90 48 c7 c7 78 a8 f3 a6 c6 05 0f 48
 44 01 01 e8 d1 c6 42 00 <0f> 0b c3 80 3d fe 47 44 01 00 0f 85 6d ff ff ff 48 c7 c7 d0 a8 f3
 [  101.525327] RSP: 0018:ffffbdedc0917cb8 EFLAGS: 00010286
 [  101.525329] RAX: 0000000000000000 RBX: ffff98fd6becbf40 RCX: 0000000000000001
 [  101.525330] RDX: 0000000000000001 RSI: ffffffffa6f2700c RDI: 00000000ffffffff
 [  101.525332] RBP: ffff98fd6becc14c R08: ffffffffa7463d00 R09: ffffbdedc0917c50
 [  101.525333] R10: ffffffffa74c3578 R11: 0000000000000034 R12: 00000000ffffff00
 [  101.525335] R13: ffff98fd6b0b1000 R14: 0000000000000039 R15: ffff98fd6be35c40
 [  101.525337] FS:  0000000000000000(0000) GS:ffff98fd6e240000(0000) knlGS:0000000000000000
 [  101.525339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  101.525341] CR2: 00007f34135a3a70 CR3: 0000000150210003 CR4: 00000000001706e0
 [  101.525343] Call Trace:
 [  101.525346]  sock_wfree+0x9c/0xa0
 [  101.525353]  unix_destruct_scm+0x7b/0xa0
 [  101.525358]  skb_release_head_state+0x40/0x90
 [  101.525362]  skb_release_all+0xe/0x30
 [  101.525364]  napi_consume_skb+0x57/0x160
 [  101.525367]  igc_poll+0xb7/0xc80 [igc]
 [  101.525376]  ? sched_clock+0x5/0x10
 [  101.525381]  ? sched_clock_cpu+0xe/0x100
 [  101.525385]  net_rx_action+0x14c/0x410
 [  101.525388]  __do_softirq+0xe9/0x2f4
 [  101.525391]  __local_bh_enable_ip+0xe3/0x110
 [  101.525395]  ? irq_finalize_oneshot.part.47+0xe0/0xe0
 [  101.525398]  irq_forced_thread_fn+0x6a/0x80
 [  101.525401]  irq_thread+0xe8/0x180
 [  101.525403]  ? wake_threads_waitq+0x30/0x30
 [  101.525406]  ? irq_thread_check_affinity+0xd0/0xd0
 [  101.525408]  kthread+0x183/0x1a0
 [  101.525412]  ? kthread_park+0x80/0x80
 [  101.525415]  ret_from_fork+0x22/0x30

Fixes: 13b5b7fd6a ("igc: Add support for Tx/Rx rings")
Reported-by: Erez Geva <erez.geva.ext@siemens.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: jackjunliu <jackjunliu@tencent.com>
2024-06-11 20:49:55 +08:00
Sasha Neftin b22ddb4543 igc: Remove _I_PHY_ID checking
upstream commit: 7c496de538

i225 devices have only one PHY vendor. There is no point checking
_I_PHY_ID during the link establishment and auto-negotiation process.
This patch cleans up these pointless checks.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: jackjunliu <jackjunliu@tencent.com>
2024-06-11 20:49:55 +08:00
Ni Xun 8782b3a1b0 config: change CONFIG_CONFIGFS_FS to Y for AARCH64
With CONFIG_CONFIGFS_FS=m, the default cpu cgroup has user.slice, which
roughly halves the unixbench Pipe-based Context Switching score
(39660 -> 19626).

Signed-off-by: Ni Xun <richardni@tencent.com>
2024-06-11 20:49:55 +08:00
johnnyaiai 45a150cd8d ARM64/conf: Disable CONFIG_RODATA_FULL_DEFAULT_ENABLED
[tapd]
ID877978657

This configuration resulted in a 15% regression on
unixbench's execl testing.

This additional enhancement can be turned on with
rodata=full after this patch.

Signed-off-by: johnnyaiai <johnnyaiai@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
2024-06-11 20:49:52 +08:00
Hangyu Hua 3faa1c1ccc xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in xfrm_bundle_lookup()
upstream commit: f85daf0e72

xfrm_policy_lookup() calls xfrm_pol_hold_rcu() to take a reference on
pols[0]. This reference can be dropped in xfrm_expand_policies() when
xfrm_expand_policies() returns an error, at which point the refcount of
pols[0] is balanced. But xfrm_bundle_lookup() also calls xfrm_pols_put()
with num_pols == 1 to drop this reference when xfrm_expand_policies()
returns an error.

This patch also fixes an illegal address access. pols[0] holds an error
pointer when xfrm_policy_lookup() fails. This leads xfrm_pols_put() to
dereference an illegal address in xfrm_bundle_lookup()'s error path.

Fix both by setting num_pols = 0 in xfrm_expand_policies()'s error path.

Fixes: 80c802f307 ("xfrm: cache bundles instead of policies for outgoing flows")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-06-11 20:49:23 +08:00
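The double put described in the commit above can be reproduced with a toy
reference-count model. This is a sketch under simplifying assumptions: the
function names echo the commit message, but their bodies, the Policy class,
and the state dictionary are hypothetical and only capture the refcount
bookkeeping.

```python
class Policy:
    def __init__(self):
        self.refcnt = 1  # reference taken by the policy lookup

def pols_put(pols, num_pols):
    """Drop one reference on each of the first num_pols policies."""
    for pol in pols[:num_pols]:
        pol.refcnt -= 1

def expand_policies(pols, state, fail, fixed):
    if fail:
        pols_put(pols, state["num_pols"])  # error path drops the reference
        if fixed:
            state["num_pols"] = 0          # the fix: nothing left to put
        return -1
    return 0

def bundle_lookup(fail, fixed):
    """Return the final refcount of pols[0] after the lookup path runs."""
    pol = Policy()
    pols = [pol]
    state = {"num_pols": 1}
    if expand_policies(pols, state, fail, fixed) < 0:
        pols_put(pols, state["num_pols"])  # caller's error path
    return pol.refcnt

before_fix = bundle_lookup(fail=True, fixed=False)
after_fix = bundle_lookup(fail=True, fixed=True)
```

In the model, the unfixed error path drives the count below zero because the
same reference is dropped twice, while resetting num_pols to 0 makes the
caller's put a no-op and leaves the count balanced at zero.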
Duoming Zhou b96175c56f NFC: netlink: fix sleep in atomic bug when firmware download timeout
upstream commit: 4071bf121d

There is a sleep-in-atomic bug that could cause a kernel panic during
the firmware download process. The root cause is that nlmsg_new() with
the GFP_KERNEL parameter is called in fw_dnld_timeout, which is a timer
handler. The call trace is shown below:

BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
Call Trace:
kmem_cache_alloc_node
__alloc_skb
nfc_genl_fw_download_done
call_timer_fn
__run_timers.part.0
run_timer_softirq
__do_softirq
...

nlmsg_new() with the GFP_KERNEL parameter may sleep during memory
allocation, and the timer handler runs as the result of
a "software interrupt", which should not call any function
that could sleep.

This patch changes the allocation mode of the netlink message from
GFP_KERNEL to GFP_ATOMIC to prevent the sleep-in-atomic bug. The GFP_ATOMIC
flag allows the memory allocation to be used in atomic context.

Fixes: 9674da8759 ("NFC: Add firmware upload netlink command")
Fixes: 9ea7187c53 ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:49:23 +08:00
Duoming Zhou 154c293fc8 net: rose: fix UAF bugs caused by timer handler
upstream commit: 9cc02ede69

There are UAF bugs in rose_heartbeat_expiry(), rose_timer_expiry()
and rose_idletimer_expiry(). The root cause is that del_timer()
could not stop the timer handler that is running and the refcount
of sock is not managed properly.

One of the UAF bugs is shown below:

    (thread 1)          |        (thread 2)
                        |  rose_bind
                        |  rose_connect
                        |    rose_start_heartbeat
rose_release            |    (wait a time)
  case ROSE_STATE_0     |
  rose_destroy_socket   |  rose_heartbeat_expiry
    rose_stop_heartbeat |
    sock_put(sk)        |    ...
  sock_put(sk) // FREE  |
                        |    bh_lock_sock(sk) // USE

The sock is deallocated by sock_put() in rose_release() and
then used by bh_lock_sock() in rose_heartbeat_expiry().

Although rose_destroy_socket() calls rose_stop_heartbeat(),
it could not stop the timer that is running.

The KASAN report triggered by POC is shown below:

BUG: KASAN: use-after-free in _raw_spin_lock+0x5a/0x110
Write of size 4 at addr ffff88800ae59098 by task swapper/3/0
...
Call Trace:
 <IRQ>
 dump_stack_lvl+0xbf/0xee
 print_address_description+0x7b/0x440
 print_report+0x101/0x230
 ? irq_work_single+0xbb/0x140
 ? _raw_spin_lock+0x5a/0x110
 kasan_report+0xed/0x120
 ? _raw_spin_lock+0x5a/0x110
 kasan_check_range+0x2bd/0x2e0
 _raw_spin_lock+0x5a/0x110
 rose_heartbeat_expiry+0x39/0x370
 ? rose_start_heartbeat+0xb0/0xb0
 call_timer_fn+0x2d/0x1c0
 ? rose_start_heartbeat+0xb0/0xb0
 expire_timers+0x1f3/0x320
 __run_timers+0x3ff/0x4d0
 run_timer_softirq+0x41/0x80
 __do_softirq+0x233/0x544
 irq_exit_rcu+0x41/0xa0
 sysvec_apic_timer_interrupt+0x8c/0xb0
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:default_idle+0xb/0x10
RSP: 0018:ffffc9000012fea0 EFLAGS: 00000202
RAX: 000000000000bcae RBX: ffff888006660f00 RCX: 000000000000bcae
RDX: 0000000000000001 RSI: ffffffff843a11c0 RDI: ffffffff843a1180
RBP: dffffc0000000000 R08: dffffc0000000000 R09: ffffed100da36d46
R10: dfffe9100da36d47 R11: ffffffff83cf0950 R12: 0000000000000000
R13: 1ffff11000ccc1e0 R14: ffffffff8542af28 R15: dffffc0000000000
...
Allocated by task 146:
 __kasan_kmalloc+0xc4/0xf0
 sk_prot_alloc+0xdd/0x1a0
 sk_alloc+0x2d/0x4e0
 rose_create+0x7b/0x330
 __sock_create+0x2dd/0x640
 __sys_socket+0xc7/0x270
 __x64_sys_socket+0x71/0x80
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 152:
 kasan_set_track+0x4c/0x70
 kasan_set_free_info+0x1f/0x40
 ____kasan_slab_free+0x124/0x190
 kfree+0xd3/0x270
 __sk_destruct+0x314/0x460
 rose_release+0x2fa/0x3b0
 sock_close+0xcb/0x230
 __fput+0x2d9/0x650
 task_work_run+0xd6/0x160
 exit_to_user_mode_loop+0xc7/0xd0
 exit_to_user_mode_prepare+0x4e/0x80
 syscall_exit_to_user_mode+0x20/0x40
 do_syscall_64+0x4f/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch takes a reference on the sock whenever a timer is
started by functions such as rose_start_heartbeat(), and drops
that reference when the timer finishes or is deleted by functions
such as rose_stop_heartbeat(). As a result, the UAF bugs are
mitigated.
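
The reference pattern described above can be sketched in user space as
follows. This is a minimal illustration only, not the code from the
patch: struct mock_sock and every mock_* helper are hypothetical
stand-ins for the sock and its timer helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a socket with a reference count. */
struct mock_sock {
    int refcnt;         /* number of outstanding references */
    bool timer_pending; /* true while a timer is armed */
};

static void mock_hold(struct mock_sock *sk) { sk->refcnt++; }

static void mock_put(struct mock_sock *sk)
{
    assert(sk->refcnt > 0);
    sk->refcnt--;       /* a real sock would be freed at zero */
}

/* Arm the timer; take a reference only if it was not already armed. */
static void mock_start_timer(struct mock_sock *sk)
{
    if (!sk->timer_pending) {
        sk->timer_pending = true;
        mock_hold(sk);
    }
}

/* Cancel a pending timer and drop the reference the timer held. */
static void mock_stop_timer(struct mock_sock *sk)
{
    if (sk->timer_pending) {
        sk->timer_pending = false;
        mock_put(sk);
    }
}

/* Timer handler: the timer's own reference keeps sk valid until the put. */
static void mock_timer_expiry(struct mock_sock *sk)
{
    sk->timer_pending = false;
    /* ... work on sk would happen here ... */
    mock_put(sk);
}
```

In this model the release path can drop its own reference while the
handler is still running, and the object only reaches a refcount of
zero once the handler has dropped the timer's reference as well.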

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Tested-by: Duoming Zhou <duoming@zju.edu.cn>
Link: https://lore.kernel.org/r/20220629002640.5693-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:49:22 +08:00
Florian Westphal 7912b75a66 netfilter: nf_queue: do not allow packet truncation below transport header offset
upstream commit: 99a63d36cb

Domingo Dirutigliano and Nicola Guerrera report a kernel panic when
sending an nf_queue verdict with a 1-byte nfta_payload attribute.

The IP/IPv6 stack pulls the IP(v6) header from the packet after the
input hook.

If the user truncates the packet below the header size, this skb_pull() will
result in a malformed skb (skb->len < 0).
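
The kind of length check the subject line implies can be sketched as
below. This is an assumption-laden illustration, not the actual
nfnetlink_queue code: mock_check_payload_len() and its parameters are
hypothetical names.

```c
#include <errno.h>

/*
 * Hypothetical helper: refuse a replacement payload that would cut the
 * packet short of its transport header offset.
 */
static int mock_check_payload_len(int data_len, int transport_offset)
{
    if (data_len < transport_offset)
        return -EINVAL; /* truncation would eat into the IP header */
    return 0;
}
```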

Fixes: 7af4cc3fa1 ("[NETFILTER]: Add "nfnetlink_queue" netfilter queue handler over nfnetlink")
Reported-by: Domingo Dirutigliano <pwnzer0tt1@proton.me>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:49:22 +08:00
Yuehong Wu 1768198a0c config: add DRM_AST hyperV,network and DRM modules
Some images are being built using the 0009-kabi branch and are expected
to run in some virtualization environments, and might be used for
desktops. So enable the related drivers. More testing is needed, though.

config: enable more commonly used DRM drivers
config: enable CONFIG_DRM_AST
config: enable hyperv related configs
config: enable CONFIG_IGC

Signed-off-by: Yuehong Wu <yuehongwu@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:49:15 +08:00
sumiyawang a0fc351741 driver: update hisilicon hardware crypto engine
Update the HiSilicon crypto driver to 1.3.11 and autoprobe
the modules when the hardware is enabled.

Signed-off-by: sumiyawang <sumiyawang@tencent.com>
2024-06-11 20:47:35 +08:00
Jann Horn fa3c010699 net: usb: ax88179_178a: Fix out-of-bounds accesses in RX fixup
commit 57bc3d3ae8 upstream.

ax88179_rx_fixup() contains several out-of-bounds accesses that can be
triggered by a malicious (or defective) USB device, in particular:

 - The metadata array (hdr_off..hdr_off+2*pkt_cnt) can be out of bounds,
   causing OOB reads and (on big-endian systems) OOB endianness flips.
 - A packet can overlap the metadata array, causing a later OOB
   endianness flip to corrupt data used by a cloned SKB that has already
   been handed off into the network stack.
 - A packet SKB can be constructed whose tail is far beyond its end,
   causing out-of-bounds heap data to be considered part of the SKB's
   data.

I have tested that this can be used by a malicious USB device to send a
bogus ICMPv6 Echo Request and receive an ICMPv6 Echo Reply in response
that contains random kernel heap data.
It's probably also possible to get OOB writes from this on a
little-endian system somehow - maybe by triggering skb_cow() via IP
options processing -, but I haven't tested that.

Fixes: e2ca90c276 ("ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-06-11 20:44:41 +08:00
Jianping Liu b5c25d0440 smp: fix slave core boot fail on altramax platform
Retry for another 5000ms, with IRQs disabled for 5ms at a time, to fix
slave core boot failure on the altramax platform.
The Ampere altramax platform has 256 CPU cores across multiple nodes.
When CONFIG_HZ >= 250, ticks are generated too frequently, which
causes slave core boot to fail (an Ampere CPU bug). cpu0's IRQs need
to be disabled for >= 5ms each time, which reduces IRQ activity.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
2024-06-11 20:44:41 +08:00
Bhupesh Sharma c2671630ad arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo
commit bbdbc11804 upstream.

TCR_EL1.TxSZ, which controls the VA space size, is configured by a
single kernel image to support either 48-bit or 52-bit VA space.

If the ARMv8.2-LVA optional feature is present and we are running
with a 64KB page size, then it is possible to use 52-bits of address
space for both userspace and kernel addresses. However, any kernel
binary that supports 52-bit must also be able to fall back to 48-bit
at early boot time if the hardware feature is not present.

Since TCR_EL1.T1SZ indicates the size of the memory region addressed by
TTBR1_EL1, export the same in vmcoreinfo. User-space utilities like
makedumpfile and crash-utility need to read this value from vmcoreinfo
for determining if a virtual address lies in the linear map range.

While at it also add documentation for TCR_EL1.T1SZ variable being
added to vmcoreinfo.

It indicates the size offset of the memory region addressed by
TTBR1_EL1.
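
As a rough illustration of how a user-space tool might use the exported
value: the region addressed via TTBR1_EL1 spans 2^(64 - T1SZ) bytes, so
the number of kernel VA bits follows directly from T1SZ. The helper
name below is hypothetical and is not taken from makedumpfile or
crash-utility.

```c
/*
 * Hypothetical helper: derive the kernel VA width from TCR_EL1.T1SZ.
 * T1SZ = 16 corresponds to a 48-bit VA space, T1SZ = 12 to 52-bit.
 */
static unsigned int mock_va_bits_from_t1sz(unsigned int t1sz)
{
    return 64u - t1sz;
}
```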

Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: John Donnelly <john.p.donnelly@oracle.com>
Tested-by: Kamlakant Patel <kamlakantp@marvell.com>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kexec@lists.infradead.org
Link: https://lore.kernel.org/r/1589395957-24628-3-git-send-email-bhsharma@redhat.com
[catalin.marinas@arm.com: removed vabits_actual from the commit log]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:44:41 +08:00
Maciej Fijalkowski b675b19f3b veth: Implement ethtool's get_channels() callback
[upstream commit 34829eec3b]

Libbpf's xsk part calls get_channels() API to retrieve the queue count
of the underlying driver so that XSKMAP is sized accordingly.

Implement that in veth so multi queue scenarios can work properly.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-14-maciej.fijalkowski@intel.com
Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:44:40 +08:00
Xin Long d812f3ef2e xfrm: add prep for esp beet mode offload
Like __xfrm_transport/mode_tunnel_prep(), this patch is to add
__xfrm_mode_beet_prep() to fix the transport_header for gso
segments, and reset skb mac_len, and pull skb data to the
proto inside esp.

This patch also fixes a panic, reported by ltp:

  # modprobe esp4_offload
  # runltp -f net_stress.ipsec_tcp

  [ 2452.780511] kernel BUG at net/core/skbuff.c:109!
  [ 2452.799851] Call Trace:
  [ 2452.800298]  <IRQ>
  [ 2452.800705]  skb_push.cold.98+0x14/0x20
  [ 2452.801396]  esp_xmit+0x17b/0x270 [esp4_offload]
  [ 2452.802799]  validate_xmit_xfrm+0x22f/0x2e0
  [ 2452.804285]  __dev_queue_xmit+0x589/0x910
  [ 2452.806264]  __neigh_update+0x3d7/0xa50
  [ 2452.806958]  arp_process+0x259/0x810
  [ 2452.807589]  arp_rcv+0x18a/0x1c

It was caused by the skb going to esp_xmit with a wrong transport
header.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-06-11 20:44:40 +08:00
Alexander Mikhalitsyn 0ba944e1bd shm: extend forced shm destroy to support objects from several IPC nses
commit 85b6d24646 upstream.

Currently, the exit_shm() function is not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).

This is an attempt to fix the problem by extending exit_shm mechanism to
handle shm's destroy from several IPC ns'es.

To achieve that we do several things:

1. add a namespace (non-refcounted) pointer to the struct shmid_kernel

2. during new shm object creation (newseg()/shmget syscall) we
   initialize this pointer by current task IPC ns

3. exit_shm() is fully reworked such that it traverses all shp's in
   task->sysvshm.shm_clist and gets the IPC namespace not from the current
   task, as it did before, but from the shp object itself, and then calls
   shm_destroy(shp, ns).

Note: We need to be really careful here, because as was said before
(1), our pointer to the IPC ns is non-refcounted.  To be on the safe side
we use the special helper get_ipc_ns_not_zero(), which takes a reference
on the IPC ns only if the IPC ns is not in the "state of destruction".

Q/A

Q: Why can we access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorter than IPC namespace
   lifetime, so, if we get shp object from the task->sysvshm.shm_clist
   while holding task_lock(task) nobody can steal our namespace.

Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: No. It just fixes an uncovered case where a process may leave its IPC
   namespace without getting the task->sysvshm.shm_clist list cleaned up.

Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
Fixes: ab602f7991 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:44:39 +08:00
Haisu Wang cee4b3596d block: fix the incorrect spin_lock_irq to spin_lock
The process already runs with IRQs disabled, so spin_lock should be
used instead of spin_lock_irq; otherwise spin_unlock_irq may enable
IRQs at the wrong stage.

   Call Trace:
    _raw_spin_lock_irq+0x20/0x24
    blkcg_print_blkgs+0x4f/0xe0
    blkg_print_stat_bytes+0x44/0x50
    cgroup_seqfile_show+0x4c/0xb0
    kernfs_seq_show+0x21/0x30
    seq_read+0x14c/0x3f0
    kernfs_fop_read+0x35/0x190
    __vfs_read+0x18/0x40
    vfs_read+0x99/0x160
    ksys_read+0x61/0xe0
    __x64_sys_read+0x1a/0x20
    do_syscall_64+0x47/0x140
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
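
The hazard can be modeled in user space with a single flag for the
local IRQ state. This is only a sketch: all mock_* names are
hypothetical and no real locking is performed.

```c
#include <stdbool.h>

/* Hypothetical model: one flag for the CPU's local IRQ state. */
static bool mock_irqs_enabled = true;

static void mock_local_irq_disable(void) { mock_irqs_enabled = false; }
static void mock_local_irq_enable(void)  { mock_irqs_enabled = true; }

/* The _irq variants toggle the IRQ state unconditionally. */
static void mock_spin_lock_irq(void)   { mock_local_irq_disable(); }
static void mock_spin_unlock_irq(void) { mock_local_irq_enable(); }

/* The plain variants leave the IRQ state alone. */
static void mock_spin_lock(void)   { }
static void mock_spin_unlock(void) { }

/* Returns the IRQ state the caller sees after the inner unlock. */
static bool mock_inner_section(bool use_irq_variant)
{
    mock_local_irq_disable();       /* caller already disabled IRQs */
    if (use_irq_variant) {
        mock_spin_lock_irq();
        mock_spin_unlock_irq();     /* re-enables IRQs too early */
    } else {
        mock_spin_lock();
        mock_spin_unlock();         /* IRQ state is left untouched */
    }
    return mock_irqs_enabled;
}
```

With the _irq variant the caller comes back with IRQs enabled even
though it had disabled them; with the plain variant they stay disabled.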

Fixes: f2519e1ed9a16 ("blkcg: add per blkcg diskstats")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Honglin Li <honglinli@tencent.com>
2024-06-11 20:44:39 +08:00
shookliu 904baaf92b md/raid10: avoid deadlock on recovery.
When disk failure happens and the array has a spare drive, resync thread
kicks in and starts to refill the spare. However it may get blocked by
a retry thread that resubmits failed IO to a mirror and itself can get
blocked on a barrier raised by the resync thread.

upstream commit id:fe630de009d0729584d79c78f43121e07c745fdc

Acked-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: shookliu <shookliu@tencent.com>
2024-06-11 20:44:39 +08:00
Kairui Song d1c25caef9 x86/mpparse, kexec: switch apic driver early if x2apic is pre-enabled
The following kernel panic is observed when doing kexec/kdump on machines
that use mptable and support x2apic:

[    0.010090] Intel MultiProcessor Specification v1.4
[    0.010688] MPTABLE: OEM ID: BOCHSCPU
[    0.010886] MPTABLE: Product ID: 0.1
[    0.011119] MPTABLE: APIC at: 0xFEE00000
[    0.011332] BUG: unable to handle page fault for address: ffffffffff5fc020
[    0.011702] #PF: supervisor read access in kernel mode
[    0.011981] #PF: error_code(0x0000) - not-present page
[    0.012256] PGD 25e15067 P4D 25e15067 PUD 25e17067 PMD 25e18067 PTE 0
[    0.012603] Oops: 0000 [#1] SMP NOPTI
[    0.012801] CPU: 0 PID: 0 Comm: swapper Not tainted 5.14.10-300.fc35.x86_64 #1
[    0.013189] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[    0.013658] RIP: 0010:native_apic_mem_read+0x2/0x10
[    0.013924] Code: 14 25 20 cd e3 82 c3 90 bf 30 08 00 00 ff 14 25 18 cd e3 82 c3 cc cc cc 89 ff 89 b7 00 c0 5f ff c3 0f 1f 80 00 00 00 00 89 ff <8b> 87 00 c0 5f ff c3 0f 1f 80 00 00 00 0
[    0.014930] RSP: 0000:ffffffff82e03e18 EFLAGS: 00010046
[    0.015211] RAX: ffffffff81064840 RBX: ffffffffff240b6c RCX: ffffffff82f17428
[    0.015593] RDX: c0000000ffffdfff RSI: 00000000ffffdfff RDI: 0000000000000020
[    0.015977] RBP: ffff888023200000 R08: 0000000000000000 R09: ffffffff82e03c50
[    0.016385] R10: ffffffff82e03c48 R11: ffffffff82f47468 R12: ffffffffff240b40
[    0.016768] R13: ffffffffff200b30 R14: 0000000000000000 R15: 00000000000000d4
[    0.017155] FS:  0000000000000000(0000) GS:ffffffff8365b000(0000) knlGS:0000000000000000
[    0.017589] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.017899] CR2: ffffffffff5fc020 CR3: 0000000025e10000 CR4: 00000000000006b0
[    0.018284] Call Trace:
[    0.018417]  ? read_apic_id+0x15/0x30
[    0.018616]  ? register_lapic_address+0x76/0x97
[    0.018864]  ? default_get_smp_config+0x28b/0x42d
[    0.019119]  ? dmi_check_system+0x1c/0x60
[    0.019337]  ? acpi_boot_init+0x1d/0x4c3
[    0.019550]  ? setup_arch+0xb37/0xc2a
[    0.019749]  ? slab_is_available+0x5/0x10
[    0.019969]  ? start_kernel+0x61/0x980
[    0.020173]  ? load_ucode_bsp+0x4c/0xcd
[    0.020380]  ? secondary_startup_64_no_verify+0xc2/0xcb
[    0.020664] Modules linked in:
[    0.020830] CR2: ffffffffff5fc020
[    0.021012] random: get_random_bytes called from oops_exit+0x35/0x60 with crng_init=0
[    0.021015] ---[ end trace c9e569df3bdbefd3 ]---

Checking the init order, we have:

setup_arch()
  check_x2apic()     <-- x2apic is enabled by the first kernel before kexec;
                         this sets x2apic_mode = 1, so that later probes
                         will recognize the pre-enabled x2apic.
  ....
  acpi_boot_init();  <-- With ACPI MADT, this will switch apic driver
                         to x2apic, but it will do nothing with mptable.
  x86_dtb_init();
  get_smp_config();
    default_get_smp_config();
      check_physptr();
        smp_read_mpc();
          register_lapic_address(); <-- panic here
  init_apic_mappings();
  ....

The problem here is that mpparse needs to read some boot info from the
apic, so it calls register_lapic_address() early. But without MADT, the
apic driver is still apic_flat, and it attempts to use the MMIO interface,
which has never been mapped since commit 0450193bff ("x86, x2apic: Don't
map lapic addr for preenabled x2apic systems").

Simply mapping it won't work either, as in x2apic mode the MMIO interface
is not really available (Intel SDM Volume 3A 10.12.2), and later code will
fail with other errors. So here we do the apic driver probe early.
With pre-enabled x2apic, the probe will recognize it and switch to
the right driver just fine.

This issue is currently only seen with kexec/kdump, where the first
kernel enables x2apic and keeps it enabled for the second kernel.

This can be easily reproduced with qemu, use -no-acpi and enable x2apic.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-06-11 20:44:38 +08:00
Liuchun 48bc119a5a cpuhotplug: reject core0 offline by default
Core 0 of some server models with ARM architecture cannot be taken
offline, so it is rejected by default.

Signed-off-by: Chun Liu <kaicliu@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:44:38 +08:00
mayercheng 8769f0840e driver: update megaraid_sas to 07.721.02.00
Signed-off-by: mayercheng <mayercheng@tencent.com>
2024-06-11 20:44:38 +08:00
Xin Long f34f3b4e56 sctp: use call_rcu to free endpoint
[ Upstream commit 5ec7d18d18 ]

This patch is to delay the endpoint free by calling call_rcu() to fix
another use-after-free issue in sctp_sock_dump():

  BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
  Call Trace:
    __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
    lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
    __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
    _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
    spin_lock_bh include/linux/spinlock.h:334 [inline]
    __lock_sock+0x203/0x350 net/core/sock.c:2253
    lock_sock_nested+0xfe/0x120 net/core/sock.c:2774
    lock_sock include/net/sock.h:1492 [inline]
    sctp_sock_dump+0x122/0xb20 net/sctp/diag.c:324
    sctp_for_each_transport+0x2b5/0x370 net/sctp/socket.c:5091
    sctp_diag_dump+0x3ac/0x660 net/sctp/diag.c:527
    __inet_diag_dump+0xa8/0x140 net/ipv4/inet_diag.c:1049
    inet_diag_dump+0x9b/0x110 net/ipv4/inet_diag.c:1065
    netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244
    __netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352
    netlink_dump_start include/linux/netlink.h:216 [inline]
    inet_diag_handler_cmd+0x2ce/0x3f0 net/ipv4/inet_diag.c:1170
    __sock_diag_cmd net/core/sock_diag.c:232 [inline]
    sock_diag_rcv_msg+0x31d/0x410 net/core/sock_diag.c:263
    netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477
    sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:274

This issue occurs when asoc is peeled off and the old sk is freed after
getting it by asoc->base.sk and before calling lock_sock(sk).

To prevent the sk free, as a holder of the sk, ep should be alive when
calling lock_sock(). This patch uses call_rcu() and moves sock_put and
ep free into sctp_endpoint_destroy_rcu(), so that it's safe to try to
hold the ep under rcu_read_lock in sctp_transport_traverse_process().

If sctp_endpoint_hold() returns true, it means this ep is still alive
and we have held it and can continue to dump it; If it returns false,
it means this ep is dead and can be freed after rcu_read_unlock, and
we should skip it.
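
The "hold only if still alive" semantics described above can be
sketched with C11 atomics. This is a user-space illustration under
stated assumptions, not the sctp implementation: struct mock_ep and
mock_ep_hold() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an endpoint; only the refcount is modeled. */
struct mock_ep {
    atomic_int refcnt;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true when the reference was taken, false when the object
 * is already dead and must be skipped by the caller.
 */
static bool mock_ep_hold(struct mock_ep *ep)
{
    int old = atomic_load(&ep->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&ep->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```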

In sctp_sock_dump(), after locking the sk, if this ep is different from
tsp->asoc->ep, it means during this dumping, this asoc was peeled off
before calling lock_sock(), and the sk should be skipped; If this ep is
the same as tsp->asoc->ep, it means no peeloff happens on this asoc,
and due to lock_sock, no peeloff will happen either until release_sock.

Note that delaying endpoint free won't delay the port release, as the
port release happens in sctp_endpoint_destroy() before calling call_rcu().
Also, freeing endpoint by call_rcu() makes it safe to access the sk by
asoc->base.sk in sctp_assocs_seq_show() and sctp_rcv().

Thanks to Jones for bringing this issue up.

v1->v2:
  - improve the changelog.
  - add kfree(ep) into sctp_endpoint_destroy_rcu(), as Jakub noticed.

Reported-by: syzbot+9276d76e83e3bcde6c99@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Fixes: d25adbeb0c ("sctp: fix an use-after-free issue in sctp_sock_dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>
2024-06-11 20:44:37 +08:00
Hangyu Hua 014df595cb phonet: refcount leak in pep_sock_accep
commit bcd0f93353 upstream.

sock_hold(sk) is invoked in pep_sock_accept(), but __sock_put(sk) is not
invoked in subsequent failure branches (pep_accept_conn() != 0).

Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Link: https://lore.kernel.org/r/20211209082839.33985-1-hbh25y@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Aayush Agarwal <aayush.a.agarwal@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>
2024-06-11 20:44:37 +08:00
Pablo Neira Ayuso f4f8bcc4f1 netfilter: nf_tables: disallow non-stateful expression in sets earlier
commit 520778042c upstream.

Since 3e135cd499 ("netfilter: nft_dynset: dynamic stateful expression
instantiation"), it is possible to attach stateful expressions to set
elements.

cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate
and destroy phase") introduces conditional destruction on the object to
accommodate transaction semantics.

nft_expr_init() calls expr->ops->init() first, then checks for
NFT_STATEFUL_EXPR. This still allows initializing a non-stateful
lookup expression that points to a set, which might lead to UAF since
the set is not properly detached from the set->binding in this case.
Anyway, this combination is nonsense from the nf_tables perspective.

This patch fixes this problem by checking for NFT_STATEFUL_EXPR before
expr->ops->init() is called.
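
The ordering change can be illustrated with a small user-space model.
This is a sketch, not the nf_tables code: MOCK_EXPR_STATEFUL stands in
for the stateful flag named above, and the struct, counter and function
names are hypothetical.

```c
#include <errno.h>

/* Hypothetical flag standing in for the stateful-expression flag. */
#define MOCK_EXPR_STATEFUL 0x1u

struct mock_expr_type {
    unsigned int flags;
    int (*init)(void);
};

/* Counts init() calls so the ordering can be observed. */
static int mock_init_calls;

static int mock_init(void)
{
    mock_init_calls++;
    return 0;
}

static int mock_set_expr_alloc(const struct mock_expr_type *type)
{
    /* Reject first, so a refused expression never runs init(). */
    if (!(type->flags & MOCK_EXPR_STATEFUL))
        return -EOPNOTSUPP;

    return type->init();
}
```

Because the check comes first, a rejected expression has no side
effects that would need to be undone.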

The reporter provides a KASAN splat and a poc reproducer (similar to
those autogenerated by syzbot to report use-after-free errors). It is
unknown to me if they are using syzbot or if they use a similar automated
tool to locate the bug that they are reporting.

For the record, this is the KASAN splat.

[   85.431824] ==================================================================
[   85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20
[   85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776
[   85.434756]
[   85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G        W         5.18.0+ #2
[   85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014

Fixes: 0b2d8a7b63 ("netfilter: nf_tables: add helper functions for expression handling")
Reported-and-tested-by: Aaron Adams <edg-e@nccgroup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[Ajay: Regenerated the patch for v5.4.y]
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>
2024-06-11 20:44:37 +08:00