Upstream: c273a2bd8a
Link: 887436bdb7
commit c273a2bd8a
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date: Mon Dec 9 12:03:44 2019 +0900
libfdt: include fdt_addresses.c
The kexec_file_load-based kdump implementation for arm64 needs
fdt_appendprop_addrrange().
So include fdt_addresses.c when building libfdt.
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Yi Li <adamliyi@msn.com>
Link: 696027f109
The patch b2da6ad294
(arm64: kdump: reimplement crashkernel=X) depends on commit 1a8e1cef76
("arm64: use both ZONE_DMA and ZONE_DMA32").
Commit 1a8e1cef76 has not been ported to the 5.4 kernel, so use
arm64_dma_phys_limit instead.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 023deaec32
For arm64, the behavior of crashkernel=X has changed: it now tries a low
allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is
disabled) and falls back to a high allocation if that fails.
We can also use "crashkernel=X,high" to select a high region above the
DMA zone, which also tries to allocate at least 256M of low memory in the
DMA zone automatically (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
"crashkernel=Y,low" can be used to set the size of the low memory
reservation.
So update the Documentation.
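The selection policy described above can be modelled in user space. This is only a rough sketch, not the kernel code: the two free-memory pools, struct mem_model, struct crash_res and reserve_sketch() are hypothetical names. It also assumes that the fallback path reserves the extra low region too, and that a failed low reservation undoes the high one.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_LOW_SIZE (256ULL << 20)   /* 256M, as stated in the text */

enum ck_mode { CK_PLAIN, CK_HIGH };       /* crashkernel=X / crashkernel=X,high */

/* Toy model: free bytes in the low (DMA/DMA32) pool and in the high pool. */
struct mem_model { uint64_t low_free; uint64_t high_free; };

struct crash_res { bool ok; bool high; uint64_t main_size; uint64_t low_size; };

/* Take size bytes from a pool; fails on zero size or insufficient space. */
static bool take(uint64_t *pool, uint64_t size)
{
    if (size == 0 || size > *pool)
        return false;
    *pool -= size;
    return true;
}

/* low_size == 0 means "no crashkernel=Y,low given": use the 256M default. */
static struct crash_res reserve_sketch(struct mem_model *m, enum ck_mode mode,
                                       uint64_t size, uint64_t low_size)
{
    struct crash_res r = {0};

    /* crashkernel=X: try the low pool first. */
    if (mode == CK_PLAIN && take(&m->low_free, size)) {
        r.ok = true;
        r.main_size = size;
        return r;
    }

    /* crashkernel=X,high, or the fallback after a failed low attempt. */
    if (!take(&m->high_free, size))
        return r;

    /* A high region still needs some low memory for DMA-capable devices. */
    if (low_size == 0)
        low_size = DEFAULT_LOW_SIZE;
    if (!take(&m->low_free, low_size)) {
        m->high_free += size;             /* assumption: undo the high part */
        return r;
    }

    r.ok = true;
    r.high = true;
    r.main_size = size;
    r.low_size = low_size;
    return r;
}
```

With 300M of low memory free, a request for crashkernel=1G in this model ends up in the high pool with a 256M low region alongside it.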
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 2012a3b392
When the crashkernel is reserved in high memory, some low memory is
reserved for crash dump kernel devices and is never mapped by the first
kernel. This memory range is advertised to the crash dump kernel via a DT
property under /chosen,
linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
We reuse the DT property linux,usable-memory-range and make the low
memory region the second range "BASE2 SIZE2", which keeps compatibility
with existing user-space and older kdump kernels.
The crash dump kernel reads this property at boot time and calls
memblock_add() to add the low memory region after
memblock_cap_memory_range() has been called.
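For illustration only, decoding such a property could look like the sketch below. It assumes two address cells and two size cells, so each BASE and SIZE is one 64-bit big-endian value. struct mem_range and parse_usable_ranges() are hypothetical names, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

struct mem_range { uint64_t base; uint64_t size; };

/* Read one big-endian 64-bit value, as stored in a flattened device tree. */
static uint64_t read_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Decode up to max_ranges (BASE, SIZE) pairs from a property of len bytes.
 * Returns the number of ranges decoded, or -1 if len is empty or is not a
 * whole number of pairs.
 */
static int parse_usable_ranges(const unsigned char *prop, size_t len,
                               struct mem_range *out, int max_ranges)
{
    const size_t pair = 16;               /* 8 bytes base + 8 bytes size */
    int n = 0;

    if (len == 0 || len % pair != 0)
        return -1;

    while ((size_t)n * pair < len && n < max_ranges) {
        out[n].base = read_be64(prop + (size_t)n * pair);
        out[n].size = read_be64(prop + (size_t)n * pair + 8);
        n++;
    }
    return n;
}
```

A reader that only understands one range can pass max_ranges = 1 and still get BASE1 SIZE1, which matches the compatibility argument made above.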
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: c8013ee6cd
Make the functions reserve_crashkernel[_low]() generic for x86 and
arm64. Since the reserve_crashkernel[_low]() implementations are quite
similar on other architectures as well, more users can be added
later.
So add CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL to arch/Kconfig and
select it from X86 and ARM64.
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 70e586365f
The arm64 kdump currently has the following issues:
1. We use crashkernel=X to reserve the crashkernel below 4G, which
fails when there is not enough low memory.
2. If the crashkernel is reserved above 4G, the crash dump kernel
fails to boot because there is no low memory available for
allocation.
3. Since commit 1a8e1cef76 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
devices in the crash dump kernel that need ZONE_DMA fail to
allocate memory.
To solve these issues, change the behavior of crashkernel=X and
introduce crashkernel=X,[high,low]. crashkernel=X tries a low allocation
in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is disabled) and
falls back to a high allocation if that fails.
We can also use "crashkernel=X,high" to select a region above the DMA
zone, which also tries to allocate at least 256M in the DMA zone
automatically (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
"crashkernel=Y,low" can be used to set the size of the low memory
reservation.
Another minor change: since two regions may now be reserved for the
crash dump kernel, rename the low region to "Crash kernel (low)" to
distinguish it from the high region without affecting existing
kexec-tools.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 667118f8c1
Introduce the macros CRASH_ALIGN for alignment, CRASH_ADDR_LOW_MAX for
the upper bound of low crash memory, and CRASH_ADDR_HIGH_MAX for the
upper bound of high crash memory, and use them instead of hard-coded
values.
Besides, to stay consistent with x86, use CRASH_ALIGN as the lower bound
of the crash kernel reservation.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: b332ab8970
Move macro vmcore_elf_check_arch_cross from arch/x86/include/asm/kexec.h
to arch/x86/include/asm/elf.h to fix the following compiling warning:
In file included from arch/x86/kernel/setup.c:39:0:
./arch/x86/include/asm/kexec.h:77:0: warning: "vmcore_elf_check_arch_cross" redefined
# define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
In file included from arch/x86/kernel/setup.c:9:0:
./include/linux/crash_dump.h:39:0: note: this is the location of the previous definition
#define vmcore_elf_check_arch_cross(x) 0
The root cause is that vmcore_elf_check_arch_cross under CONFIG_CRASH_CORE
depends on CONFIG_KEXEC_CORE. Commit 532b66d2279d ("x86: kdump: move
reserve_crashkernel[_low]() into crash_core.c") triggered the issue.
As suggested by Mike, simply move vmcore_elf_check_arch_cross from
arch/x86/include/asm/kexec.h to arch/x86/include/asm/elf.h to fix
the warning.
Fixes: 532b66d2279d ("x86: kdump: move reserve_crashkernel[_low]() into crash_core.c")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8cb8686864
Make the functions reserve_crashkernel[_low]() as generic.
Arm64 will use these to reimplement crashkernel=X.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8ec4a816f2
We will make the function reserve_crashkernel() generic. The
xen_pv_domain() check in reserve_crashkernel() is relevant only to
x86, as is insert_resource() in reserve_crashkernel[_low]().
So move the xen_pv_domain() check and insert_resource() to setup_arch()
to keep them in x86.
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: a2e0b4351d
To make the functions reserve_crashkernel() as generic,
replace some hard-coded numbers with macro CRASH_ADDR_LOW_MAX.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8882ba540e
The lower bounds of the crash kernel reservation and the crash kernel low
reservation are different; use the consistent value CRASH_ALIGN for both.
Suggested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 873384fe79
Move CRASH_ALIGN to the header asm/kexec.h for later use. Besides, the
alignment of crash kernel regions in x86 is 16M (CRASH_ALIGN), but
the function reserve_crashkernel() also used a 1M alignment. So just
replace the hard-coded 1M alignment with the macro CRASH_ALIGN.
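As a small illustration of why the two alignments are not interchangeable, the sketch below rounds addresses up to a power-of-two boundary. CRASH_ALIGN_SKETCH, align_up() and is_aligned() are stand-in names for this example, with the 16M value taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1M              (1ULL << 20)
#define CRASH_ALIGN_SKETCH (16ULL << 20)  /* 16M, the x86 value named above */

/* Round x up to the next multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* True if x is a multiple of a; a must be a power of two. */
static bool is_aligned(uint64_t x, uint64_t a)
{
    return (x & (a - 1)) == 0;
}
```

A base of 0x100000 is 1M-aligned but not 16M-aligned; rounding it up with the 16M alignment gives 0x1000000.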
Suggested-by: Dave Young <dyoung@redhat.com>
Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
This conflicts with upstream's kdump high reservation support, and we
already have CONFIG_ZONE_DMA32 set, so we have:
ARCH_LOW_ADDRESS_LIMIT = min(offset + (1ULL << 32), memblock_end_of_DRAM());
This already limits the address to below 4G, so the hard-coded limit is
redundant.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
This reverts commit 918f50807eccd63d482ef4cf778b1d2b416770a9.
That commit forces COW into write mode, which forces COW breaking and
greatly increases page usage. Upstream, commit 376a34efa ("mm/gup:
refactor and de-duplicate gup_fast() code") provides another way to fix the
fork security issue of COW, and the buggy commit was then reverted by
commit a308c71bf1 ("mm/gup: Remove enfornced COW mechanism").
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Remove the function as the last reference has gone away with the do_wp_page()
changes.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1a0cf26323)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
commit 09854ba94c upstream
How about we just make sure we're the only possible valid user of the
page before we bother to reuse it?
Simplify, simplify, simplify.
And get rid of the nasty serialization on the page lock at the same time.
[peterx: add subject prefix]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 09854ba94c)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Conflicts:
mm/memory.c
Enable CONFIG_IOSCHED_BFQ and CONFIG_BFQ_GROUP_IOSCHED for ARM to
support the bfq io-scheduler.
Signed-off-by: Yuehong Wu <yuehongwu@tencent.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
[upstream commit 0550cfe8c2]
secid_to_secctx is not stackable, and since the BPF LSM registers this
hook by default, the call_int_hook logic, which "bails-on-fail", is not
suitable: it causes issues when other LSMs register this hook and
eventually breaks Audit.
In order to fix this, directly iterate over the security hooks instead
of using call_int_hook as suggested in:
https://lore.kernel.org/bpf/9d0eb6c6-803a-ff3a-5603-9ad6d9edfc00@schaufler-ca.com/#t
Fixes: 98e828a065 ("security: Refactor declaration of LSM hooks")
Fixes: 625236ba38 ("security: Fix the default value of secid_to_secctx hook")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200520125616.193765-1-kpsingh@chromium.org
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Add a fixup in fast_copy_page. This feature is disabled by default;
set vm.fast_copy_page_enabled to enable it.
Signed-off-by: soonflywang <soonflywang@tencent.com>
Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
Arm server CPUs usually have the NEON/VFP extension. This patch
leverages SIMD instructions to speed up the current copy_page().
Signed-off-by: soonflywang <soonflywang@tencent.com>
Signed-off-by: Chengdong Li <chengdongli@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
There could be a use-after-free in dmi_sysfs_register_handle.
While handling specializations, entry->child may be freed if an
error occurs, yet kobject_put() is still called on it afterwards.
So set entry->child to NULL to avoid this case.
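The pattern can be sketched in user space as follows. This is not the dmi-sysfs code: struct entry, register_child() and child_put() are hypothetical stand-ins, with child_put() mimicking only the fact that kobject_put() ignores a NULL pointer.

```c
#include <stdlib.h>

struct child { int dummy; };
struct entry { struct child *child; };

static int put_calls;   /* counts real (non-NULL) puts, for the example only */

/* Stand-in for kobject_put(): a NULL pointer is ignored. */
static void child_put(struct child *c)
{
    if (!c)
        return;
    put_calls++;
    free(c);
}

/* Allocate the child; on a simulated error, free it and clear the pointer. */
static int register_child(struct entry *e, int simulate_error)
{
    e->child = malloc(sizeof(*e->child));
    if (!e->child)
        return -1;

    if (simulate_error) {
        free(e->child);
        e->child = NULL;   /* the fix: leave no dangling pointer behind */
        return -1;
    }
    return 0;
}

/* Common cleanup path, run whether or not registration succeeded. */
static void entry_release(struct entry *e)
{
    child_put(e->child);
    e->child = NULL;
}
```

Without the NULL assignment in the error branch, entry_release() would hand an already freed pointer to child_put().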
Reported-by: loydlv <loydlv@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
If the opt is nonblocking, data may be freed while the transmit has
not yet completed. In this case, the regular free could lead
to a double free. So add the return value '-EPERM' to mark the
above case.
Reported-by: loydlv <loydlv@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
commit 52762efa2b upstream.
In displback_changed(), there is the call chain
displback_connect(front_info)->xen_drm_drv_init(front_info).
drm_info is assigned to front_info->drm_info and is then freed in the
fail branch of xen_drm_drv_init().
Later, displback_disconnect(front_info) is called and it calls
xen_drm_drv_fini(front_info), which causes a use after free through the
drm_info = front_info->drm_info statement.
This patch does two things. First, it fixes the fail label, which
still freed drm_info even when drm_info = kzalloc() had failed.
Second, it sets front_info->drm_info to NULL to avoid the UAF.
Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210323014656.10068-1-lyl2019@mail.ustc.edu.cn
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reviewed-by: Robinlai <robinlai@tencent.com>
upstream commit: 7c496de538
i225 devices have only one PHY vendor. There is no point checking
_I_PHY_ID during the link establishment and auto-negotiation process.
This patch comes to clean up these pointless checkings.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: jackjunliu <jackjunliu@tencent.com>
With CONFIG_CONFIGFS_FS=m, the default cpu cgroup has user.slice, which
slows down the unixbench Pipe-based Context Switching score from 39660
to 19626.
Signed-off-by: Ni Xun <richardni@tencent.com>
[tapd]
ID877978657
This configuration resulted in a 15% regression on
unixbench's execl testing.
This additional enhancement can be turned on with
rodata=full after this patch.
Signed-off-by: johnnyaiai <johnnyaiai@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
upstream commit: f85daf0e72
xfrm_policy_lookup() calls xfrm_pol_hold_rcu() to take a reference on
pols[0]. This reference can be dropped in xfrm_expand_policies() when
xfrm_expand_policies() returns an error, at which point the refcount of
pols[0] is balanced. But xfrm_bundle_lookup() also calls xfrm_pols_put()
with num_pols == 1 to drop this reference when xfrm_expand_policies()
returns an error, so it is dropped twice.
This patch also fixes an illegal address access. pols[0] holds an error
pointer when xfrm_policy_lookup() fails. This leads xfrm_pols_put() to
dereference an illegal address in xfrm_bundle_lookup()'s error path.
Fix these by setting num_pols = 0 in xfrm_expand_policies()'s error path.
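The effect of clearing num_pols can be shown with a toy model. The structures and functions below are simplified stand-ins invented for this sketch, not the xfrm code; only the idea of zeroing the count in the callee's error path comes from the text.

```c
struct policy { int refcnt; };

/* Drop one reference on each of the first n policies. */
static void pols_put(struct policy **pols, int n)
{
    for (int i = 0; i < n; i++)
        pols[i]->refcnt--;
}

/*
 * Stand-in for the expand step: on error it drops the references itself
 * and reports that no policies remain, so the caller has nothing to put.
 */
static int expand_policies(struct policy **pols, int *num_pols, int fail)
{
    if (fail) {
        pols_put(pols, *num_pols);
        *num_pols = 0;          /* the fix: the caller must not put again */
        return -1;
    }
    return 0;
}

/* Stand-in for the caller: take a reference, expand, then clean up. */
static int bundle_lookup(struct policy *p, int fail)
{
    struct policy *pols[1] = { p };
    int num_pols = 1;
    int err;

    p->refcnt++;                /* reference taken by the lookup */
    err = expand_policies(pols, &num_pols, fail);
    pols_put(pols, num_pols);   /* puts nothing if expand already did */
    return err;
}
```

If expand_policies() left num_pols at 1 on failure, the final pols_put() would drop the same reference a second time.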
Fixes: 80c802f307 ("xfrm: cache bundles instead of policies for outgoing flows")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
upstream commit: 4071bf121d
There is a sleep-in-atomic bug that could cause a kernel panic during
the firmware download process. The root cause is that nlmsg_new() with
the GFP_KERNEL parameter is called in fw_dnld_timeout(), which is a timer
handler. The call trace is shown below:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
Call Trace:
kmem_cache_alloc_node
__alloc_skb
nfc_genl_fw_download_done
call_timer_fn
__run_timers.part.0
run_timer_softirq
__do_softirq
...
nlmsg_new() with the GFP_KERNEL parameter may sleep during memory
allocation, and the timer handler runs as the result of
a "software interrupt", which must not call any function
that could sleep.
This patch changes the allocation mode of the netlink message from
GFP_KERNEL to GFP_ATOMIC to prevent the sleep-in-atomic bug. With the
GFP_ATOMIC flag, the allocation can be performed in atomic context.
Fixes: 9674da8759 ("NFC: Add firmware upload netlink command")
Fixes: 9ea7187c53 ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
upstream commit: 9cc02ede69
There are UAF bugs in rose_heartbeat_expiry(), rose_timer_expiry()
and rose_idletimer_expiry(). The root cause is that del_timer()
could not stop the timer handler that is running and the refcount
of sock is not managed properly.
One of the UAF bugs is shown below:
(thread 1) | (thread 2)
| rose_bind
| rose_connect
| rose_start_heartbeat
rose_release | (wait a time)
case ROSE_STATE_0 |
rose_destroy_socket | rose_heartbeat_expiry
rose_stop_heartbeat |
sock_put(sk) | ...
sock_put(sk) // FREE |
| bh_lock_sock(sk) // USE
The sock is deallocated by sock_put() in rose_release() and
then used by bh_lock_sock() in rose_heartbeat_expiry().
Although rose_destroy_socket() calls rose_stop_heartbeat(),
it could not stop the timer that is running.
The KASAN report triggered by POC is shown below:
BUG: KASAN: use-after-free in _raw_spin_lock+0x5a/0x110
Write of size 4 at addr ffff88800ae59098 by task swapper/3/0
...
Call Trace:
<IRQ>
dump_stack_lvl+0xbf/0xee
print_address_description+0x7b/0x440
print_report+0x101/0x230
? irq_work_single+0xbb/0x140
? _raw_spin_lock+0x5a/0x110
kasan_report+0xed/0x120
? _raw_spin_lock+0x5a/0x110
kasan_check_range+0x2bd/0x2e0
_raw_spin_lock+0x5a/0x110
rose_heartbeat_expiry+0x39/0x370
? rose_start_heartbeat+0xb0/0xb0
call_timer_fn+0x2d/0x1c0
? rose_start_heartbeat+0xb0/0xb0
expire_timers+0x1f3/0x320
__run_timers+0x3ff/0x4d0
run_timer_softirq+0x41/0x80
__do_softirq+0x233/0x544
irq_exit_rcu+0x41/0xa0
sysvec_apic_timer_interrupt+0x8c/0xb0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:default_idle+0xb/0x10
RSP: 0018:ffffc9000012fea0 EFLAGS: 00000202
RAX: 000000000000bcae RBX: ffff888006660f00 RCX: 000000000000bcae
RDX: 0000000000000001 RSI: ffffffff843a11c0 RDI: ffffffff843a1180
RBP: dffffc0000000000 R08: dffffc0000000000 R09: ffffed100da36d46
R10: dfffe9100da36d47 R11: ffffffff83cf0950 R12: 0000000000000000
R13: 1ffff11000ccc1e0 R14: ffffffff8542af28 R15: dffffc0000000000
...
Allocated by task 146:
__kasan_kmalloc+0xc4/0xf0
sk_prot_alloc+0xdd/0x1a0
sk_alloc+0x2d/0x4e0
rose_create+0x7b/0x330
__sock_create+0x2dd/0x640
__sys_socket+0xc7/0x270
__x64_sys_socket+0x71/0x80
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 152:
kasan_set_track+0x4c/0x70
kasan_set_free_info+0x1f/0x40
____kasan_slab_free+0x124/0x190
kfree+0xd3/0x270
__sk_destruct+0x314/0x460
rose_release+0x2fa/0x3b0
sock_close+0xcb/0x230
__fput+0x2d9/0x650
task_work_run+0xd6/0x160
exit_to_user_mode_loop+0xc7/0xd0
exit_to_user_mode_prepare+0x4e/0x80
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x4f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This patch takes a reference on the sock when functions such as
rose_start_heartbeat() start a timer, and drops that reference when
the timer finishes or is deleted by functions such as
rose_stop_heartbeat(). As a result, the UAF bugs are mitigated.
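The reference handling can be pictured with a single-threaded model of the race shown earlier. Everything here (struct sock_model and the helper names) is hypothetical and only mirrors the idea that an armed timer owns one reference of its own.

```c
struct sock_model {
    int refcnt;
    int timer_pending;
    int freed;
};

static void sock_hold_m(struct sock_model *s) { s->refcnt++; }

static void sock_put_m(struct sock_model *s)
{
    if (--s->refcnt == 0)
        s->freed = 1;           /* stands in for the real free */
}

/* Arming the timer takes a reference on behalf of the timer. */
static void start_timer_m(struct sock_model *s)
{
    if (!s->timer_pending) {
        s->timer_pending = 1;
        sock_hold_m(s);
    }
}

/* Deleting a pending timer returns the timer's reference. */
static void stop_timer_m(struct sock_model *s)
{
    if (s->timer_pending) {
        s->timer_pending = 0;
        sock_put_m(s);
    }
}

/* The handler has started running: the timer is no longer pending. */
static void timer_fire_begin_m(struct sock_model *s) { s->timer_pending = 0; }

/* The handler is done with the sock and drops the timer's reference. */
static void timer_fire_end_m(struct sock_model *s) { sock_put_m(s); }
```

In this model, a release that runs while the handler is in progress only drops the owner's reference, so the sock stays allocated until timer_fire_end_m().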
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Tested-by: Duoming Zhou <duoming@zju.edu.cn>
Link: https://lore.kernel.org/r/20220629002640.5693-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
upstream commit: 99a63d36cb
Domingo Dirutigliano and Nicola Guerrera report kernel panic when
sending nf_queue verdict with 1-byte nfta_payload attribute.
The IP/IPv6 stack pulls the IP(v6) header from the packet after the
input hook.
If user truncates the packet below the header size, this skb_pull() will
result in a malformed skb (skb->len < 0).
Fixes: 7af4cc3fa1 ("[NETFILTER]: Add "nfnetlink_queue" netfilter queue handler over nfnetlink")
Reported-by: Domingo Dirutigliano <pwnzer0tt1@proton.me>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
Some images are built from the 0009-kabi branch and are expected
to run in some virtualization environments, and might be used for
desktops. So enable the related drivers. More testing is needed, though.
config: enable more commonly used DRM drivers
config: enable CONFIG_DRM_AST
config: enable hyperv related configs
config: enable CONFIG_IGC
Signed-off-by: Yuehong Wu <yuehongwu@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
commit 57bc3d3ae8 upstream.
ax88179_rx_fixup() contains several out-of-bounds accesses that can be
triggered by a malicious (or defective) USB device, in particular:
- The metadata array (hdr_off..hdr_off+2*pkt_cnt) can be out of bounds,
causing OOB reads and (on big-endian systems) OOB endianness flips.
- A packet can overlap the metadata array, causing a later OOB
endianness flip to corrupt data used by a cloned SKB that has already
been handed off into the network stack.
- A packet SKB can be constructed whose tail is far beyond its end,
causing out-of-bounds heap data to be considered part of the SKB's
data.
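The kind of checks these three points call for can be sketched generically. This is not the driver's code: the helper names are hypothetical, and the size of one metadata entry is left as a parameter because the text above does not spell it out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if [off, off + len) lies inside a buffer of buf_len bytes. */
static bool range_in_buf(size_t buf_len, size_t off, size_t len)
{
    /* Written so that off + len is never computed and cannot overflow. */
    return off <= buf_len && len <= buf_len - off;
}

/* The metadata array must fit entirely inside the received buffer. */
static bool metadata_ok(size_t buf_len, size_t hdr_off, size_t pkt_cnt,
                        size_t entry_size)
{
    if (entry_size != 0 && pkt_cnt > SIZE_MAX / entry_size)
        return false;           /* pkt_cnt * entry_size would overflow */
    return range_in_buf(buf_len, hdr_off, pkt_cnt * entry_size);
}

/* A packet must end at or before the start of the metadata array. */
static bool packet_ok(size_t hdr_off, size_t pkt_off, size_t pkt_len)
{
    return range_in_buf(hdr_off, pkt_off, pkt_len);
}
```

Bounding each packet by hdr_off rather than by the buffer length covers both the overlap case and the tail-beyond-end case.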
I have tested that this can be used by a malicious USB device to send a
bogus ICMPv6 Echo Request and receive an ICMPv6 Echo Reply in response
that contains random kernel heap data.
It's probably also possible to get OOB writes from this on a
little-endian system somehow - maybe by triggering skb_cow() via IP
options processing -, but I haven't tested that.
Fixes: e2ca90c276 ("ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Honglin Li <honglinli@tencent.com>
Retry for up to 5000ms, disabling irq for 5ms at a time, to fix slave
core boot failure on the altramax platform.
The ampere altramax platform has 256 cpu cores across multiple nodes.
When CONFIG_HZ>=250, ticks are generated too frequently,
which causes slave core boot to fail (ampere cpu bug). cpu0's irq needs
to be disabled for >= 5ms each time, which reduces irq activity.
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
commit bbdbc11804 upstream.
TCR_EL1.TxSZ, which controls the VA space size, is configured by a
single kernel image to support either 48-bit or 52-bit VA space.
If the ARMv8.2-LVA optional feature is present and we are running
with a 64KB page size, then it is possible to use 52-bits of address
space for both userspace and kernel addresses. However, any kernel
binary that supports 52-bit must also be able to fall back to 48-bit
at early boot time if the hardware feature is not present.
Since TCR_EL1.T1SZ indicates the size of the memory region addressed by
TTBR1_EL1, export the same in vmcoreinfo. User-space utilities like
makedumpfile and crash-utility need to read this value from vmcoreinfo
for determining if a virtual address lies in the linear map range.
While at it also add documentation for TCR_EL1.T1SZ variable being
added to vmcoreinfo.
It indicates the size offset of the memory region addressed by
TTBR1_EL1.
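As a hedged illustration of how a user-space tool might interpret the exported value: on arm64 the region translated through TTBR1_EL1 covers 2^(64 - T1SZ) bytes at the top of the address space. The helper names below are made up for this sketch, and it does not attempt to locate the linear map itself.

```c
#include <stdint.h>

/* Number of virtual address bits implied by a T1SZ value. */
static unsigned int va_bits_from_t1sz(unsigned int t1sz)
{
    return 64u - t1sz;
}

/*
 * Lowest address of the TTBR1_EL1 region, i.e. the address whose top
 * T1SZ bits are all ones and whose remaining bits are zero.
 * Returns 0 for values outside 1..63, which this sketch does not handle.
 */
static uint64_t ttbr1_region_start(unsigned int t1sz)
{
    if (t1sz == 0 || t1sz >= 64)
        return 0;
    return ~UINT64_C(0) << (64u - t1sz);
}
```

For T1SZ = 16 this gives a 48-bit space starting at 0xffff000000000000; for T1SZ = 12 it gives a 52-bit space starting at 0xfff0000000000000.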
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: John Donnelly <john.p.donnelly@oracle.com>
Tested-by: Kamlakant Patel <kamlakantp@marvell.com>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kexec@lists.infradead.org
Link: https://lore.kernel.org/r/1589395957-24628-3-git-send-email-bhsharma@redhat.com
[catalin.marinas@arm.com: removed vabits_actual from the commit log]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[upstream commit 34829eec3b]
Libbpf's xsk part calls get_channels() API to retrieve the queue count
of the underlying driver so that XSKMAP is sized accordingly.
Implement that in veth so multi queue scenarios can work properly.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-14-maciej.fijalkowski@intel.com
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Like __xfrm_transport/mode_tunnel_prep(), this patch is to add
__xfrm_mode_beet_prep() to fix the transport_header for gso
segments, and reset skb mac_len, and pull skb data to the
proto inside esp.
This patch also fixes a panic, reported by ltp:
# modprobe esp4_offload
# runltp -f net_stress.ipsec_tcp
[ 2452.780511] kernel BUG at net/core/skbuff.c:109!
[ 2452.799851] Call Trace:
[ 2452.800298] <IRQ>
[ 2452.800705] skb_push.cold.98+0x14/0x20
[ 2452.801396] esp_xmit+0x17b/0x270 [esp4_offload]
[ 2452.802799] validate_xmit_xfrm+0x22f/0x2e0
[ 2452.804285] __dev_queue_xmit+0x589/0x910
[ 2452.806264] __neigh_update+0x3d7/0xa50
[ 2452.806958] arp_process+0x259/0x810
[ 2452.807589] arp_rcv+0x18a/0x1c
It was caused by the skb going to esp_xmit with a wrong transport
header.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
commit 85b6d24646 upstream.
Currently, the exit_shm() function is not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).
This is an attempt to fix the problem by extending exit_shm mechanism to
handle shm's destroy from several IPC ns'es.
To achieve that we do several things:
1. add a namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we
initialize this pointer by current task IPC ns
3. exit_shm() fully reworked such that it traverses over all shp's in
task->sysvshm.shm_clist and gets IPC namespace not from current task
as it was before but from shp's object itself, then call
shm_destroy(shp, ns).
Note: We need to be really careful here, because, as noted in (1), our
pointer to the IPC ns is not refcounted. To be on the safe side, we use
the special helper get_ipc_ns_not_zero(), which takes a reference on the
IPC ns only if the IPC ns is not in the "state of destruction".
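The "take a reference only if the count is not already zero" idea can be sketched with C11 atomics. struct ns_model and get_not_zero() are hypothetical names for this example; the kernel helper itself is not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ns_model { atomic_int count; };

/*
 * Increment the count unless it is zero. A zero count means the object
 * is already being destroyed, so no new reference may be handed out.
 */
static bool get_not_zero(struct ns_model *ns)
{
    int old = atomic_load(&ns->count);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&ns->count, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller that gets false back must treat the namespace as gone and skip it.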
Q/A
Q: Why can we access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorter than IPC namespace
lifetime, so, if we get shp object from the task->sysvshm.shm_clist
while holding task_lock(task) nobody can steal our namespace.
Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: No. It just fixes a non-covered case where a process may leave the IPC
namespace without getting the task->sysvshm.shm_clist list cleaned up.
Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
Fixes: ab602f7991 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
The process already runs with irqs disabled.
Use spin_lock instead of spin_lock_irq; otherwise
spin_unlock_irq may re-enable irqs at the wrong stage.
Call Trace:
_raw_spin_lock_irq+0x20/0x24
blkcg_print_blkgs+0x4f/0xe0
blkg_print_stat_bytes+0x44/0x50
cgroup_seqfile_show+0x4c/0xb0
kernfs_seq_show+0x21/0x30
seq_read+0x14c/0x3f0
kernfs_fop_read+0x35/0x190
__vfs_read+0x18/0x40
vfs_read+0x99/0x160
ksys_read+0x61/0xe0
__x64_sys_read+0x1a/0x20
do_syscall_64+0x47/0x140
entry_SYSCALL_64_after_hwframe+0x44/0xa9
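The difference between the two lock flavours can be shown with a toy model in which a single flag stands in for the CPU's interrupt state. The _m functions are made-up stand-ins and no real locking takes place. The only behaviour modelled is that the irq variant of unlock enables interrupts unconditionally.

```c
#include <stdbool.h>

static bool irqs_enabled = true;   /* models the CPU interrupt flag */

/* irq variants: disable on lock, enable unconditionally on unlock. */
static void spin_lock_irq_m(void)   { irqs_enabled = false; }
static void spin_unlock_irq_m(void) { irqs_enabled = true; }

/* Plain variants: leave the interrupt state untouched. */
static void spin_lock_m(void)   { }
static void spin_unlock_m(void) { }

/* The caller has already disabled irqs; report the state afterwards. */
static bool section_with_irq_variant(void)
{
    irqs_enabled = false;          /* the outer context disabled irqs */
    spin_lock_irq_m();
    spin_unlock_irq_m();
    return irqs_enabled;           /* irqs are now on, too early */
}

static bool section_with_plain_variant(void)
{
    irqs_enabled = false;          /* the outer context disabled irqs */
    spin_lock_m();
    spin_unlock_m();
    return irqs_enabled;           /* still off, as the caller expects */
}
```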
Fixes: f2519e1ed9a16 ("blkcg: add per blkcg diskstats")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Honglin Li <honglinli@tencent.com>
When disk failure happens and the array has a spare drive, resync thread
kicks in and starts to refill the spare. However it may get blocked by
a retry thread that resubmits failed IO to a mirror and itself can get
blocked on a barrier raised by the resync thread.
upstream commit id: fe630de009d0729584d79c78f43121e07c745fdc
Acked-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: shookliu <shookliu@tencent.com>
The following kernel panic is observed when doing kexec/kdump on machines
that use mptable and support x2apic:
[ 0.010090] Intel MultiProcessor Specification v1.4
[ 0.010688] MPTABLE: OEM ID: BOCHSCPU
[ 0.010886] MPTABLE: Product ID: 0.1
[ 0.011119] MPTABLE: APIC at: 0xFEE00000
[ 0.011332] BUG: unable to handle page fault for address: ffffffffff5fc020
[ 0.011702] #PF: supervisor read access in kernel mode
[ 0.011981] #PF: error_code(0x0000) - not-present page
[ 0.012256] PGD 25e15067 P4D 25e15067 PUD 25e17067 PMD 25e18067 PTE 0
[ 0.012603] Oops: 0000 [#1] SMP NOPTI
[ 0.012801] CPU: 0 PID: 0 Comm: swapper Not tainted 5.14.10-300.fc35.x86_64 #1
[ 0.013189] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[ 0.013658] RIP: 0010:native_apic_mem_read+0x2/0x10
[ 0.013924] Code: 14 25 20 cd e3 82 c3 90 bf 30 08 00 00 ff 14 25 18 cd e3 82 c3 cc cc cc 89 ff 89 b7 00 c0 5f ff c3 0f 1f 80 00 00 00 00 89 ff <8b> 87 00 c0 5f ff c3 0f 1f 80 00 00 00 0
[ 0.014930] RSP: 0000:ffffffff82e03e18 EFLAGS: 00010046
[ 0.015211] RAX: ffffffff81064840 RBX: ffffffffff240b6c RCX: ffffffff82f17428
[ 0.015593] RDX: c0000000ffffdfff RSI: 00000000ffffdfff RDI: 0000000000000020
[ 0.015977] RBP: ffff888023200000 R08: 0000000000000000 R09: ffffffff82e03c50
[ 0.016385] R10: ffffffff82e03c48 R11: ffffffff82f47468 R12: ffffffffff240b40
[ 0.016768] R13: ffffffffff200b30 R14: 0000000000000000 R15: 00000000000000d4
[ 0.017155] FS: 0000000000000000(0000) GS:ffffffff8365b000(0000) knlGS:0000000000000000
[ 0.017589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.017899] CR2: ffffffffff5fc020 CR3: 0000000025e10000 CR4: 00000000000006b0
[ 0.018284] Call Trace:
[ 0.018417] ? read_apic_id+0x15/0x30
[ 0.018616] ? register_lapic_address+0x76/0x97
[ 0.018864] ? default_get_smp_config+0x28b/0x42d
[ 0.019119] ? dmi_check_system+0x1c/0x60
[ 0.019337] ? acpi_boot_init+0x1d/0x4c3
[ 0.019550] ? setup_arch+0xb37/0xc2a
[ 0.019749] ? slab_is_available+0x5/0x10
[ 0.019969] ? start_kernel+0x61/0x980
[ 0.020173] ? load_ucode_bsp+0x4c/0xcd
[ 0.020380] ? secondary_startup_64_no_verify+0xc2/0xcb
[ 0.020664] Modules linked in:
[ 0.020830] CR2: ffffffffff5fc020
[ 0.021012] random: get_random_bytes called from oops_exit+0x35/0x60 with crng_init=0
[ 0.021015] ---[ end trace c9e569df3bdbefd3 ]---
Checking the init order, we have:
setup_arch()
  check_x2apic()      <-- x2apic is enabled by the first kernel before kexec;
                          this sets x2apic_mode = 1 so that later probes
                          recognize the pre-enabled x2apic.
  ....
  acpi_boot_init();   <-- With ACPI MADT, this switches the apic driver
                          to x2apic, but it does nothing with mptable.
  x86_dtb_init();
  get_smp_config();
    default_get_smp_config();
      check_physptr();
        smp_read_mpc();
          register_lapic_address(); <-- panic here
  init_apic_mappings();
  ....
The problem here is that mpparse needs to read some boot info from the
apic, so it calls register_lapic_address() early. But without MADT, the
apic driver is still apic_flat, which attempts to use the MMIO interface.
That interface is never mapped since commit 0450193bff ("x86, x2apic:
Don't map lapic addr for preenabled x2apic systems").
Simply mapping it won't work either, as in x2apic mode the MMIO interface
is not really available (Intel SDM Volume 3A 10.12.2), and later code will
fail with other errors. So do the apic driver probe early instead.
With pre-enabled x2apic, the probe will recognize it and switch to
the right driver just fine.
This issue is currently only seen with kexec/kdump, where the first kernel
enables x2apic and leaves it enabled for the second kernel.
It can be easily reproduced with qemu: use -no-acpi and enable x2apic.
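The ordering problem can be sketched with a toy user-space model. All names below (struct toy_boot, toy_probe_driver() and so on) are hypothetical and only mirror the description above; none of this is the kernel's actual apic code. The model assumes that the flat driver needs the MMIO window mapped, and that the x2apic driver does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_driver { TOY_APIC_FLAT, TOY_APIC_X2 };

struct toy_boot {
    bool x2apic_mode;       /* set early when x2apic was left enabled */
    bool mmio_mapped;       /* LAPIC MMIO window mapped? */
    enum toy_driver driver; /* currently selected apic driver */
};

/* Models the driver probe: switch to x2apic if it is already enabled. */
static void toy_probe_driver(struct toy_boot *b)
{
    if (b->x2apic_mode)
        b->driver = TOY_APIC_X2;
}

/* Models reading the APIC ID; false stands for the page fault in the oops. */
static bool toy_read_apic_id(const struct toy_boot *b)
{
    if (b->driver == TOY_APIC_X2)
        return true;        /* no MMIO mapping required in this model */
    return b->mmio_mapped;  /* flat driver dereferences the MMIO window */
}

/* Models the mptable path, with or without the early probe. */
static bool toy_parse_mptable(struct toy_boot *b, bool probe_early)
{
    assert(b != NULL);
    if (probe_early)
        toy_probe_driver(b);
    return toy_read_apic_id(b);
}
```

With x2apic_mode set and mmio_mapped clear, the model fails without the early probe and succeeds with it, matching the behavior described in this changelog.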
Signed-off-by: Kairui Song <kasong@tencent.com>
Core 0 of some server models with ARM architecture cannot be taken
offline, so it is rejected by default.
Signed-off-by: Chun Liu <kaicliu@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
[ Upstream commit 5ec7d18d18 ]
This patch is to delay the endpoint free by calling call_rcu() to fix
another use-after-free issue in sctp_sock_dump():
BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
Call Trace:
__lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
_raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
spin_lock_bh include/linux/spinlock.h:334 [inline]
__lock_sock+0x203/0x350 net/core/sock.c:2253
lock_sock_nested+0xfe/0x120 net/core/sock.c:2774
lock_sock include/net/sock.h:1492 [inline]
sctp_sock_dump+0x122/0xb20 net/sctp/diag.c:324
sctp_for_each_transport+0x2b5/0x370 net/sctp/socket.c:5091
sctp_diag_dump+0x3ac/0x660 net/sctp/diag.c:527
__inet_diag_dump+0xa8/0x140 net/ipv4/inet_diag.c:1049
inet_diag_dump+0x9b/0x110 net/ipv4/inet_diag.c:1065
netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244
__netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352
netlink_dump_start include/linux/netlink.h:216 [inline]
inet_diag_handler_cmd+0x2ce/0x3f0 net/ipv4/inet_diag.c:1170
__sock_diag_cmd net/core/sock_diag.c:232 [inline]
sock_diag_rcv_msg+0x31d/0x410 net/core/sock_diag.c:263
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477
sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:274
This issue occurs when asoc is peeled off and the old sk is freed after
getting it by asoc->base.sk and before calling lock_sock(sk).
To prevent the sk free, as a holder of the sk, ep should be alive when
calling lock_sock(). This patch uses call_rcu() and moves sock_put and
ep free into sctp_endpoint_destroy_rcu(), so that it's safe to try to
hold the ep under rcu_read_lock in sctp_transport_traverse_process().
If sctp_endpoint_hold() returns true, it means this ep is still alive
and we have held it and can continue to dump it; if it returns false,
it means this ep is dead and can be freed after rcu_read_unlock, and
we should skip it.
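The hold semantics described above amount to "take a reference unless the count is already zero". A minimal user-space sketch with C11 atomics follows. struct toy_ep, toy_ep_hold() and toy_ep_put() are hypothetical names for illustration and are not the SCTP code; the sketch assumes the hold is implemented as an increment-unless-zero on the endpoint refcount.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object whose free is deferred (e.g. by RCU). */
struct toy_ep {
    atomic_int refcnt;
};

/* Try to take a reference; fails once the count has dropped to zero. */
static bool toy_ep_hold(struct toy_ep *ep)
{
    int old = atomic_load(&ep->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&ep->refcnt, &old, old + 1))
            return true;    /* ep is alive and now held by us */
    }
    return false;           /* ep is dead: the caller must skip it */
}

/* Drop a reference; returns true when this was the last one. */
static bool toy_ep_put(struct toy_ep *ep)
{
    int old = atomic_fetch_sub(&ep->refcnt, 1);

    assert(old > 0);
    return old == 1;
}
```

Because a dead object is only freed after a grace period, a reader inside the read-side critical section can still safely inspect the counter and see zero, instead of touching freed memory.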
In sctp_sock_dump(), after locking the sk, if this ep is different from
tsp->asoc->ep, it means during this dumping, this asoc was peeled off
before calling lock_sock(), and the sk should be skipped; if this ep is
the same as tsp->asoc->ep, it means no peeloff happened on this asoc,
and due to lock_sock, no peeloff will happen either until release_sock.
Note that delaying endpoint free won't delay the port release, as the
port release happens in sctp_endpoint_destroy() before calling call_rcu().
Also, freeing endpoint by call_rcu() makes it safe to access the sk by
asoc->base.sk in sctp_assocs_seq_show() and sctp_rcv().
Thanks to Jones for bringing this issue up.
v1->v2:
- improve the changelog.
- add kfree(ep) into sctp_endpoint_destroy_rcu(), as Jakub noticed.
Reported-by: syzbot+9276d76e83e3bcde6c99@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Fixes: d25adbeb0c ("sctp: fix an use-after-free issue in sctp_sock_dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>
commit 520778042c upstream.
Since 3e135cd499 ("netfilter: nft_dynset: dynamic stateful expression
instantiation"), it is possible to attach stateful expressions to set
elements.
cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate
and destroy phase") introduces conditional destruction on the object to
accommodate transaction semantics.
nft_expr_init() calls expr->ops->init() first, then checks for
NFT_STATEFUL_EXPR. This still allows initializing a non-stateful
lookup expression which points to a set, which might lead to a UAF since
the set is not properly detached from the set->binding in this case.
Anyway, this combination is nonsensical from the nf_tables perspective.
This patch fixes this problem by checking for NFT_STATEFUL_EXPR before
expr->ops->init() is called.
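The reordering can be sketched in user space as follows. The toy_* names, the flag value and the -1 return code are hypothetical and only illustrate the "validate before init" ordering; they are not the nf_tables implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EXPR_STATEFUL 0x1   /* hypothetical stand-in for NFT_STATEFUL_EXPR */

struct toy_expr_type {
    unsigned int flags;
    int (*init)(void);          /* may bind the expression to other objects */
};

static int toy_init_calls;      /* counts how often init() actually ran */

static int toy_init(void)
{
    toy_init_calls++;
    return 0;
}

/*
 * Check the stateful flag first, so that a rejected expression never
 * runs init() and therefore leaves no binding behind to clean up.
 * Returns 0 on success, -1 when the expression type is rejected.
 */
static int toy_expr_alloc_checked(const struct toy_expr_type *type)
{
    assert(type != NULL && type->init != NULL);

    if (!(type->flags & TOY_EXPR_STATEFUL))
        return -1;              /* rejected before any side effect */

    return type->init();
}
```

With the check placed after init(), the rejected case would already have run its side effects, which is the window described above.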
The reporter provides a KASAN splat and a poc reproducer (similar to
those autogenerated by syzbot to report use-after-free errors). It is
unknown to me if they are using syzbot or if they use a similar automated
tool to locate the bug that they are reporting.
For the record, this is the KASAN splat.
[ 85.431824] ==================================================================
[ 85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20
[ 85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776
[ 85.434756]
[ 85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G W 5.18.0+ #2
[ 85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Fixes: 0b2d8a7b63 ("netfilter: nf_tables: add helper functions for expression handling")
Reported-and-tested-by: Aaron Adams <edg-e@nccgroup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[Ajay: Regenerated the patch for v5.4.y]
Signed-off-by: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>