commit 30a996fbb3 upstream.
Move the definitions of intel_idle() and intel_idle_s2idle() before
the definitions of cpuidle_state structures referring to them to
avoid having to use additional declarations of them (and drop those
declarations).
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit bc721c1e45 upstream.
Add proper kerneldoc descriptions to intel_idle() and
intel_idle_s2idle(), annotate the latter with __cpuidle and
reorder the declarations of local variables in both of them to
reflect the mwait_idle_with_hints() arguments order.
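For illustration, the added kerneldoc for intel_idle_s2idle() looks
roughly like this (a paraphrase of the upstream comment, not a
verbatim copy):

/**
 * intel_idle_s2idle - Simplified "enter" callback for suspend-to-idle.
 * @dev: cpuidle device of the target CPU.
 * @drv: cpuidle driver (assumed to point to intel_idle's driver object).
 * @index: Target idle state index.
 *
 * Use the MWAIT instruction to notify the processor that the CPU
 * represented by @dev is idle and it can try to enter the idle state
 * corresponding to @index.
 */
static __cpuidle void intel_idle_s2idle(struct cpuidle_device *dev,
					struct cpuidle_driver *drv, int index)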
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 40ab82e08d upstream.
The lapic_timer_reliable_states variable really takes only two values
and some arithmetic in intel_idle() related to comparing it with the
target C-state's MWAIT hint value is unnecessary.
Simplify the code by replacing lapic_timer_reliable_states with a
bool variable, lapic_timer_always_reliable, and dropping the
LAPIC_TIMER_ALWAYS_RELIABLE symbol along with the excess
computations in intel_idle().
While at it, add a comment explaining the branch taken in intel_idle()
if the LAPIC timer is only reliable in C1 and modify the related debug
message in intel_idle_init() accordingly (the modification of this
message is the only expected functional impact of the change made
here).
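A condensed sketch of the simplified check in intel_idle(), assuming
the 5.4 code layout (local-variable and tick handling elided):

static bool lapic_timer_always_reliable;

	/* In intel_idle(): */
	if (!static_cpu_has(X86_FEATURE_ARAT) && !lapic_timer_always_reliable) {
		/*
		 * Switch over to one-shot tick broadcast if the target
		 * C-state is deeper than C1, where the LAPIC timer is
		 * not reliable.
		 */
		if ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) {
			tick = true;
			tick_broadcast_enter();
		} else {
			tick = false;
		}
	}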
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 4dcb78ee57 upstream.
In certain system configurations it may not be desirable to use some
C-states assumed to be available by intel_idle and the driver needs
to be prevented from using them even before the cpuidle sysfs
interface becomes accessible to user space. Currently, the only way
to achieve that is by setting the 'max_cstate' module parameter to a
value lower than the index of the shallowest of the C-states in
question, but that may be overly intrusive, because it effectively
makes all of the idle states deeper than the 'max_cstate' one go
away (and the C-state to avoid may be in the middle of the range
normally regarded as available).
To allow that limitation to be overcome, introduce a new module
parameter called 'states_off' to represent a list of idle states to
be disabled by default in the form of a bitmask and update the
documentation to cover it.
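A minimal sketch of the parameter and its effect (the internal mask
name and the use of CPUIDLE_FLAG_OFF are assumptions about how this
lands in the 5.4 backport):

static unsigned int disabled_states_mask;
module_param_named(states_off, disabled_states_mask, uint, 0444);
MODULE_PARM_DESC(states_off, "Mask of disabled idle states");

	/* In the state-table setup loop: */
	if (disabled_states_mask & BIT(cstate))
		drv->states[drv->state_count].flags |= CPUIDLE_FLAG_OFF;

For example, booting with intel_idle.states_off=4 would leave idle
state 2 disabled by default while keeping the other states available.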
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 3a5be9b8f4 upstream.
For diagnostics, it is generally useful to be able to make intel_idle
take the system's ACPI tables into consideration even if that is not
required for the processor model in there, so introduce a new module
parameter, 'use_acpi', to make that happen and update the documentation
to cover it.
While at it, fix the 'no_acpi' module parameter name in the
documentation.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 86e9466ae6 upstream.
Move the irtl_ns_units[] definition into irtl_2_usec() which is the
only user of it, use div_u64() for the division in there (as the
divisor is small enough) and use the NSEC_PER_USEC symbol for the
divisor. Also convert the irtl_2_usec() comment to a proper
kerneldoc one.
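A sketch of the resulting helper, assuming the usual IRTL register
layout (10-bit time value, 3-bit unit select):

static unsigned long long irtl_2_usec(unsigned long long irtl)
{
	static const unsigned int irtl_ns_units[] = {
		1, 32, 1024, 32768, 1048576, 33554432, 0, 0
	};
	unsigned long long ns;

	if (!irtl)
		return 0;

	/* Select the unit and scale the 10-bit time value to ns. */
	ns = irtl_ns_units[(irtl >> 10) & 0x7];

	return div_u64((irtl & 0x3FF) * ns, NSEC_PER_USEC);
}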
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 1aefbd7aeb upstream.
Move intel_idle_verify_cstate(), auto_demotion_disable() and
c1e_promotion_disable() closer to their callers.
While at it, annotate intel_idle_verify_cstate() with __init,
as it is only used during the initialization of the driver.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 095928ae48 upstream.
Annotate the functions that are only used at the initialization time
with __init and the data structures used by them with __initdata or
__initconst.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 3d3a1ae9b4 upstream.
Notice that intel_idle_state_table_update() only needs to be called
if icpu is not NULL, so fold it into intel_idle_init_cstates_icpu(),
and pass a pointer to the driver object to
intel_idle_cpuidle_driver_init() as an argument instead of
referencing it locally in there.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit a6c86e3362 upstream.
There is no particular reason why intel_idle_probe() needs to be
a separate function and folding it into intel_idle_init() causes
the code to be somewhat easier to follow, so do just that.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit cbd2c4c25d upstream.
The __setup_broadcast_timer() static function is only called in one
place and "true" is passed to it as the argument in there, so
effectively it is a wrapper around tick_broadcast_enable().
To simplify the code, call tick_broadcast_enable() directly instead
of __setup_broadcast_timer() and drop the latter.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
Add the 'sysctl_tcp_wnd_shrink' knob to enable or disable TCP window
shrinking. It is disabled by default.
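A minimal sketch of such a knob, assuming a plain boolean int under
net.ipv4 (the table placement and handler are assumptions about this
out-of-tree change):

int sysctl_tcp_wnd_shrink __read_mostly;	/* 0 = disabled (default) */

static struct ctl_table tcp_wnd_shrink_table[] = {
	{
		.procname	= "tcp_wnd_shrink",
		.data		= &sysctl_tcp_wnd_shrink,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{ }
};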
Signed-off-by: Menglong Dong <imagedong@tencent.com>
With GSO enabled, TCP write queues have less overhead, which makes
some applications run faster.
Results of redis-benchmark follow:
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
In the original logic, a zero-window probe can be raised not only on
a zero window, but also in other cases, such as when an MTU probe
fails. Therefore, modify tcp_probe0_needed() to keep it compatible
with the original logic.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
upstream commit: c2407cf7d2
Ever since commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common()
logic") we've had some very occasional reports of BUG_ON(PageWriteback)
in write_cache_pages(), which we thought we already fixed in commit
073861ed77 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
But syzbot just reported another one, even with that commit in place.
And it turns out that there's a simpler way to trigger the BUG_ON() than
the one Hugh found with page re-use. It all boils down to the fact that
the page writeback is ostensibly serialized by the page lock, but that
isn't actually really true.
Yes, the people _setting_ writeback all do so under the page lock, but
the actual clearing of the bit - and waking up any waiters - happens
without any page lock.
This gives us this fairly simple race condition:
CPU1 = end previous writeback
CPU2 = start new writeback under page lock
CPU3 = write_cache_pages()
      CPU1            CPU2            CPU3
      ----            ----            ----

end_page_writeback()
  test_clear_page_writeback(page)
  ... delayed...

                lock_page();
                set_page_writeback()
                unlock_page()

                                lock_page()
                                wait_on_page_writeback();

  wake_up_page(page, PG_writeback);
  .. wakes up CPU3 ..

                                BUG_ON(PageWriteback(page));
where the BUG_ON() happens because we woke up the PG_writeback bit
because of the _previous_ writeback, but a new one had already been
started because the clearing of the bit wasn't actually atomic wrt the
actual wakeup or serialized by the page lock.
The reason this didn't use to happen was that the old logic in waiting
on a page bit would just loop if it ever saw the bit set again.
The nice proper fix would probably be to get rid of the whole "wait for
writeback to clear, and then set it" logic in the writeback path, and
replace it with an atomic "wait-to-set" (ie the same as we have for page
locking: we set the page lock bit with a single "lock_page()", not with
"wait for lock bit to clear and then set it").
However, our current model for writeback is that the waiting for the
writeback bit is done by the generic VFS code (ie write_cache_pages()),
but the actual setting of the writeback bit is done much later by the
filesystem ".writepages()" function.
IOW, to make the writeback bit have that same kind of "wait-to-set"
behavior as we have for page locking, we'd have to change our roughly
~50 different writeback functions. Painful.
Instead, just make "wait_on_page_writeback()" loop on the very unlikely
situation that the PG_writeback bit is still set, basically re-instating
the old behavior. This is very non-optimal in case of contention, but
since we only ever set the bit under the page lock, that situation is
controlled.
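The change is essentially the following loop in mm/page-writeback.c
(sketch based on the upstream fix):

void wait_on_page_writeback(struct page *page)
{
	/* Re-check after every wakeup: a new writeback may have started. */
	while (PageWriteback(page)) {
		trace_wait_on_page_writeback(page, page_mapping(page));
		wait_on_page_bit(page, PG_writeback);
	}
}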
Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
Fixes: 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
commit 85cd39af14 upstream.
KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine. The directory name is built from the process
pid and a VM fd. While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:
[ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!
Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.
When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked. The slow memory leak was spotted by syzkaller, where it
caused OOM reports.
Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.
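A condensed sketch of the approach in kvm_create_vm_debugfs() (error
handling and stat allocation omitted; details differ by kernel
version):

	char dir_name[ITOA_MAX_LEN * 2];
	struct dentry *dent;

	snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
	dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
	if (dent) {
		pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
				    dir_name);
		dput(dent);
		dent = ERR_PTR(-EEXIST);
	} else {
		dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
	}
	if (IS_ERR(dent))
		return 0;	/* debugfs failures are not fatal */
	kvm->debugfs_dentry = dent;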
Cc: stable@vger.kernel.org
Fixes: 536a6f88c4 ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
[upstream commit 35dfb01314]
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
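A sketch of the idea on the deletion path (names follow the upstream
patch loosely; the delayed-work field on struct netns_ipvs is added by
the patch):

	/* In __ip_vs_del_dest(), after the destination is unlinked: */
	if (sysctl_expire_nodest_conn(ipvs)) {
		/* Expire connections to deleted destinations shortly. */
		queue_delayed_work(system_long_wq,
				   &ipvs->expire_nodest_conn_work, 1);
	}

The work handler then walks the connection table and expires every
connection whose destination is no longer available.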
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Window shrinking is not allowed and also not handled for now, but it
is needed in some cases.
In the original logic, a zero-window probe is triggered only when
there is no data at all in the retransmit queue and the receive window
can't hold the data of the first packet in the send queue.
Now, change it and trigger the zero-window probe in these cases (see
the sketch after this list):
- the retransmit queue has data and its first packet is not within
the receive window
- there is no data in the retransmit queue and the first packet in the
send queue is out of the end of the receive window
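A sketch of tcp_probe0_needed() covering the two cases above (the body
is illustrative; the real patch also keeps the original zero-window
condition):

static bool tcp_probe0_needed(const struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	const struct sk_buff *skb = tcp_rtx_queue_head(sk);

	/* Case 1: the retransmit queue has data and its first packet
	 * does not fit within the receive window.
	 */
	if (skb)
		return after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp));

	/* Case 2: the retransmit queue is empty and the first packet
	 * in the send queue ends beyond the end of the receive window.
	 */
	skb = tcp_send_head(sk);
	return skb && after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp));
}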
Signed-off-by: Menglong Dong <imagedong@tencent.com>
For now, the skb is dropped when there is no memory, which makes the
client keep retransmitting until it times out, and that is not
friendly to users.
Therefore, force the current socket to receive one packet when
protocol memory is over the limit. The socket then stays in 'no mem'
status until protocol memory becomes available again.
While a socket is in 'no mem' status, its receive window becomes 0,
which means the window shrinks. The sender needs to handle such window
shrinking properly, which is done in the next commit.
On arm64 systems with little memory below 4G, the capture kernel
cannot use DMA memory. Therefore, it is necessary to enable
CONFIG_KEXEC_FILE and fix the reserved-memory handling so that low
memory is passed to the kdump kernel.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Liu Chun <kaicliu@tencent.com>
Upstream: 3751e728ce
Link: 40e94ab32e
commit 3751e728ce
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date: Mon Dec 16 11:12:47 2019 +0900
arm64: kexec_file: add crash dump support
Enabling crash dump (kdump) includes
* prepare contents of ELF header of a core dump file, /proc/vmcore,
using crash_prepare_elf64_headers(), and
* add two device tree properties, "linux,usable-memory-range" and
"linux,elfcorehdr", which represent respectively a memory range
to be used by crash dump kernel and the header's location
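For illustration, the two properties end up being appended to the
kdump kernel's device tree roughly like this (buffer handling elided;
the fields are the ones used by the arm64 kexec_file code):

	int off, ret;

	off = fdt_path_offset(dtb, "/chosen");
	ret = fdt_appendprop_addrrange(dtb, 0, off,
				       "linux,usable-memory-range",
				       crashk_res.start,
				       crashk_res.end - crashk_res.start + 1);
	if (!ret)
		ret = fdt_appendprop_addrrange(dtb, 0, off,
					       "linux,elfcorehdr",
					       image->arch.elf_headers_mem,
					       image->arch.elf_headers_sz);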
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Tested-and-reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Upstream: c273a2bd8a
Link: 887436bdb7
commit c273a2bd8a
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date: Mon Dec 9 12:03:44 2019 +0900
libfdt: include fdt_addresses.c
In the implementation of kexec_file_load()-based kdump for arm64,
fdt_appendprop_addrrange() will be needed.
So include fdt_addresses.c in making libfdt.
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Yi Li <adamliyi@msn.com>
Link: 696027f109
The patch b2da6ad294
(arm64: kdump: reimplement crashkernel=X) depends on commit 1a8e1cef76
("arm64: use both ZONE_DMA and ZONE_DMA32").
Commit 1a8e1cef76 is not ported to the 5.4 kernel, so use
arm64_dma_phys_limit instead.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 023deaec32
For arm64, the behavior of crashkernel=X has been changed: it tries a
low allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA
is disabled) and falls back to a high allocation if that fails.
We can also use "crashkernel=X,high" to select a high region above the
DMA zone, which also tries to allocate at least 256M of low memory in
the DMA zone automatically (or in the DMA32 zone if CONFIG_ZONE_DMA is
disabled).
"crashkernel=Y,low" can be used to allocate a specified amount of low
memory.
So update the Documentation.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 2012a3b392
When reserving crashkernel in high memory, some low memory is reserved
for crash dump kernel devices and never mapped by the first kernel.
This memory range is advertised to crash dump kernel via DT property
under /chosen,
linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
We reuse the DT property linux,usable-memory-range and pass the low
memory region as the second range "BASE2 SIZE2", which keeps
compatibility with existing user space and older kdump kernels.
The crash dump kernel reads this property at boot time and calls
memblock_add() to add the low memory region after
memblock_cap_memory_range() has been called.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: c8013ee6cd
Make the functions reserve_crashkernel[_low]() generic for x86 and
arm64. Since the reserve_crashkernel[_low]() implementations are quite
similar on other architectures as well, more users can adopt this
later.
So add CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL to arch/Kconfig and
select it from X86 and ARM64.
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 70e586365f
There are the following issues in arm64 kdump:
1. We use crashkernel=X to reserve the crash kernel below 4G, which
fails when there is not enough low memory.
2. If the crash kernel is reserved above 4G, the crash dump kernel
fails to boot because there is no low memory available for
allocation.
3. Since commit 1a8e1cef76 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
devices in the crash dump kernel that need ZONE_DMA will fail to
allocate.
To solve these issues, change the behavior of crashkernel=X and
introduce crashkernel=X,[high,low]. crashkernel=X tries a low
allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is
disabled) and falls back to a high allocation if it fails.
We can also use "crashkernel=X,high" to select a region above the DMA
zone, which also tries to allocate at least 256M in the DMA zone
automatically (or in the DMA32 zone if CONFIG_ZONE_DMA is disabled).
"crashkernel=Y,low" can be used to allocate a specified amount of low
memory.
Another minor change: there may now be two regions reserved for the
crash dump kernel. To distinguish the low region from the high one
without affecting existing kexec-tools, rename the low region to
"Crash kernel (low)".
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 667118f8c1
Introduce the macro CRASH_ALIGN for alignment, CRASH_ADDR_LOW_MAX for
the upper bound of low crash memory, and CRASH_ADDR_HIGH_MAX for the
upper bound of high crash memory, and use the macros instead of
hard-coded values.
Besides, to keep consistent with x86, use CRASH_ALIGN as the lower
bound of the crash kernel reservation.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: b332ab8970
Move macro vmcore_elf_check_arch_cross from arch/x86/include/asm/kexec.h
to arch/x86/include/asm/elf.h to fix the following compiling warning:
In file included from arch/x86/kernel/setup.c:39:0:
./arch/x86/include/asm/kexec.h:77:0: warning: "vmcore_elf_check_arch_cross" redefined
# define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
In file included from arch/x86/kernel/setup.c:9:0:
./include/linux/crash_dump.h:39:0: note: this is the location of the previous definition
#define vmcore_elf_check_arch_cross(x) 0
The root cause is that vmcore_elf_check_arch_cross under CONFIG_CRASH_CORE
depends on CONFIG_KEXEC_CORE. Commit 532b66d2279d ("x86: kdump: move
reserve_crashkernel[_low]() into crash_core.c") triggered the issue.
As suggested by Mike, simply move vmcore_elf_check_arch_cross from
arch/x86/include/asm/kexec.h to arch/x86/include/asm/elf.h to fix
the warning.
Fixes: 532b66d2279d ("x86: kdump: move reserve_crashkernel[_low]() into crash_core.c")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8cb8686864
Make the functions reserve_crashkernel[_low]() generic.
Arm64 will use them to reimplement crashkernel=X.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8ec4a816f2
We will make the function reserve_crashkernel() generic. The
xen_pv_domain() check in reserve_crashkernel() is relevant only to
x86, as is insert_resource() in reserve_crashkernel[_low]().
So move the xen_pv_domain() check and the insert_resource() calls to
setup_arch() to keep them x86-specific.
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: a2e0b4351d
To make the function reserve_crashkernel() generic, replace some
hard-coded numbers with the macro CRASH_ADDR_LOW_MAX.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 8882ba540e
The lower bounds of the crash kernel reservation and the crash kernel
low reservation are different; use the consistent value CRASH_ALIGN.
Suggested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 873384fe79
Move CRASH_ALIGN to the header asm/kexec.h for later use. Besides, the
alignment of crash kernel regions on x86 is 16M (CRASH_ALIGN), but the
function reserve_crashkernel() also used a 1M alignment. So replace
the hard-coded 1M alignment with the macro CRASH_ALIGN.
Suggested-by: Dave Young <dyoung@redhat.com>
Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
This conflicts with upstream's kdump high reservation support, and we
already have CONFIG_ZONE_DMA32 set, so we have:
ARCH_LOW_ADDRESS_LIMIT = min(offset + (1ULL << 32), memblock_end_of_DRAM());
which already limits the address below 4G, so this hard-coded limit is
redundant.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
This reverts commit 918f50807eccd63d482ef4cf778b1d2b416770a9.
That commit forced COW breaking on write, which caused page usage to
increase a lot. Upstream, commit 376a34efa ("mm/gup: refactor and
de-duplicate gup_fast() code") provides another way to fix the fork
security issue of COW, after which the buggy commit was reverted by
commit a308c71bf1 ("mm/gup: Remove enfornced COW mechanism").
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Remove the function as the last reference has gone away with the do_wp_page()
changes.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1a0cf26323)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
commit 09854ba94c upstream.
How about we just make sure we're the only possible valid user of the
page before we bother to reuse it?
Simplify, simplify, simplify.
And get rid of the nasty serialization on the page lock at the same time.
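A condensed sketch of the new reuse test in do_wp_page() for anonymous
pages (the copy path and surrounding code are elided):

	struct page *page = vmf->page;

	/* Reuse in place only if we hold the sole reference. */
	if (PageKsm(page) || page_count(page) != 1)
		goto copy;
	if (!trylock_page(page))
		goto copy;
	if (PageKsm(page) || page_mapcount(page) != 1 ||
	    page_count(page) != 1) {
		unlock_page(page);
		goto copy;
	}
	/* We have the only reference: reuse the page for the write. */
	unlock_page(page);
	wp_page_reuse(vmf);
	return VM_FAULT_WRITE;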
[peterx: add subject prefix]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 09854ba94c)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Conflicts:
mm/memory.c
Enable CONFIG_IOSCHED_BFQ and CONFIG_BFQ_GROUP_IOSCHED for ARM to
support the BFQ I/O scheduler.
Signed-off-by: Yuehong Wu <yuehongwu@tencent.com>
Signed-off-by: Bin Lai <robinlai@tencent.com>
[upstream commit 0550cfe8c2]
secid_to_secctx is not stackable, and since the BPF LSM registers this
hook by default, the call_int_hook logic, which "bails on fail", is
not suitable; it causes issues when other LSMs register this hook and
eventually breaks Audit.
In order to fix this, directly iterate over the security hooks instead
of using call_int_hook, as suggested in:
https://lore.kernel.org/bpf/9d0eb6c6-803a-ff3a-5603-9ad6d9edfc00@schaufler-ca.com/#t
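The resulting security_secid_to_secctx() iterates the hook list
directly, roughly as follows (sketch of the upstream change):

int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen)
{
	struct security_hook_list *hp;
	int rc;

	/*
	 * Currently, only one LSM can implement secid_to_secctx (i.e.
	 * the hook is not "stackable"); return the first non-default
	 * result instead of bailing on the first failure.
	 */
	hlist_for_each_entry(hp, &security_hook_heads.secid_to_secctx, list) {
		rc = hp->hook.secid_to_secctx(secid, secdata, seclen);
		if (rc != LSM_RET_DEFAULT(secid_to_secctx))
			return rc;
	}

	return LSM_RET_DEFAULT(secid_to_secctx);
}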
Fixes: 98e828a065 ("security: Refactor declaration of LSM hooks")
Fixes: 625236ba38 ("security: Fix the default value of secid_to_secctx hook")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200520125616.193765-1-kpsingh@chromium.org
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Add a fixup in fast_copy_page. This feature is disabled by default;
set vm.fast_copy_page_enabled to enable it.
Signed-off-by: soonflywang <soonflywang@tencent.com>
Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
When running on an Arm server, there is usually a NEON/VFP extension
on the server CPU. This patch leverages SIMD instructions to speed up
the current copy_page().
Signed-off-by: soonflywang <soonflywang@tencent.com>
Signed-off-by: Chengdong Li <chengdongli@tencent.com>
Reviewed-by: robinlai <robinlai@tencent.com>
There could be a use-after-free issue in dmi_sysfs_register_handle.
While handling specializations, entry->child can be freed if an error
occurs, but kobject_put() is then called on it after the free.
So set entry->child to NULL to avoid the case above.
Reported-by: loydlv <loydlv@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
The data could be freed before the transmit completes if the operation
is nonblocking. In this case, the regular free could lead to a double
free. So add the return value '-EPERM' to mark the case above.
Reported-by: loydlv <loydlv@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
commit 52762efa2b upstream.
The function displback_changed() has the call chain
displback_connect(front_info) -> xen_drm_drv_init(front_info).
We can see that drm_info is assigned to front_info->drm_info and that
drm_info is freed in the failure branch of xen_drm_drv_init().
Later displback_disconnect(front_info) is called, and it calls
xen_drm_drv_fini(front_info), which causes a use-after-free via the
drm_info = front_info->drm_info statement.
This patch does two things. First, it fixes the fail label taken when
drm_info = kzalloc() fails, which still freed drm_info.
Second, it sets front_info->drm_info to NULL to avoid the UAF.
Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210323014656.10068-1-lyl2019@mail.ustc.edu.cn
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reviewed-by: Robinlai <robinlai@tencent.com>