Commit Graph

873692 Commits

Author SHA1 Message Date
Alex Shi c0ddf78f1f dist: release 5.4.119-20.0009.28
Upstream: no

Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:39 +08:00
Christoph Hellwig b868ff6759 virtio-blk: remove VIRTIO_BLK_F_SCSI support
[upstream commit: 782e067dba]

Since the need for a special flag to support SCSI passthrough on a
block device was added in May 2017 the SCSI passthrough support in
virtio-blk has been disabled.  It has always been a bad idea
(just ask the original author..) and we have virtio-scsi for proper
passthrough.  The feature also never made it into the virtio 1.0
or later specifications.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: XiaoLei Zhu <leonzzhu@tencent.com>
2024-06-11 20:52:37 +08:00
Peter Zijlstra 66c8ef4a22 cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
commit 32d4fd5751 upstream.

Commit c227233ad6 ("intel_idle: enable interrupts before C1 on
Xeons") wrecked intel_idle in two ways:

 - must not have tracing in idle functions
 - must return with IRQs disabled

Additionally, it added a branch for no good reason.
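A minimal sketch of the pattern the fix restores, following the upstream
patch (the helper names intel_idle_irq() and __intel_idle() are taken
from upstream and may differ in this backport):

    static __cpuidle int intel_idle_irq(struct cpuidle_device *dev,
                                        struct cpuidle_driver *drv, int index)
    {
            int ret;

            raw_local_irq_enable();         /* raw variant: no tracing in idle */
            ret = __intel_idle(dev, drv, index);
            raw_local_irq_disable();        /* return with IRQs disabled */

            return ret;
    }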

Fixes: c227233ad6 ("intel_idle: enable interrupts before C1 on Xeons")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ rjw: Moved the intel_idle() kerneldoc comment next to the function ]
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:29 +08:00
Artem Bityutskiy 6f7b0b5f2b intel_idle: make SPR C1 and C1E be independent
commit 1548fac47a upstream.

This patch partially reverts the changes made by the following commit:

da0e58c038 intel_idle: add 'preferred_cstates' module argument

As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
C1E states were mutually exclusive, so that users could only have either C1 and
C6, or C1E and C6.

However, Intel firmware engineers managed to remove this limitation and make C1
and C1E completely independent, just like on previous Xeon platforms.

Therefore, this patch:
 * Removes commentary describing the old, now non-existent SPR C1E
   limitation.
 * Marks SPR C1E as available by default.
 * Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
   C1E will be available regardless of 'preferred_cstates' value.

We expect that all SPR systems are shipping with new firmware, which includes
the C1/C1E improvement.

Cc: v5.18+ <stable@vger.kernel.org> # v5.18+
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:29 +08:00
Artem Bityutskiy 3d77bca89a intel_idle: Fix the 'preferred_cstates' module parameter
commit 39c184a6a9 upstream.

Problem description.

When user boots kernel up with the 'intel_idle.preferred_cstates=4' option,
we enable C1E and disable C1 states on Sapphire Rapids Xeon (SPR). In order
for C1E to work on SPR, we have to enable the C1E promotion bit on all
CPUs.  However, we enable it only on one CPU.

Fix description.

The 'intel_idle' driver already has the infrastructure for disabling C1E
promotion on every CPU. This patch uses the same infrastructure for
enabling C1E promotion on every CPU. It changes the boolean
'disable_promotion_to_c1e' variable to a tri-state 'c1e_promotion'
variable.
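
A hedged sketch of the tri-state and the per-CPU update it drives
(names per the upstream patch; the exact call site may differ here):

    enum c1e_promotion {
            C1E_PROMOTION_PRESERVE,
            C1E_PROMOTION_ENABLE,
            C1E_PROMOTION_DISABLE
    };

    static enum c1e_promotion c1e_promotion = C1E_PROMOTION_PRESERVE;

    /* Runs from the per-CPU cpuidle init path, hence on every CPU. */
    switch (c1e_promotion) {
    case C1E_PROMOTION_ENABLE:
            c1e_promotion_enable();   /* set the C1E promotion bit in MSR_IA32_POWER_CTL */
            break;
    case C1E_PROMOTION_DISABLE:
            c1e_promotion_disable();  /* clear the bit */
            break;
    default:
            break;                    /* preserve the BIOS setting */
    }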

Tested on a 2-socket SPR system. I verified the following combinations:

 * C1E promotion enabled and disabled in BIOS.
 * Booted with and without the 'intel_idle.preferred_cstates=4' kernel
   argument.

In all 4 cases C1E promotion was correctly set on all CPUs.

Also tested on an old Broadwell system, just to make sure it does not cause
a regression. C1E promotion was correctly disabled on that system, both C1
and C1E were exposed (as expected).

Fixes: da0e58c038 ("intel_idle: add 'preferred_cstates' module argument")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Minor changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy 0f46fd086e intel_idle: Fix SPR C6 optimization
commit 7eac3bd38d upstream.

The Sapphire Rapids (SPR) C6 optimization was added to the end of the
'spr_idle_state_table_update()' function. However, the function has a
'return' which may happen before the optimization has a chance to run.
And this may prevent the optimization from happening.

This is an unlikely scenario, but possible if user boots with, say,
the 'intel_idle.preferred_cstates=6' kernel boot option.

This patch fixes the issue by eliminating the problematic 'return'
statement.
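
For reference, a hedged sketch of the resulting function shape (flag
handling and MSR constants per the upstream patches; the backport may
differ slightly):

    static void __init spr_idle_state_table_update(void)
    {
            unsigned long long msr;

            /* C1/C1E preference handling no longer returns early. */
            if ((preferred_states_mask & BIT(2)) &&
                !(preferred_states_mask & BIT(1))) {
                    spr_cstates[0].flags |= CPUIDLE_FLAG_UNUSABLE;  /* C1 off */
                    spr_cstates[1].flags &= ~CPUIDLE_FLAG_UNUSABLE; /* C1E on */
            }

            /* The C6 optimization now always gets a chance to run. */
            rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
            if ((msr & 0x7) < 2) {          /* package C6 disabled */
                    spr_cstates[2].exit_latency = 190;
                    spr_cstates[2].target_residency = 600;
            }
    }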

Fixes: 3a9cf77b60 ("intel_idle: add core C6 optimization for SPR")
Suggested-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Minor changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy b39482cff5 intel_idle: add core C6 optimization for SPR
commit 3a9cf77b60 upstream.

Add a Sapphire Rapids Xeon C6 optimization, similar to what we have for Sky Lake
Xeon: if package C6 is disabled, adjust C6 exit latency and target residency to
match core C6 values, instead of using the default package C6 values.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy 53e20e738f intel_idle: add 'preferred_cstates' module argument
commit da0e58c038 upstream.

On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
exclusive - only one of them can be enabled. By default, 'intel_idle' driver
enables C1 and disables C1E. However, some users prefer to use C1E instead of
C1, because it saves more energy.

This patch adds a new module parameter ('preferred_cstates') for enabling C1E
and disabling C1. Here is the idea behind it.

1. This option has effect only for "mutually exclusive" C-states like C1 and
   C1E on SPR.
2. It does not have any effect on independent C-states, which do not require
   other C-states to be disabled (most states on most platforms as of today).
3. For mutually exclusive C-states, the 'intel_idle' driver always has a
   reasonable default, such as enabling C1 on SPR by default. On other
   platforms, the default may be different.
4. Users can override the default using the 'preferred_cstates' parameter.
5. The parameter accepts the preferred C-states bit-mask, similarly to the
   existing 'states_off' parameter.
6. This parameter is not limited to C1/C1E, and leaves room for supporting
   other mutually exclusive C-states, if they come in the future.

Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
to disable C1 and enable C1E, users should boot with the following kernel
argument: intel_idle.preferred_cstates=4
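
The parameter declaration amounts to this (upstream naming):

    static unsigned int preferred_states_mask;
    module_param_named(preferred_cstates, preferred_states_mask, uint, 0444);

The value 4 is BIT(2), i.e. it marks the third idle state (C1E on SPR)
as the preferred one.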

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Tony Luck 223b89f1b6 x86/cpu: Add Sapphire Rapids CPU model number
commit be25d1b5ea upstream.

Latest edition (039) of "Intel Architecture Instruction Set Extensions
and Future Features Programming Reference" includes three new CPU model
numbers. Linux already has the two Ice Lake server ones. Add the new
model number for Sapphire Rapids.
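
The change boils down to one new define (value per the upstream commit):

    /* arch/x86/include/asm/intel-family.h */
    #define INTEL_FAM6_SAPPHIRERAPIDS_X     0x8F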

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200603173352.15506-1-tony.luck@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Artem Bityutskiy f79fb80ce5 intel_idle: add SPR support
commit 9edf3c0ffe upstream.

Add Sapphire Rapids Xeon support.

Up until very recently, the C1 and C1E C-states were independent, but this
has changed in some new chips, including Sapphire Rapids Xeon (SPR). In these
chips the C1 and C1E states cannot be enabled at the same time. The "C1E
promotion" bit in 'MSR_IA32_POWER_CTL' also has its semantics changed a bit.

Here are the C1, C1E, and "C1E promotion" bit rules on Xeons before SPR.

1. If C1E promotion bit is disabled.
   a. C1  requests end up with C1  C-state.
   b. C1E requests end up with C1E C-state.
2. If C1E promotion bit is enabled.
   a. C1  requests end up with C1E C-state.
   b. C1E requests end up with C1E C-state.

Here are the C1, C1E, and "C1E promotion" bit rules on Sapphire Rapids Xeon.
1. If C1E promotion bit is disabled.
   a. C1  requests end up with C1 C-state.
   b. C1E requests end up with C1 C-state.
2. If C1E promotion bit is enabled.
   a. C1  requests end up with C1E C-state.
   b. C1E requests end up with C1E C-state.

Before SPR Xeon, the 'intel_idle' driver was disabling C1E promotion and was
exposing C1 and C1E as independent C-states. But on SPR, C1 and C1E cannot be
enabled at the same time.

This patch adds both C1 and C1E states. However, C1E is marked with the
"CPUIDLE_FLAG_UNUSABLE" flag, which means that it won't be registered by
default. The C1E promotion bit will be cleared, which means that by default
only C1 and C6 will be registered on SPR.
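
For illustration, a trimmed sketch of the C1E table entry carrying the
flag (latency numbers per the upstream table):

    {
            .name = "C1E",
            .desc = "MWAIT 0x01",
            .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE |
                     CPUIDLE_FLAG_UNUSABLE,     /* not registered by default */
            .exit_latency = 2,
            .target_residency = 4,
            .enter = &intel_idle,
            .enter_s2idle = intel_idle_s2idle, },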

The next patch will add an option for enabling C1E and disabling C1 on SPR.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Artem Bityutskiy 686588a7bb intel_idle: enable interrupts before C1 on Xeons
commit c227233ad6 upstream.

Enable local interrupts before requesting C1 on the last two generations
of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured
with the 'wult' tool.

The '->enter()' function of the driver enters C-states with local
interrupts disabled by executing the 'monitor' and 'mwait' pair of
instructions. If an interrupt happens, the CPU exits the C-state and
continues executing instructions after 'mwait'. It does not jump to
the interrupt handler, because local interrupts are disabled. The
cpuidle subsystem enables interrupts a bit later, after doing some
housekeeping.

With this patch, we enable local interrupts before requesting C1. In
this case, if the CPU wakes up because of an interrupt, it will jump
to the interrupt handler right away. The cpuidle housekeeping will be
done after the pending interrupt(s) are handled.
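
In code terms, the ->enter() path gains roughly the following (flag
name per the upstream patch):

    if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE)
            local_irq_enable();

    mwait_idle_with_hints(eax, ecx);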

Enabling interrupts before entering a C-state has measurable impact
for faster C-states, like C1. Deeper, but slower C-states like C6 do
not really benefit from this sort of change, because their latency is
a lot higher compared to the delay added by cpuidle housekeeping.

This change was also tested with cyclictest and dbench. In case of Ice
Lake, the average cyclictest latency decreased by 5.1%, and the average
'dbench' throughput increased by about 0.8%. Both tests were run for 4
hours with only C1 enabled (all other idle states, including 'POLL',
were disabled). CPU frequency was pinned to HFM, and uncore frequency
was pinned to the maximum value. The other platforms had similar
single-digit percentage improvements.

It is worth noting that this patch affects 'cpuidle' statistics a tiny
bit.  Before this patch, C1 residency did not include the interrupt
handling time, but with this patch, it will include it. This is similar
to what happens in case of the 'POLL' state, which also runs with
interrupts enabled.

Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Artem Bityutskiy dcaa41dcb4 intel_idle: add Icelake-D support
commit 22141d5f41 upstream.

This patch adds Icelake Xeon D support to the intel_idle driver.

Since the Icelake D and Icelake SP C-state characteristics are the same,
we use the Icelake SP C-states table for Icelake D as well.
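
The change reduces to one match-table line reusing the ICX data
(upstream form shown; the 5.4 backport may spell the match macro
differently):

    X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &idle_cpu_icx),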

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Tom Rix ce6df7b720 intel_idle: remove definition of DEBUG
commit 651bc5816c upstream.

Defining DEBUG should only be done in development.
So remove DEBUG.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Peter Zijlstra 91d50bd49f intel_idle: Build fix
commit 4d916140bf upstream.

Because CONFIG_ soup.

Fixes: 6e1d2bc675 ("intel_idle: Fix intel_idle() vs tracing")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201130115402.GO3040@hirez.programming.kicks-ass.net
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Peter Zijlstra 69a0944244 intel_idle: Fix intel_idle() vs tracing
commit 6e1d2bc675 upstream.

cpuidle->enter() callbacks should not call into tracing because RCU
has already been disabled. Instead of doing the broadcast thing
itself, simply advertise to the cpuidle core that those states stop
the timer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20201123143510.GR3021@hirez.programming.kicks-ass.net
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Chen Yu 24b0b982e5 intel_idle: Fix max_cstate for processor models without C-state tables
commit 4e0ba5577d upstream.

Currently intel_idle driver gets the c-state information from ACPI
_CST if the processor model is not recognized by it. However the
c-state in _CST starts with index 1 which is different from the
index in intel_idle driver's internal c-state table.

While intel_idle_max_cstate_reached() was previously introduced to
deal with intel_idle driver's internal c-state table, re-using
this function directly on _CST is incorrect.

Fix this by subtracting 1 from the index when checking max_cstate
in the _CST case.
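
The fix is effectively a one-character change in the ACPI _CST loop,
per the upstream diff:

    -       if (intel_idle_max_cstate_reached(cstate))
    +       if (intel_idle_max_cstate_reached(cstate - 1))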

For example, append intel_idle.max_cstate=1 in boot command line,
Before the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
POLL
After the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI

Fixes: 18734958e9 ("intel_idle: Use ACPI _CST for processor models without C-state tables")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Mel Gorman dc516a70f4 intel_idle: Ignore _CST if control cannot be taken from the platform
commit 75af76d0a3 upstream.

e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems") avoids
enabling c-states that have been disabled by the platform with the
exception of C1E.

Unfortunately, BIOS implementations are not always consistent in terms
of how capabilities are advertised and control cannot always be handed
over. If control cannot be handed over then intel_idle reports that "ACPI
_CST not found or not usable" but does not clear acpi_state_table.count
meaning the information is still partially used.

This patch ignores ACPI information if CST control cannot be requested from
the platform. This was only observed on a number of Haswell platforms that
had identical CPUs but not identical BIOS versions.  While this problem
may be rare overall, 24 separate test cases bisected to this specific
commit across 4 separate test machines, so it is worth addressing. If the
situation occurs, the kernel behaves as it did before commit e6d4f08a67
and uses any c-states that are discovered.
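
The essence of the fix is to forget the table on that path (hedged
sketch; variable names as used earlier in this message):

    if (!acpi_processor_claim_cst_control())
            break;          /* fall through to the bail-out below */

    /* bail-out path: */
    acpi_state_table.count = 0;     /* do not partially use _CST info */
    pr_debug("ACPI _CST not found or not usable\n");
    return false;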

The affected test cases were all ones that involved a small number of
processes -- exec microbenchmark, pipe microbenchmark, git test suite,
netperf, tbench with one client and system call microbenchmark. Each
case benefits from being able to use turboboost which is prevented if the
lower c-states are unavailable. This may mask real regressions specific
to older hardware so it is worth addressing.

C-state status before and after the patch

5.9.0-vanilla            POLL     latency:0      disabled:0 default:enabled
5.9.0-vanilla            C1       latency:2      disabled:0 default:enabled
5.9.0-vanilla            C1E      latency:10     disabled:0 default:enabled
5.9.0-vanilla            C3       latency:33     disabled:1 default:disabled
5.9.0-vanilla            C6       latency:133    disabled:1 default:disabled
5.9.0-ignore-cst-v1r1    POLL     latency:0      disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C1       latency:2      disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C1E      latency:10     disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C3       latency:33     disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C6       latency:133    disabled:0 default:enabled

Patch enables C3/C6.

Netperf UDP_STREAM

netperf-udp
                                      5.5.0                  5.9.0
                                    vanilla        ignore-cst-v1r1
Hmean     send-64         193.41 (   0.00%)      226.54 *  17.13%*
Hmean     send-128        392.16 (   0.00%)      450.54 *  14.89%*
Hmean     send-256        769.94 (   0.00%)      881.85 *  14.53%*
Hmean     send-1024      2994.21 (   0.00%)     3468.95 *  15.85%*
Hmean     send-2048      5725.60 (   0.00%)     6628.99 *  15.78%*
Hmean     send-3312      8468.36 (   0.00%)    10288.02 *  21.49%*
Hmean     send-4096     10135.46 (   0.00%)    12387.57 *  22.22%*
Hmean     send-8192     17142.07 (   0.00%)    19748.11 *  15.20%*
Hmean     send-16384    28539.71 (   0.00%)    30084.45 *   5.41%*

Fixes: e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Chen Zhuo 5cdab9b63d cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
commit bf9282dc26 upstream.

This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
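
A hedged sketch of the generic side (placement per the commit
description; the exact hunk may differ):

    /* drivers/cpuidle/cpuidle.c, before the state is entered: */
    if (drv->states[index].flags & CPUIDLE_FLAG_TLB_FLUSHED)
            leave_mm(dev->cpu);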

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Rafael J. Wysocki fd63805ee8 intel_idle: Add __initdata annotations to init time variables
commit 7f843dd712 upstream.

Annotate static variables cpuidle_state_table and mwait_substates
with __initdata, because they are only used during the initialization
of the driver.

Also notice that static variable icpu could be annotated analogously
and the structure pointed to by it could be __initconst, but two of
its fields are accessed via icpu in intel_idle_cpu_init() and
auto_demotion_disable(), so introduce two new static variables,
auto_demotion_disable_flags and disable_promotion_to_c1e, to hold
the values of these fields, set them during the initialization and
use them in those functions instead of accessing the source data
structure via icpu.

That allows icpu to be annotated with __initdata, so do that,
and it will also allow some __initconst annotations to be added
subsequently.

No intentional functional impact.
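
Sketch of the annotations and the two new variables described above:

    static struct cpuidle_state *cpuidle_state_table __initdata;
    static unsigned int mwait_substates __initdata;

    /* Cached from icpu so that icpu itself can become __initdata: */
    static unsigned long auto_demotion_disable_flags;
    static bool disable_promotion_to_c1e;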

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Rafael J. Wysocki 8362f4dbb9 intel_idle: Relocate definitions of cpuidle callbacks
commit 30a996fbb3 upstream.

Move the definitions of intel_idle() and intel_idle_s2idle() before
the definitions of cpuidle_state structures referring to them to
avoid having to use additional declarations of them (and drop those
declarations).

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00
Rafael J. Wysocki c26b45ea7d intel_idle: Clean up definitions of cpuidle callbacks
commit bc721c1e45 upstream.

Add proper kerneldoc descriptions to intel_idle() and
intel_idle_s2idle(), annotate the latter with __cpuidle and
reorder the declarations of local variables in both of them to
reflect the mwait_idle_with_hints() arguments order.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00
Rafael J. Wysocki 525c095aab intel_idle: Simplify LAPIC timer reliability checks
commit 40ab82e08d upstream.

The lapic_timer_reliable_states variable really takes only two values
and some arithmetic in intel_idle() related to comparing it with the
target C-state's MWAIT hint value is unnecessary.

Simplify the code by replacing lapic_timer_reliable_states with
a bool variable lapic_timer_always_reliable and dropping the
LAPIC_TIMER_ALWAYS_RELIABLE symbol along with the excess
computations in intel_idle().

While at it, add a comment explaining the branch taken in intel_idle()
if the LAPIC timer is only reliable in C1 and modify the related debug
message in intel_idle_init() accordingly (the modification of this
message is the only expected functional impact of the change made
here).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00
Rafael J. Wysocki 29626393d1 intel_idle: Introduce 'states_off' module parameter
commit 4dcb78ee57 upstream.

In certain system configurations it may not be desirable to use some
C-states assumed to be available by intel_idle and the driver needs
to be prevented from using them even before the cpuidle sysfs
interface becomes accessible to user space.  Currently, the only way
to achieve that is by setting the 'max_cstate' module parameter to a
value lower than the index of the shallowest of the C-states in
question, but that may be overly intrusive, because it effectively
makes all of the idle states deeper than the 'max_cstate' one go
away (and the C-state to avoid may be in the middle of the range
normally regarded as available).

To allow that limitation to be overcome, introduce a new module
parameter called 'states_off' to represent a list of idle states to
be disabled by default in the form of a bitmask and update the
documentation to cover it.
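
Parameter declaration sketch (upstream naming):

    static unsigned int disabled_states_mask;
    module_param_named(states_off, disabled_states_mask, uint, 0444);

For example, booting with intel_idle.states_off=8 disables only the
idle state with index 3.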

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:22 +08:00
Rafael J. Wysocki b84c30be8f intel_idle: Introduce 'use_acpi' module parameter
commit 3a5be9b8f4 upstream.

For diagnostics, it is generally useful to be able to make intel_idle
take the system's ACPI tables into consideration even if that is not
required for the processor model in there, so introduce a new module
parameter, 'use_acpi', to make that happen and update the documentation
to cover it.
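
Declaration sketch (upstream naming):

    static bool force_use_acpi;
    module_param_named(use_acpi, force_use_acpi, bool, 0444);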

While at it, fix the 'no_acpi' module parameter name in the
documentation.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:22 +08:00
Rafael J. Wysocki 25bca34f26 intel_idle: Clean up irtl_2_usec()
commit 86e9466ae6 upstream.

Move the irtl_ns_units[] definition into irtl_2_usec() which is the
only user of it, use div_u64() for the division in there (as the
divisor is small enough) and use the NSEC_PER_USEC symbol for the
divisor.  Also convert the irtl_2_usec() comment to a proper
kerneldoc one.

No intentional functional impact.
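
The resulting helper, per the upstream patch (shown as a sketch; the
5.4 code may differ in details):

    /**
     * irtl_2_usec - IRTL to microseconds conversion.
     * @irtl: IRTL MSR value.
     */
    static unsigned long long __init irtl_2_usec(unsigned long long irtl)
    {
            static const unsigned int irtl_ns_units[] __initconst = {
                    1, 32, 1024, 32768, 1048576, 33554432, 0, 0
            };
            unsigned long long ns;

            if (!irtl)
                    return 0;

            ns = irtl_ns_units[(irtl >> 10) & 0x7];

            return div_u64((irtl & 0x3FF) * ns, NSEC_PER_USEC);
    }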

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:21 +08:00
Rafael J. Wysocki 7d0d3657be intel_idle: Move 3 functions closer to their callers
commit 1aefbd7aeb upstream.

Move intel_idle_verify_cstate(), auto_demotion_disable() and
c1e_promotion_disable() closer to their callers.

While at it, annotate intel_idle_verify_cstate() with __init,
as it is only used during the initialization of the driver.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:21 +08:00
Rafael J. Wysocki 3796f4ee32 intel_idle: Annotate initialization code and data structures
commit 095928ae48 upstream.

Annotate the functions that are only used at the initialization time
with __init and the data structures used by them with __initdata or
__initconst.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:21 +08:00
Rafael J. Wysocki af099e57e6 intel_idle: Rearrange intel_idle_cpuidle_driver_init()
commit 3d3a1ae9b4 upstream.

Notice that intel_idle_state_table_update() only needs to be called
if icpu is not NULL, so fold it into intel_idle_init_cstates_icpu(),
and pass a pointer to the driver object to
intel_idle_cpuidle_driver_init() as an argument instead of
referencing it locally in there.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:20 +08:00
Rafael J. Wysocki c82844ca5f intel_idle: Fold intel_idle_probe() into intel_idle_init()
commit a6c86e3362 upstream.

There is no particular reason why intel_idle_probe() needs to be
a separate function and folding it into intel_idle_init() causes
the code to be somewhat easier to follow, so do just that.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:20 +08:00
Rafael J. Wysocki cd6efdfb7c intel_idle: Eliminate __setup_broadcast_timer()
commit cbd2c4c25d upstream.

The __setup_broadcast_timer() static function is only called in one
place and "true" is passed to it as the argument in there, so
effectively it is a wrapper around tick_broadcast_enable().

To simplify the code, call tick_broadcast_enable() directly instead
of __setup_broadcast_timer() and drop the latter.

No intentional functional impact.
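
For context, the dropped wrapper looked like this (sketch):

    static void __setup_broadcast_timer(bool on)
    {
            if (on)
                    tick_broadcast_enable();
            else
                    tick_broadcast_disable();
    }

With only __setup_broadcast_timer(true) ever called, it collapses to a
direct tick_broadcast_enable().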

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:20 +08:00
Menglong Dong ed011e007b net: tcp: add sysctl_tcp_wnd_shrink
Add 'sysctl_tcp_wnd_shrink' to control enabling and disabling of TCP
window shrink. By default, it is disabled.
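
A hypothetical sketch of the corresponding sysctl table entry (this is
a private patch, so the exact entry, scope, and handler are
assumptions):

    /* e.g. in net/ipv4/sysctl_net_ipv4.c */
    {
            .procname       = "tcp_wnd_shrink",
            .data           = &sysctl_tcp_wnd_shrink,   /* hypothetical */
            .maxlen         = sizeof(int),
            .mode           = 0644,
            .proc_handler   = proc_dointvec,
    },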

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:19 +08:00
mengensun e1bf1991a5 net/tcp: switch to GSO being always on
With GSO always on, TCP write queues have less overhead, which makes
some applications run faster.

A redis-benchmark test was used to verify this:

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
2024-06-11 20:51:19 +08:00
Menglong Dong f0d423d51c net: tcp: raise zero-window probe without checking wnd_end
In the original logic, a zero-window probe can be raised not only on a
zero window, but also in other cases, such as when MTU probing fails.

Therefore, we need to modify tcp_probe0_needed() to keep it compatible
with the original logic.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:19 +08:00
Linus Torvalds 980a335360 mm: make wait_on_page_writeback() wait for multiple pending writebacks
upstream commit: c2407cf7d2

Ever since commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common()
logic") we've had some very occasional reports of BUG_ON(PageWriteback)
in write_cache_pages(), which we thought we already fixed in commit
073861ed77 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

But syzbot just reported another one, even with that commit in place.

And it turns out that there's a simpler way to trigger the BUG_ON() than
the one Hugh found with page re-use.  It all boils down to the fact that
the page writeback is ostensibly serialized by the page lock, but that
isn't actually really true.

Yes, the people _setting_ writeback all do so under the page lock, but
the actual clearing of the bit - and waking up any waiters - happens
without any page lock.

This gives us this fairly simple race condition:

  CPU1 = end previous writeback
  CPU2 = start new writeback under page lock
  CPU3 = write_cache_pages()

  CPU1          CPU2            CPU3
  ----          ----            ----

  end_page_writeback()
    test_clear_page_writeback(page)
    ... delayed...

                lock_page();
                set_page_writeback()
                unlock_page()

                                lock_page()
                                wait_on_page_writeback();

    wake_up_page(page, PG_writeback);
    .. wakes up CPU3 ..

                                BUG_ON(PageWriteback(page));

where the BUG_ON() happens because we woke up the PG_writeback bit
because of the _previous_ writeback, but a new one had already been
started because the clearing of the bit wasn't actually atomic wrt the
actual wakeup or serialized by the page lock.

The reason this didn't use to happen was that the old logic in waiting
on a page bit would just loop if it ever saw the bit set again.

The nice proper fix would probably be to get rid of the whole "wait for
writeback to clear, and then set it" logic in the writeback path, and
replace it with an atomic "wait-to-set" (ie the same as we have for page
locking: we set the page lock bit with a single "lock_page()", not with
"wait for lock bit to clear and then set it").

However, our current model for writeback is that the waiting for the
writeback bit is done by the generic VFS code (ie write_cache_pages()),
but the actual setting of the writeback bit is done much later by the
filesystem ".writepages()" function.

IOW, to make the writeback bit have that same kind of "wait-to-set"
behavior as we have for page locking, we'd have to change our roughly
~50 different writeback functions.  Painful.

Instead, just make "wait_on_page_writeback()" loop on the very unlikely
situation that the PG_writeback bit is still set, basically re-instating
the old behavior.  This is very non-optimal in case of contention, but
since we only ever set the bit under the page lock, that situation is
controlled.
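
The fix itself is tiny; per the upstream commit it re-instates a loop:

    /* mm/page-writeback.c */
    void wait_on_page_writeback(struct page *page)
    {
            while (PageWriteback(page)) {
                    trace_wait_on_page_writeback(page, page_mapping(page));
                    wait_on_page_bit(page, PG_writeback);
            }
    }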

Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
Fixes: 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:18 +08:00
Paolo Bonzini e31375a771 KVM: Do not leak memory for duplicate debugfs directories
commit 85cd39af14 upstream.

KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine.  The directory name is built from the process
pid and a VM fd.  While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:

  [  471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.

When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked.  The slow memory leak was spotted by syzkaller, where it
caused OOM reports.

Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created.  This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.
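
Per the upstream patch, the creation path gains a lookup first (hedged
sketch):

    snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
    dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
    if (dent) {
            /* downgraded and rate-limited vs. debugfs' own message */
            pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
                                dir_name);
            dput(dent);
            return 0;       /* skip stats for this VM instead of leaking */
    }
    dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);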

Cc: stable@vger.kernel.org
Fixes: 536a6f88c4 ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:18 +08:00
Andrew Sy Kim f2f46e7af4 ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
[upstream commit 35dfb01314]

When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.

Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-11 20:51:18 +08:00
Menglong Dong 0fe440f03b net: tcp: handle window shrink properly
Window shrink is not allowed and also not handled for now, but it is
needed in some cases.

In the original logic, the zero-window probe is triggered only when
there is no data in the retransmit queue and the receive window cannot
hold the first packet in the send queue.

Now, let's change it and trigger the zero-window probe in these cases
(see the sketch below):

- the retransmit queue has data and its first packet is not within
  the receive window
- there is no data in the retransmit queue and the first packet in the
  send queue is beyond the end of the receive window
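
A hypothetical sketch of such a check (this is a private patch; the
helper name tcp_probe0_needed() comes from the previous commit in this
series, and the exact condition is an assumption):

    static bool tcp_probe0_needed(const struct sock *sk)
    {
            const struct tcp_sock *tp = tcp_sk(sk);
            struct sk_buff *skb = tcp_rtx_queue_head(sk);   /* retransmit queue first */

            if (!skb)
                    skb = tcp_send_head(sk);                /* then the send queue */

            /* probe if the head packet does not fit into the window */
            return skb && after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp));
    }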

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:17 +08:00
Menglong Dong bb00f4ca4c net: tcp: send zero-window when no memory
For now, the skb is dropped when there is no memory, which makes the
client keep retransmitting until it times out; this is not friendly to
users.

Therefore, we now force the current socket to receive one packet when
protocol memory is over its limit. This socket then stays in 'no mem'
status until protocol memory is available again.

When a socket is in 'no mem' status, its receive window becomes 0,
which means a window shrink happens. The sender needs to handle such
a window shrink properly, which is done in the next commit.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:17 +08:00
caelli 1ea94d5505 driver: update e1000e to 3.8.4
The e1000e driver is updated to 3.8.4 on x86; arm64
still uses 3.2.6.

Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:17 +08:00
Liu Chun 73b70ea3f0 kdump: the capture kernel can't use dma memory
On arm64 systems with little memory below 4G, the capture kernel
cannot use DMA memory. Therefore, it is necessary to enable
CONFIG_KEXEC_FILE and fix the reserved-memory handling so that low
memory is passed to the kdump kernel.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Liu Chun <kaicliu@tencent.com>
2024-06-11 20:51:11 +08:00
Liu Chun d151d105a1 drm: Fixed system hang caused by memory failure
When DMA memory is insufficient, incorrect release of resources
causes the system to hang.

[   35.975823] [TTM] Initializing pool allocator
[   35.980166] [TTM] Initializing DMA pool allocator
[   35.984864] [drm:hibmc_mm_init [hibmc_drm]] *ERROR* Error initializing VRAM MM; -12
[   35.992517] ------------[ cut here ]------------
[   35.997154] WARNING: CPU: 0 PID: 116 at drivers/gpu/drm/drm_modeset_lock.c:266 drm_modeset_lock+0xd8/0xf8 [drm]
[   36.007192] Modules linked in: hibmc_drm(+) drm_vram_helper ttm drm_kms_helper drm autofs4 overlay squashfs
[   36.016890] CPU: 0 PID: 116 Comm: kworker/0:2 Not tainted 5.4.119-0.20230227git9d7d3558a64d.19 #1
[   36.025719] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
[   36.033173] Workqueue: events work_for_cpu_fn
[   36.037510] pstate: a0800009 (NzCv daif -PAN +UAO)
[   36.042297] pc : drm_modeset_lock+0xd8/0xf8 [drm]
[   36.046995] lr : drm_modeset_lock+0x44/0xf8 [drm]
[   36.051676] sp : ffff80005462fc30
[   36.054974] x29: ffff80005462fc30 x28: 0000000000000000
[   36.060260] x27: ffff2057ebe20000 x26: 0000000000000000
[   36.065546] x25: 0000000000000000 x24: ffff80004cf6f8e8
[   36.070833] x23: 0000000000000000 x22: ffff2057f4739800
[   36.076119] x21: ffff800049803908 x20: ffff2057f4739998
[   36.081405] x19: ffff80005462fcc0 x18: 0000000000000010
[   36.086690] x17: 0000000000000000 x16: ffff800048b61b88
[   36.091976] x15: ffffffffffffffff x14: 204d41525620676e
[   36.097261] x13: 697a696c61697469 x12: 6e6920726f727245
[   36.102547] x11: 202a524f5252452a x10: 205d5d6d72645f63
[   36.107832] x9 : ffff800048b61bcc x8 : ffff800048703a60
[   36.113118] x7 : 065448]  work_for_cpu_fn+0x20/0x30
[   36.169181]  process_one_work+0x1f8/0x488
[   36.173173]  worker_thread+0x248/0x528
[   36.176906]  kthread+0x124/0x128
[   36.180121]  ret_from_fork+0x10/0x18
[   36.183679] ---[ end trace aae0476f91651f5d ]---
[   36.188284] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
[   36.197028] Mem abort info:
[   36.199809]   ESR = 0x96000005
[   36.202851]   EC = 0x25: DABT (current EL), IL = 32 bits
[   36.208137]   SET = 0, FnV = 0
[   36.211176]   EA = 0, S1PTW = 0
[   36.214303] Data abort info:
[   36.217172]   ISV = 0, ISS = 0x00000005
[   36.220990]   CM = 0, WnR = 0
[   36.223946] user pgtable: 64k pages, 48-bit VAs, pgdp=00002057df5a0600
[   36.230444] [0000000000000018] pgd=0000000000000000, pud=0000000000000000
[   36.237200] Internal error: Oops: 96000005 [#1] SMP
[   36.242056] Modules linked in: hibmc_drm(+) drm_vram_helper ttm drm_kms_helper drm autofs4 overlay squashfs
[   36.251751] CPU: 0 PID: 116 Comm: kworker/0:2 Tainted: G        W         5.4.119-0.20230227git9d7d3558a64d.19 #1
[   36.261963] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
[   36.269411] Workqueue: events work_for_cpu_fn
[   36.273747] pstate: a0800009 (NzCv daif -PAN +UAO)
[   36.278518] pc : ww_mutex_lock+0x2c/0x70
[   36.282439] lr : drm_modeset_lock+0x44/0xf8 [drm]
[   36.287120] sp : ffff80005462fc20
[   36.290418] x29: ffff80005462fc20 x28: 0000000000000000
[   36.295704] x27: ffff2057ebe20000 x26: 0000000000000000
[   36.300989] x25: 0000000000000000 x24: ffff80004cf6f8e8
[   36.306274] x23: 0000000000000000 x22: ffff2057f4739800
[   36.311560] x21: ffff2057f4739af8 x20: 0000000000000018
[   36.316846] x19: ffff80005462fcc0 x18: 0000000000000010
[   36.322131] x17: 0000000000000000 x16: ffff800048b61b88
[   36.327418] x15: ffffffffffffffff x14: 204d41525620676e
[   36.332703] x13: 697a696c61697469 x12: 6e6920726f727245
[   36.337989] x11: 202a524f5252452a x10: 205d5d6d72645f63
[   36.343274] x9 : ffff800008ce4594 x8 : ffff800048703a60
[   36.348560] x7 : 0000000000000469 x6 : ffff80004998e5e6
[   36.353845] x5 : 0000000000000000 x4 : ffff80005462fcc0
[   36.359131] x3 : 0000000000000018 x2 : ffff2057ebe30000
[   36.364417] x1 : 0000000000000000 x0 : 0000000000000018
[   36.369703] Call trace:
[   36.372138]  ww_mutex_lock+0x2c/0x70
[   36.375712]  drm_modeset_lock+0x44/0xf8 [drm]
[   36.380064]  drm_modeset_lock_all_ctx+0x68/0xf8 [drm]
[   36.385100]  drm_atomic_helper_shutdown+0x54/0xd0 [drm_kms_helper]
[   36.391251]  hibmc_unload+0x2c/0xa8 [hibmc_drm]
[   36.395762]  hibmc_pci_probe+0x318/0x430 [hibmc_drm]
[   36.400703]  local_pci_probe+0x44/0xa8
[   36.404435]  work_for_cpu_fn+0x20/0x30
[   36.408167]  process_one_work+0x1f8/0x488
[   36.412158]  worker_thread+0x248/0x528
[   36.415890]  kthread+0x124/0x128
[   36.419103]  ret_from_fork+0x10/0x18
[   36.422662] Code: d503201f d503201f d2800001 aa0103e5 (c8e57c02)
[   36.428727] ---[ end trace aae0476f91651f5e ]---
[   37.169300] systemd-udevd[307]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.

Signed-off-by: Chun Liu <kaicliu@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song 1b7d9fa70f arm64: kexec_file: add crash dump support
Upstream: 3751e728ce
Link: 40e94ab32e

commit 3751e728ce
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date:   Mon Dec 16 11:12:47 2019 +0900

    arm64: kexec_file: add crash dump support

    Enabling crash dump (kdump) includes
    * prepare contents of ELF header of a core dump file, /proc/vmcore,
      using crash_prepare_elf64_headers(), and
    * add two device tree properties, "linux,usable-memory-range" and
      "linux,elfcorehdr", which represent respectively a memory range
      to be used by crash dump kernel and the header's location

    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Reviewed-by: James Morse <james.morse@arm.com>
    Tested-and-reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song b7e9b568c2 libfdt: include fdt_addresses.c
Upstream: c273a2bd8a
Link: 887436bdb7

commit c273a2bd8a
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date:   Mon Dec 9 12:03:44 2019 +0900

    libfdt: include fdt_addresses.c

    In the implementation of kexec_file_loaded-based kdump for arm64,
    fdt_appendprop_addrrange() will be needed.

    So include fdt_addresses.c in making libfdt.

    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Rob Herring <robh+dt@kernel.org>
    Cc: Frank Rowand <frowand.list@gmail.com>
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song 6fb78c4cc6 arm64: kdump: remove dependency on arm64_dma32_phys_limit
From: Yi Li <adamliyi@msn.com>
Link: 696027f109

The patch b2da6ad294
(arm64: kdump: reimplement crashkernel=X) depends on commit 1a8e1cef76
("arm64: use both ZONE_DMA and ZONE_DMA32").

Commit 1a8e1cef76 is not ported to the 5.4 kernel, so use arm64_dma_phys_limit instead.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song fd02a1b5bc kdump: update Documentation about crashkernel
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 023deaec32

For arm64, the behavior of crashkernel=X has been changed: it tries a
low allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA
is disabled) and falls back to a high allocation if that fails.

We can also use "crashkernel=X,high" to select a high region above
DMA zone, which also tries to allocate at least 256M low memory in
DMA zone automatically (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).

"crashkernel=Y,low" can be used to allocate specified size low memory.

So update the Documentation.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song 3fd41ff677 arm64: kdump: add memory for devices by DT property linux,usable-memory-range
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 2012a3b392

When reserving crashkernel in high memory, some low memory is reserved
for crash dump kernel devices and never mapped by the first kernel.
This memory range is advertised to crash dump kernel via DT property
under /chosen,
	linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>

We reused the DT property linux,usable-memory-range and made the low
memory region the second range "BASE2 SIZE2", which keeps compatibility
with existing user-space and older kdump kernels.

Crash dump kernel reads this property at boot time and calls memblock_add()
to add the low memory region after memblock_cap_memory_range() has been
called.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song d001dccf2b x86, arm64: Add ARCH_WANT_RESERVE_CRASH_KERNEL config
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: c8013ee6cd

We make the functions reserve_crashkernel[_low]() generic for
x86 and arm64. Since the reserve_crashkernel[_low]() implementations
are quite similar on other architectures as well, we can have more
users of this later.

So have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL in arch/Kconfig and
select it from X86 and ARM64.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song bd482067c3 arm64: kdump: reimplement crashkernel=X
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 70e586365f

There are following issues in arm64 kdump:
1. We use crashkernel=X to reserve the crash kernel below 4G, which
fails when there is not enough low memory.
2. If the crash kernel is reserved above 4G, the crash dump kernel
fails to boot because there is no low memory available for
allocation.
3. Since commit 1a8e1cef76 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
devices in the crash dump kernel that need ZONE_DMA will fail to
allocate.

To solve these issues, change the behavior of crashkernel=X and
introduce crashkernel=X,[high,low]. crashkernel=X tries a low
allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is
disabled) and falls back to a high allocation if that fails.
We can also use "crashkernel=X,high" to select a region above DMA zone,
which also tries to allocate at least 256M in DMA zone automatically
(or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
"crashkernel=Y,low" can be used to allocate specified size low memory.

As another minor change, there may now be two regions reserved for the
crash dump kernel; to distinguish the low region from the high one
without affecting existing kexec-tools, rename the low region
"Crash kernel (low)".

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song f30355b620 arm64: kdump: introduce some macros for crash kernel reservation
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 667118f8c1

Introduce the macro CRASH_ALIGN for alignment, CRASH_ADDR_LOW_MAX for
the upper bound of low crash memory, and CRASH_ADDR_HIGH_MAX for the
upper bound of high crash memory, and use these macros instead.

Besides, to keep consistent with x86, use CRASH_ALIGN as the lower
bound of the crash kernel reservation.
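
Illustrative shape of the macros (values are an assumption; per the
earlier patch in this series, the low bound on this 5.4 backport uses
arm64_dma_phys_limit):

    #define CRASH_ALIGN             SZ_2M
    #define CRASH_ADDR_LOW_MAX      arm64_dma_phys_limit
    #define CRASH_ADDR_HIGH_MAX     MEMBLOCK_ALLOC_ACCESSIBLE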

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song 21ff8ff8f3 x86/elf: Move vmcore_elf_check_arch_cross to arch/x86/include/asm/elf.h
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: b332ab8970

Move macro vmcore_elf_check_arch_cross from arch/x86/include/asm/kexec.h
to arch/x86/include/asm/elf.h to fix the following compiling warning:

In file included from arch/x86/kernel/setup.c:39:0:
./arch/x86/include/asm/kexec.h:77:0: warning: "vmcore_elf_check_arch_cross" redefined
 # define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)

In file included from arch/x86/kernel/setup.c:9:0:
./include/linux/crash_dump.h:39:0: note: this is the location of the previous definition
 #define vmcore_elf_check_arch_cross(x) 0

The root cause is that vmcore_elf_check_arch_cross under CONFIG_CRASH_CORE
depends on CONFIG_KEXEC_CORE. Commit 532b66d2279d ("x86: kdump: move
reserve_crashkernel[_low]() into crash_core.c") triggered the issue.

As suggested by Mike, simply move vmcore_elf_check_arch_cross from
arch/x86/include/asm/kexec.h to arch/x86/include/asm/elf.h to fix
the warning.
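
That is, the macro now lives in elf.h (definition exactly as quoted in
the warning above):

    /* arch/x86/include/asm/elf.h */
    #define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)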

Fixes: 532b66d2279d ("x86: kdump: move reserve_crashkernel[_low]() into crash_core.c")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:59 +08:00