Commit Graph

873692 Commits

Author SHA1 Message Date
Alex Shi c0ddf78f1f dist: release 5.4.119-20.0009.28
Upstream: no

Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:39 +08:00
Christoph Hellwig b868ff6759 virtio-blk: remove VIRTIO_BLK_F_SCSI support
[upstream commit: 782e067dba]

Since the need for a special flag to support SCSI passthrough on a
block device was added in May 2017 the SCSI passthrough support in
virtio-blk has been disabled.  It has always been a bad idea
(just ask the original author..) and we have virtio-scsi for proper
passthrough.  The feature also never made it into the virtio 1.0
or later specifications.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: XiaoLei Zhu <leonzzhu@tencent.com>
2024-06-11 20:52:37 +08:00
Peter Zijlstra 66c8ef4a22 cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
commit 32d4fd5751 upstream.

Commit c227233ad6 ("intel_idle: enable interrupts before C1 on
Xeons") wrecked intel_idle in two ways:

 - must not have tracing in idle functions
 - must return with IRQs disabled

Additionally, it added a branch for no good reason.
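A minimal sketch of the pattern the fix restores, following the upstream
patch (the helper names intel_idle_irq() and __intel_idle() are taken
from upstream and may differ in this backport):

    static __cpuidle int intel_idle_irq(struct cpuidle_device *dev,
                                        struct cpuidle_driver *drv, int index)
    {
            int ret;

            raw_local_irq_enable();         /* raw variant: no tracing in idle */
            ret = __intel_idle(dev, drv, index);
            raw_local_irq_disable();        /* return with IRQs disabled */

            return ret;
    }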

Fixes: c227233ad6 ("intel_idle: enable interrupts before C1 on Xeons")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ rjw: Moved the intel_idle() kerneldoc comment next to the function ]
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:29 +08:00
Artem Bityutskiy 6f7b0b5f2b intel_idle: make SPR C1 and C1E be independent
commit 1548fac47a upstream.

This patch partially reverts the changes made by the following commit:

da0e58c038 intel_idle: add 'preferred_cstates' module argument

As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
C1E states were mutually exclusive, so that users could only have either C1 and
C6, or C1E and C6.

However, Intel firmware engineers managed to remove this limitation and make C1
and C1E completely independent, just like on previous Xeon platforms.

Therefore, this patch:
 * Removes commentary describing the old, now non-existent SPR C1E
   limitation.
 * Marks SPR C1E as available by default.
 * Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
   C1E will be available regardless of 'preferred_cstates' value.

We expect that all SPR systems are shipping with new firmware, which includes
the C1/C1E improvement.

Cc: v5.18+ <stable@vger.kernel.org> # v5.18+
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:29 +08:00
Artem Bityutskiy 3d77bca89a intel_idle: Fix the 'preferred_cstates' module parameter
commit 39c184a6a9 upstream.

Problem description.

When user boots kernel up with the 'intel_idle.preferred_cstates=4' option,
we enable C1E and disable C1 states on Sapphire Rapids Xeon (SPR). In order
for C1E to work on SPR, we have to enable the C1E promotion bit on all
CPUs.  However, we enable it only on one CPU.

Fix description.

The 'intel_idle' driver already has the infrastructure for disabling C1E
promotion on every CPU. This patch uses the same infrastructure for
enabling C1E promotion on every CPU. It changes the boolean
'disable_promotion_to_c1e' variable to a tri-state 'c1e_promotion'
variable.
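
A hedged sketch of the tri-state and the per-CPU update it drives
(names per the upstream patch; the exact call site may differ here):

    enum c1e_promotion {
            C1E_PROMOTION_PRESERVE,
            C1E_PROMOTION_ENABLE,
            C1E_PROMOTION_DISABLE
    };

    static enum c1e_promotion c1e_promotion = C1E_PROMOTION_PRESERVE;

    /* Runs from the per-CPU cpuidle init path, hence on every CPU. */
    switch (c1e_promotion) {
    case C1E_PROMOTION_ENABLE:
            c1e_promotion_enable();   /* set the C1E promotion bit in MSR_IA32_POWER_CTL */
            break;
    case C1E_PROMOTION_DISABLE:
            c1e_promotion_disable();  /* clear the bit */
            break;
    default:
            break;                    /* preserve the BIOS setting */
    }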

Tested on a 2-socket SPR system. I verified the following combinations:

 * C1E promotion enabled and disabled in BIOS.
 * Booted with and without the 'intel_idle.preferred_cstates=4' kernel
   argument.

In all 4 cases C1E promotion was correctly set on all CPUs.

Also tested on an old Broadwell system, just to make sure it does not cause
a regression. C1E promotion was correctly disabled on that system, both C1
and C1E were exposed (as expected).

Fixes: da0e58c038 ("intel_idle: add 'preferred_cstates' module argument")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Minor changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy 0f46fd086e intel_idle: Fix SPR C6 optimization
commit 7eac3bd38d upstream.

The Sapphire Rapids (SPR) C6 optimization was added to the end of the
'spr_idle_state_table_update()' function. However, the function has a
'return' which may happen before the optimization has a chance to run.
And this may prevent the optimization from happening.

This is an unlikely scenario, but possible if user boots with, say,
the 'intel_idle.preferred_cstates=6' kernel boot option.

This patch fixes the issue by eliminating the problematic 'return'
statement.
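
For reference, a hedged sketch of the resulting function shape (flag
handling and MSR constants per the upstream patches; the backport may
differ slightly):

    static void __init spr_idle_state_table_update(void)
    {
            unsigned long long msr;

            /* C1/C1E preference handling no longer returns early. */
            if ((preferred_states_mask & BIT(2)) &&
                !(preferred_states_mask & BIT(1))) {
                    spr_cstates[0].flags |= CPUIDLE_FLAG_UNUSABLE;  /* C1 off */
                    spr_cstates[1].flags &= ~CPUIDLE_FLAG_UNUSABLE; /* C1E on */
            }

            /* The C6 optimization now always gets a chance to run. */
            rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
            if ((msr & 0x7) < 2) {          /* package C6 disabled */
                    spr_cstates[2].exit_latency = 190;
                    spr_cstates[2].target_residency = 600;
            }
    }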

Fixes: 3a9cf77b60 ("intel_idle: add core C6 optimization for SPR")
Suggested-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Minor changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy b39482cff5 intel_idle: add core C6 optimization for SPR
commit 3a9cf77b60 upstream.

Add a Sapphire Rapids Xeon C6 optimization, similar to what we have for Sky Lake
Xeon: if package C6 is disabled, adjust C6 exit latency and target residency to
match core C6 values, instead of using the default package C6 values.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy 53e20e738f intel_idle: add 'preferred_cstates' module argument
commit da0e58c038 upstream.

On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
exclusive - only one of them can be enabled. By default, 'intel_idle' driver
enables C1 and disables C1E. However, some users prefer to use C1E instead of
C1, because it saves more energy.

This patch adds a new module parameter ('preferred_cstates') for enabling C1E
and disabling C1. Here is the idea behind it.

1. This option has effect only for "mutually exclusive" C-states like C1 and
   C1E on SPR.
2. It does not have any effect on independent C-states, which do not require
   other C-states to be disabled (most states on most platforms as of today).
3. For mutually exclusive C-states, the 'intel_idle' driver always has a
   reasonable default, such as enabling C1 on SPR by default. On other
   platforms, the default may be different.
4. Users can override the default using the 'preferred_cstates' parameter.
5. The parameter accepts the preferred C-states bit-mask, similarly to the
   existing 'states_off' parameter.
6. This parameter is not limited to C1/C1E, and leaves room for supporting
   other mutually exclusive C-states, if they come in the future.

Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
to disable C1 and enable C1E, users should boot with the following kernel
argument: intel_idle.preferred_cstates=4
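
The parameter declaration amounts to this (upstream naming):

    static unsigned int preferred_states_mask;
    module_param_named(preferred_cstates, preferred_states_mask, uint, 0444);

The value 4 is BIT(2), i.e. it marks the third idle state (C1E on SPR)
as the preferred one.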

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Tony Luck 223b89f1b6 x86/cpu: Add Sapphire Rapids CPU model number
commit be25d1b5ea upstream.

Latest edition (039) of "Intel Architecture Instruction Set Extensions
and Future Features Programming Reference" includes three new CPU model
numbers. Linux already has the two Ice Lake server ones. Add the new
model number for Sapphire Rapids.
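
The change boils down to one new define (value per the upstream commit):

    /* arch/x86/include/asm/intel-family.h */
    #define INTEL_FAM6_SAPPHIRERAPIDS_X     0x8F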

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200603173352.15506-1-tony.luck@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Artem Bityutskiy f79fb80ce5 intel_idle: add SPR support
commit 9edf3c0ffe upstream.

Add Sapphire Rapids Xeon support.

Up until very recently, the C1 and C1E C-states were independent, but this
has changed in some new chips, including Sapphire Rapids Xeon (SPR). In these
chips the C1 and C1E states cannot be enabled at the same time. The "C1E
promotion" bit in 'MSR_IA32_POWER_CTL' also has its semantics changed a bit.

Here are the C1, C1E, and "C1E promotion" bit rules on Xeons before SPR.

1. If C1E promotion bit is disabled.
   a. C1  requests end up with C1  C-state.
   b. C1E requests end up with C1E C-state.
2. If C1E promotion bit is enabled.
   a. C1  requests end up with C1E C-state.
   b. C1E requests end up with C1E C-state.

Here are the C1, C1E, and "C1E promotion" bit rules on Sapphire Rapids Xeon.
1. If C1E promotion bit is disabled.
   a. C1  requests end up with C1 C-state.
   b. C1E requests end up with C1 C-state.
2. If C1E promotion bit is enabled.
   a. C1  requests end up with C1E C-state.
   b. C1E requests end up with C1E C-state.

Before SPR Xeon, the 'intel_idle' driver was disabling C1E promotion and was
exposing C1 and C1E as independent C-states. But on SPR, C1 and C1E cannot be
enabled at the same time.

This patch adds both C1 and C1E states. However, C1E is marked with the
"CPUIDLE_FLAG_UNUSABLE" flag, which means that it won't be registered by
default. The C1E promotion bit will be cleared, which means that by default
only C1 and C6 will be registered on SPR.
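
For illustration, a trimmed sketch of the C1E table entry carrying the
flag (latency numbers per the upstream table):

    {
            .name = "C1E",
            .desc = "MWAIT 0x01",
            .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE |
                     CPUIDLE_FLAG_UNUSABLE,     /* not registered by default */
            .exit_latency = 2,
            .target_residency = 4,
            .enter = &intel_idle,
            .enter_s2idle = intel_idle_s2idle, },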

The next patch will add an option for enabling C1E and disabling C1 on SPR.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Artem Bityutskiy 686588a7bb intel_idle: enable interrupts before C1 on Xeons
commit c227233ad6 upstream.

Enable local interrupts before requesting C1 on the last two generations
of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured
with the 'wult' tool.

The '->enter()' function of the driver enters C-states with local
interrupts disabled by executing the 'monitor' and 'mwait' pair of
instructions. If an interrupt happens, the CPU exits the C-state and
continues executing instructions after 'mwait'. It does not jump to
the interrupt handler, because local interrupts are disabled. The
cpuidle subsystem enables interrupts a bit later, after doing some
housekeeping.

With this patch, we enable local interrupts before requesting C1. In
this case, if the CPU wakes up because of an interrupt, it will jump
to the interrupt handler right away. The cpuidle housekeeping will be
done after the pending interrupt(s) are handled.
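
In code terms, the ->enter() path gains roughly the following (flag
name per the upstream patch):

    if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE)
            local_irq_enable();

    mwait_idle_with_hints(eax, ecx);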

Enabling interrupts before entering a C-state has measurable impact
for faster C-states, like C1. Deeper, but slower C-states like C6 do
not really benefit from this sort of change, because their latency is
a lot higher compared to the delay added by cpuidle housekeeping.

This change was also tested with cyclictest and dbench. In case of Ice
Lake, the average cyclictest latency decreased by 5.1%, and the average
'dbench' throughput increased by about 0.8%. Both tests were run for 4
hours with only C1 enabled (all other idle states, including 'POLL',
were disabled). CPU frequency was pinned to HFM, and uncore frequency
was pinned to the maximum value. The other platforms had similar
single-digit percentage improvements.

It is worth noting that this patch affects 'cpuidle' statistics a tiny
bit.  Before this patch, C1 residency did not include the interrupt
handling time, but with this patch, it will include it. This is similar
to what happens in case of the 'POLL' state, which also runs with
interrupts enabled.

Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Artem Bityutskiy dcaa41dcb4 intel_idle: add Icelake-D support
commit 22141d5f41 upstream.

This patch adds Icelake Xeon D support to the intel_idle driver.

Since the Icelake D and Icelake SP C-state characteristics are the same,
we use the Icelake SP C-states table for Icelake D as well.
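
The change reduces to one match-table line reusing the ICX data
(upstream form shown; the 5.4 backport may spell the match macro
differently):

    X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &idle_cpu_icx),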

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Tom Rix ce6df7b720 intel_idle: remove definition of DEBUG
commit 651bc5816c upstream.

Defining DEBUG should only be done in development.
So remove DEBUG.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Peter Zijlstra 91d50bd49f intel_idle: Build fix
commit 4d916140bf upstream.

Because CONFIG_ soup.

Fixes: 6e1d2bc675 ("intel_idle: Fix intel_idle() vs tracing")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201130115402.GO3040@hirez.programming.kicks-ass.net
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Peter Zijlstra 69a0944244 intel_idle: Fix intel_idle() vs tracing
commit 6e1d2bc675 upstream.

cpuidle->enter() callbacks should not call into tracing because RCU
has already been disabled. Instead of doing the broadcast thing
itself, simply advertise to the cpuidle core that those states stop
the timer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20201123143510.GR3021@hirez.programming.kicks-ass.net
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Chen Yu 24b0b982e5 intel_idle: Fix max_cstate for processor models without C-state tables
commit 4e0ba5577d upstream.

Currently intel_idle driver gets the c-state information from ACPI
_CST if the processor model is not recognized by it. However the
c-state in _CST starts with index 1 which is different from the
index in intel_idle driver's internal c-state table.

While intel_idle_max_cstate_reached() was previously introduced to
deal with intel_idle driver's internal c-state table, re-using
this function directly on _CST is incorrect.

Fix this by subtracting 1 from the index when checking max_cstate
in the _CST case.
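
The fix is effectively a one-character change in the ACPI _CST loop,
per the upstream diff:

    -       if (intel_idle_max_cstate_reached(cstate))
    +       if (intel_idle_max_cstate_reached(cstate - 1))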

For example, append intel_idle.max_cstate=1 in boot command line,
Before the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
POLL
After the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI

Fixes: 18734958e9 ("intel_idle: Use ACPI _CST for processor models without C-state tables")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Mel Gorman dc516a70f4 intel_idle: Ignore _CST if control cannot be taken from the platform
commit 75af76d0a3 upstream.

e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems") avoids
enabling c-states that have been disabled by the platform with the
exception of C1E.

Unfortunately, BIOS implementations are not always consistent in terms
of how capabilities are advertised and control cannot always be handed
over. If control cannot be handed over then intel_idle reports that "ACPI
_CST not found or not usable" but does not clear acpi_state_table.count
meaning the information is still partially used.

This patch ignores ACPI information if CST control cannot be requested from
the platform. This was only observed on a number of Haswell platforms that
had identical CPUs but not identical BIOS versions.  While this problem
may be rare overall, 24 separate test cases bisected to this specific
commit across 4 separate test machines, so it is worth addressing. If the
situation occurs, the kernel behaves as it did before commit e6d4f08a67
and uses any c-states that are discovered.
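
The essence of the fix is to forget the table on that path (hedged
sketch; variable names as used earlier in this message):

    if (!acpi_processor_claim_cst_control())
            break;          /* fall through to the bail-out below */

    /* bail-out path: */
    acpi_state_table.count = 0;     /* do not partially use _CST info */
    pr_debug("ACPI _CST not found or not usable\n");
    return false;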

The affected test cases were all ones that involved a small number of
processes -- exec microbenchmark, pipe microbenchmark, git test suite,
netperf, tbench with one client and system call microbenchmark. Each
case benefits from being able to use turboboost which is prevented if the
lower c-states are unavailable. This may mask real regressions specific
to older hardware so it is worth addressing.

C-state status before and after the patch

5.9.0-vanilla            POLL     latency:0      disabled:0 default:enabled
5.9.0-vanilla            C1       latency:2      disabled:0 default:enabled
5.9.0-vanilla            C1E      latency:10     disabled:0 default:enabled
5.9.0-vanilla            C3       latency:33     disabled:1 default:disabled
5.9.0-vanilla            C6       latency:133    disabled:1 default:disabled
5.9.0-ignore-cst-v1r1    POLL     latency:0      disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C1       latency:2      disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C1E      latency:10     disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C3       latency:33     disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C6       latency:133    disabled:0 default:enabled

Patch enables C3/C6.

Netperf UDP_STREAM

netperf-udp
                                      5.5.0                  5.9.0
                                    vanilla        ignore-cst-v1r1
Hmean     send-64         193.41 (   0.00%)      226.54 *  17.13%*
Hmean     send-128        392.16 (   0.00%)      450.54 *  14.89%*
Hmean     send-256        769.94 (   0.00%)      881.85 *  14.53%*
Hmean     send-1024      2994.21 (   0.00%)     3468.95 *  15.85%*
Hmean     send-2048      5725.60 (   0.00%)     6628.99 *  15.78%*
Hmean     send-3312      8468.36 (   0.00%)    10288.02 *  21.49%*
Hmean     send-4096     10135.46 (   0.00%)    12387.57 *  22.22%*
Hmean     send-8192     17142.07 (   0.00%)    19748.11 *  15.20%*
Hmean     send-16384    28539.71 (   0.00%)    30084.45 *   5.41%*

Fixes: e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Chen Zhuo 5cdab9b63d cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
commit bf9282dc26 upstream.

This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
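
A hedged sketch of the generic side (placement per the commit
description; the exact hunk may differ):

    /* drivers/cpuidle/cpuidle.c, before the state is entered: */
    if (drv->states[index].flags & CPUIDLE_FLAG_TLB_FLUSHED)
            leave_mm(dev->cpu);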

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Rafael J. Wysocki fd63805ee8 intel_idle: Add __initdata annotations to init time variables
commit 7f843dd712 upstream.

Annotate static variables cpuidle_state_table and mwait_substates
with __initdata, because they are only used during the initialization
of the driver.

Also notice that static variable icpu could be annotated analogously
and the structure pointed to by it could be __initconst, but two of
its fields are accessed via icpu in intel_idle_cpu_init() and
auto_demotion_disable(), so introduce two new static variables,
auto_demotion_disable_flags and disable_promotion_to_c1e, to hold
the values of these fields, set them during the initialization and
use them in those functions instead of accessing the source data
structure via icpu.

That allows icpu to be annotated with __initdata, so do that,
and it will also allow some __initconst annotations to be added
subsequently.

No intentional functional impact.
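
Sketch of the annotations and the two new variables described above:

    static struct cpuidle_state *cpuidle_state_table __initdata;
    static unsigned int mwait_substates __initdata;

    /* Cached from icpu so that icpu itself can become __initdata: */
    static unsigned long auto_demotion_disable_flags;
    static bool disable_promotion_to_c1e;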

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Rafael J. Wysocki 8362f4dbb9 intel_idle: Relocate definitions of cpuidle callbacks
commit 30a996fbb3 upstream.

Move the definitions of intel_idle() and intel_idle_s2idle() before
the definitions of cpuidle_state structures referring to them to
avoid having to use additional declarations of them (and drop those
declarations).

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00
Rafael J. Wysocki c26b45ea7d intel_idle: Clean up definitions of cpuidle callbacks
commit bc721c1e45 upstream.

Add proper kerneldoc descriptions to intel_idle() and
intel_idle_s2idle(), annotate the latter with __cpuidle and
reorder the declarations of local variables in both of them to
reflect the mwait_idle_with_hints() arguments order.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00
Rafael J. Wysocki 525c095aab intel_idle: Simplify LAPIC timer reliability checks
commit 40ab82e08d upstream.

The lapic_timer_reliable_states variable really takes only two values
and some arithmetic in intel_idle() related to comparing it with the
target C-state's MWAIT hint value is unnecessary.

Simplify the code by replacing lapic_timer_reliable_states with
a bool variable lapic_timer_always_reliable and dropping the
LAPIC_TIMER_ALWAYS_RELIABLE symbol along with the excess
computations in intel_idle().

While at it, add a comment explaining the branch taken in intel_idle()
if the LAPIC timer is only reliable in C1 and modify the related debug
message in intel_idle_init() accordingly (the modification of this
message is the only expected functional impact of the change made
here).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00
Rafael J. Wysocki 29626393d1 intel_idle: Introduce 'states_off' module parameter
commit 4dcb78ee57 upstream.

In certain system configurations it may not be desirable to use some
C-states assumed to be available by intel_idle and the driver needs
to be prevented from using them even before the cpuidle sysfs
interface becomes accessible to user space.  Currently, the only way
to achieve that is by setting the 'max_cstate' module parameter to a
value lower than the index of the shallowest of the C-states in
question, but that may be overly intrusive, because it effectively
makes all of the idle states deeper than the 'max_cstate' one go
away (and the C-state to avoid may be in the middle of the range
normally regarded as available).

To allow that limitation to be overcome, introduce a new module
parameter called 'states_off' to represent a list of idle states to
be disabled by default in the form of a bitmask and update the
documentation to cover it.
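
Parameter declaration sketch (upstream naming):

    static unsigned int disabled_states_mask;
    module_param_named(states_off, disabled_states_mask, uint, 0444);

For example, booting with intel_idle.states_off=8 disables only the
idle state with index 3.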

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:22 +08:00
Rafael J. Wysocki b84c30be8f intel_idle: Introduce 'use_acpi' module parameter
commit 3a5be9b8f4 upstream.

For diagnostics, it is generally useful to be able to make intel_idle
take the system's ACPI tables into consideration even if that is not
required for the processor model in there, so introduce a new module
parameter, 'use_acpi', to make that happen and update the documentation
to cover it.
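
Declaration sketch (upstream naming):

    static bool force_use_acpi;
    module_param_named(use_acpi, force_use_acpi, bool, 0444);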

While at it, fix the 'no_acpi' module parameter name in the
documentation.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:22 +08:00
Rafael J. Wysocki 25bca34f26 intel_idle: Clean up irtl_2_usec()
commit 86e9466ae6 upstream.

Move the irtl_ns_units[] definition into irtl_2_usec() which is the
only user of it, use div_u64() for the division in there (as the
divisor is small enough) and use the NSEC_PER_USEC symbol for the
divisor.  Also convert the irtl_2_usec() comment to a proper
kerneldoc one.

No intentional functional impact.
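
The resulting helper, per the upstream patch (shown as a sketch; the
5.4 code may differ in details):

    /**
     * irtl_2_usec - IRTL to microseconds conversion.
     * @irtl: IRTL MSR value.
     */
    static unsigned long long __init irtl_2_usec(unsigned long long irtl)
    {
            static const unsigned int irtl_ns_units[] __initconst = {
                    1, 32, 1024, 32768, 1048576, 33554432, 0, 0
            };
            unsigned long long ns;

            if (!irtl)
                    return 0;

            ns = irtl_ns_units[(irtl >> 10) & 0x7];

            return div_u64((irtl & 0x3FF) * ns, NSEC_PER_USEC);
    }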

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:21 +08:00
Rafael J. Wysocki 7d0d3657be intel_idle: Move 3 functions closer to their callers
commit 1aefbd7aeb upstream.

Move intel_idle_verify_cstate(), auto_demotion_disable() and
c1e_promotion_disable() closer to their callers.

While at it, annotate intel_idle_verify_cstate() with __init,
as it is only used during the initialization of the driver.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:21 +08:00
Rafael J. Wysocki 3796f4ee32 intel_idle: Annotate initialization code and data structures
commit 095928ae48 upstream.

Annotate the functions that are only used at the initialization time
with __init and the data structures used by them with __initdata or
__initconst.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:21 +08:00
Rafael J. Wysocki af099e57e6 intel_idle: Rearrange intel_idle_cpuidle_driver_init()
commit 3d3a1ae9b4 upstream.

Notice that intel_idle_state_table_update() only needs to be called
if icpu is not NULL, so fold it into intel_idle_init_cstates_icpu(),
and pass a pointer to the driver object to
intel_idle_cpuidle_driver_init() as an argument instead of
referencing it locally in there.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:20 +08:00
Rafael J. Wysocki c82844ca5f intel_idle: Fold intel_idle_probe() into intel_idle_init()
commit a6c86e3362 upstream.

There is no particular reason why intel_idle_probe() needs to be
a separate function and folding it into intel_idle_init() causes
the code to be somewhat easier to follow, so do just that.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:20 +08:00
Rafael J. Wysocki cd6efdfb7c intel_idle: Eliminate __setup_broadcast_timer()
commit cbd2c4c25d upstream.

The __setup_broadcast_timer() static function is only called in one
place and "true" is passed to it as the argument in there, so
effectively it is a wrapper around tick_broadcast_enable().

To simplify the code, call tick_broadcast_enable() directly instead
of __setup_broadcast_timer() and drop the latter.

No intentional functional impact.
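
For context, the dropped wrapper looked like this (sketch):

    static void __setup_broadcast_timer(bool on)
    {
            if (on)
                    tick_broadcast_enable();
            else
                    tick_broadcast_disable();
    }

With only __setup_broadcast_timer(true) ever called, it collapses to a
direct tick_broadcast_enable().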

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:20 +08:00
Menglong Dong ed011e007b net: tcp: add sysctl_tcp_wnd_shrink
Add 'sysctl_tcp_wnd_shrink' to control enabling and disabling of TCP
window shrink. By default, it is disabled.
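
A hypothetical sketch of the corresponding sysctl table entry (this is
a private patch, so the exact entry, scope, and handler are
assumptions):

    /* e.g. in net/ipv4/sysctl_net_ipv4.c */
    {
            .procname       = "tcp_wnd_shrink",
            .data           = &sysctl_tcp_wnd_shrink,   /* hypothetical */
            .maxlen         = sizeof(int),
            .mode           = 0644,
            .proc_handler   = proc_dointvec,
    },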

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:19 +08:00
mengensun e1bf1991a5 net/tcp: switch to GSO being always on
With GSO always on, TCP write queues have less overhead, which makes
some applications run faster.

A redis-benchmark test was used to verify this:

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
2024-06-11 20:51:19 +08:00
Menglong Dong f0d423d51c net: tcp: raise zero-window probe without checking wnd_end
In the original logic, a zero-window probe can be raised not only on a
zero window, but also in other cases, such as when MTU probing fails.

Therefore, we need to modify tcp_probe0_needed() to keep it compatible
with the original logic.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:19 +08:00
Linus Torvalds 980a335360 mm: make wait_on_page_writeback() wait for multiple pending writebacks
upstream commit: c2407cf7d2

Ever since commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common()
logic") we've had some very occasional reports of BUG_ON(PageWriteback)
in write_cache_pages(), which we thought we already fixed in commit
073861ed77 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

But syzbot just reported another one, even with that commit in place.

And it turns out that there's a simpler way to trigger the BUG_ON() than
the one Hugh found with page re-use.  It all boils down to the fact that
the page writeback is ostensibly serialized by the page lock, but that
isn't actually really true.

Yes, the people _setting_ writeback all do so under the page lock, but
the actual clearing of the bit - and waking up any waiters - happens
without any page lock.

This gives us this fairly simple race condition:

  CPU1 = end previous writeback
  CPU2 = start new writeback under page lock
  CPU3 = write_cache_pages()

  CPU1          CPU2            CPU3
  ----          ----            ----

  end_page_writeback()
    test_clear_page_writeback(page)
    ... delayed...

                lock_page();
                set_page_writeback()
                unlock_page()

                                lock_page()
                                wait_on_page_writeback();

    wake_up_page(page, PG_writeback);
    .. wakes up CPU3 ..

                                BUG_ON(PageWriteback(page));

where the BUG_ON() happens because we woke up the PG_writeback bit
because of the _previous_ writeback, but a new one had already been
started because the clearing of the bit wasn't actually atomic wrt the
actual wakeup or serialized by the page lock.

The reason this didn't use to happen was that the old logic in waiting
on a page bit would just loop if it ever saw the bit set again.

The nice proper fix would probably be to get rid of the whole "wait for
writeback to clear, and then set it" logic in the writeback path, and
replace it with an atomic "wait-to-set" (ie the same as we have for page
locking: we set the page lock bit with a single "lock_page()", not with
"wait for lock bit to clear and then set it").

However, our current model for writeback is that the waiting for the
writeback bit is done by the generic VFS code (ie write_cache_pages()),
but the actual setting of the writeback bit is done much later by the
filesystem ".writepages()" function.

IOW, to make the writeback bit have that same kind of "wait-to-set"
behavior as we have for page locking, we'd have to change our roughly
~50 different writeback functions.  Painful.

Instead, just make "wait_on_page_writeback()" loop on the very unlikely
situation that the PG_writeback bit is still set, basically re-instating
the old behavior.  This is very non-optimal in case of contention, but
since we only ever set the bit under the page lock, that situation is
controlled.
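
The fix itself is tiny; per the upstream commit it re-instates a loop:

    /* mm/page-writeback.c */
    void wait_on_page_writeback(struct page *page)
    {
            while (PageWriteback(page)) {
                    trace_wait_on_page_writeback(page, page_mapping(page));
                    wait_on_page_bit(page, PG_writeback);
            }
    }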

Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
Fixes: 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:18 +08:00
Paolo Bonzini e31375a771 KVM: Do not leak memory for duplicate debugfs directories
commit 85cd39af14 upstream.

KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine.  The directory name is built from the process
pid and a VM fd.  While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:

  [  471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.

When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked.  The slow memory leak was spotted by syzkaller, where it
caused OOM reports.

Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created.  This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.
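
Per the upstream patch, the creation path gains a lookup first (hedged
sketch):

    snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
    dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
    if (dent) {
            /* downgraded and rate-limited vs. debugfs' own message */
            pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
                                dir_name);
            dput(dent);
            return 0;       /* skip stats for this VM instead of leaking */
    }
    dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);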

Cc: stable@vger.kernel.org
Fixes: 536a6f88c4 ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:18 +08:00
Andrew Sy Kim f2f46e7af4 ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
[upstream commit 35dfb01314]

When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.

Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-11 20:51:18 +08:00
Menglong Dong 0fe440f03b net: tcp: handle window shrink properly
Window shrink is not allowed and also not handled for now, but it is
needed in some cases.

In the original logic, the zero-window probe is triggered only when
there is no data in the retransmit queue and the receive window cannot
hold the first packet in the send queue.

Now, let's change it and trigger the zero-window probe in these cases
(see the sketch below):

- the retransmit queue has data and its first packet is not within
  the receive window
- there is no data in the retransmit queue and the first packet in the
  send queue is beyond the end of the receive window
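
A hypothetical sketch of such a check (this is a private patch; the
helper name tcp_probe0_needed() comes from the previous commit in this
series, and the exact condition is an assumption):

    static bool tcp_probe0_needed(const struct sock *sk)
    {
            const struct tcp_sock *tp = tcp_sk(sk);
            struct sk_buff *skb = tcp_rtx_queue_head(sk);   /* retransmit queue first */

            if (!skb)
                    skb = tcp_send_head(sk);                /* then the send queue */

            /* probe if the head packet does not fit into the window */
            return skb && after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp));
    }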

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:17 +08:00
Menglong Dong bb00f4ca4c net: tcp: send zero-window when no memory
For now, the skb is dropped when there is no memory, which makes the
client keep retransmitting until it times out; this is not friendly to
users.

Therefore, we now force the current socket to receive one packet when
protocol memory is over its limit. This socket then stays in 'no mem'
status until protocol memory is available again.

When a socket is in 'no mem' status, its receive window becomes 0,
which means a window shrink happens. The sender needs to handle such
a window shrink properly, which is done in the next commit.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
2024-06-11 20:51:17 +08:00
caelli 1ea94d5505 driver: update e1000e to 3.8.4
The e1000e driver is updated to 3.8.4 on x86; arm64
still uses 3.2.6.

Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:17 +08:00
Liu Chun 73b70ea3f0 kdump: the capture kernel can't use dma memory
On arm64 systems with little memory below 4G, the capture kernel
cannot use DMA memory. Therefore, it is necessary to enable
CONFIG_KEXEC_FILE and fix the reserved-memory handling so that low
memory is passed to the kdump kernel.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Liu Chun <kaicliu@tencent.com>
2024-06-11 20:51:11 +08:00
Liu Chun d151d105a1 drm: Fixed system hang caused by memory failure
When DMA memory is insufficient, incorrect release of resources
causes the system to hang.

[   35.975823] [TTM] Initializing pool allocator
[   35.980166] [TTM] Initializing DMA pool allocator
[   35.984864] [drm:hibmc_mm_init [hibmc_drm]] *ERROR* Error initializing VRAM MM; -12
[   35.992517] ------------[ cut here ]------------
[   35.997154] WARNING: CPU: 0 PID: 116 at drivers/gpu/drm/drm_modeset_lock.c:266 drm_modeset_lock+0xd8/0xf8 [drm]
[   36.007192] Modules linked in: hibmc_drm(+) drm_vram_helper ttm drm_kms_helper drm autofs4 overlay squashfs
[   36.016890] CPU: 0 PID: 116 Comm: kworker/0:2 Not tainted 5.4.119-0.20230227git9d7d3558a64d.19 #1
[   36.025719] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
[   36.033173] Workqueue: events work_for_cpu_fn
[   36.037510] pstate: a0800009 (NzCv daif -PAN +UAO)
[   36.042297] pc : drm_modeset_lock+0xd8/0xf8 [drm]
[   36.046995] lr : drm_modeset_lock+0x44/0xf8 [drm]
[   36.051676] sp : ffff80005462fc30
[   36.054974] x29: ffff80005462fc30 x28: 0000000000000000
[   36.060260] x27: ffff2057ebe20000 x26: 0000000000000000
[   36.065546] x25: 0000000000000000 x24: ffff80004cf6f8e8
[   36.070833] x23: 0000000000000000 x22: ffff2057f4739800
[   36.076119] x21: ffff800049803908 x20: ffff2057f4739998
[   36.081405] x19: ffff80005462fcc0 x18: 0000000000000010
[   36.086690] x17: 0000000000000000 x16: ffff800048b61b88
[   36.091976] x15: ffffffffffffffff x14: 204d41525620676e
[   36.097261] x13: 697a696c61697469 x12: 6e6920726f727245
[   36.102547] x11: 202a524f5252452a x10: 205d5d6d72645f63
[   36.107832] x9 : ffff800048b61bcc x8 : ffff800048703a60
[   36.113118] x7 : 065448]  work_for_cpu_fn+0x20/0x30
[   36.169181]  process_one_work+0x1f8/0x488
[   36.173173]  worker_thread+0x248/0x528
[   36.176906]  kthread+0x124/0x128
[   36.180121]  ret_from_fork+0x10/0x18
[   36.183679] ---[ end trace aae0476f91651f5d ]---
[   36.188284] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
[   36.197028] Mem abort info:
[   36.199809]   ESR = 0x96000005
[   36.202851]   EC = 0x25: DABT (current EL), IL = 32 bits
[   36.208137]   SET = 0, FnV = 0
[   36.211176]   EA = 0, S1PTW = 0
[   36.214303] Data abort info:
[   36.217172]   ISV = 0, ISS = 0x00000005
[   36.220990]   CM = 0, WnR = 0
[   36.223946] user pgtable: 64k pages, 48-bit VAs, pgdp=00002057df5a0600
[   36.230444] [0000000000000018] pgd=0000000000000000, pud=0000000000000000
[   36.237200] Internal error: Oops: 96000005 [#1] SMP
[   36.242056] Modules linked in: hibmc_drm(+) drm_vram_helper ttm drm_kms_helper drm autofs4 overlay squashfs
[   36.251751] CPU: 0 PID: 116 Comm: kworker/0:2 Tainted: G        W         5.4.119-0.20230227git9d7d3558a64d.19 #1
[   36.261963] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
[   36.269411] Workqueue: events work_for_cpu_fn
[   36.273747] pstate: a0800009 (NzCv daif -PAN +UAO)
[   36.278518] pc : ww_mutex_lock+0x2c/0x70
[   36.282439] lr : drm_modeset_lock+0x44/0xf8 [drm]
[   36.287120] sp : ffff80005462fc20
[   36.290418] x29: ffff80005462fc20 x28: 0000000000000000
[   36.295704] x27: ffff2057ebe20000 x26: 0000000000000000
[   36.300989] x25: 0000000000000000 x24: ffff80004cf6f8e8
[   36.306274] x23: 0000000000000000 x22: ffff2057f4739800
[   36.311560] x21: ffff2057f4739af8 x20: 0000000000000018
[   36.316846] x19: ffff80005462fcc0 x18: 0000000000000010
[   36.322131] x17: 0000000000000000 x16: ffff800048b61b88
[   36.327418] x15: ffffffffffffffff x14: 204d41525620676e
[   36.332703] x13: 697a696c61697469 x12: 6e6920726f727245
[   36.337989] x11: 202a524f5252452a x10: 205d5d6d72645f63
[   36.343274] x9 : ffff800008ce4594 x8 : ffff800048703a60
[   36.348560] x7 : 0000000000000469 x6 : ffff80004998e5e6
[   36.353845] x5 : 0000000000000000 x4 : ffff80005462fcc0
[   36.359131] x3 : 0000000000000018 x2 : ffff2057ebe30000
[   36.364417] x1 : 0000000000000000 x0 : 0000000000000018
[   36.369703] Call trace:
[   36.372138]  ww_mutex_lock+0x2c/0x70
[   36.375712]  drm_modeset_lock+0x44/0xf8 [drm]
[   36.380064]  drm_modeset_lock_all_ctx+0x68/0xf8 [drm]
[   36.385100]  drm_atomic_helper_shutdown+0x54/0xd0 [drm_kms_helper]
[   36.391251]  hibmc_unload+0x2c/0xa8 [hibmc_drm]
[   36.395762]  hibmc_pci_probe+0x318/0x430 [hibmc_drm]
[   36.400703]  local_pci_probe+0x44/0xa8
[   36.404435]  work_for_cpu_fn+0x20/0x30
[   36.408167]  process_one_work+0x1f8/0x488
[   36.412158]  worker_thread+0x248/0x528
[   36.415890]  kthread+0x124/0x128
[   36.419103]  ret_from_fork+0x10/0x18
[   36.422662] Code: d503201f d503201f d2800001 aa0103e5 (c8e57c02)
[   36.428727] ---[ end trace aae0476f91651f5e ]---
[   37.169300] systemd-udevd[307]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.

Signed-off-by: Chun Liu <kaicliu@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song 1b7d9fa70f arm64: kexec_file: add crash dump support
Upstream: 3751e728ce
Link: 40e94ab32e

commit 3751e728ce
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date:   Mon Dec 16 11:12:47 2019 +0900

    arm64: kexec_file: add crash dump support

    Enabling crash dump (kdump) includes
    * prepare contents of ELF header of a core dump file, /proc/vmcore,
      using crash_prepare_elf64_headers(), and
    * add two device tree properties, "linux,usable-memory-range" and
      "linux,elfcorehdr", which represent respectively a memory range
      to be used by crash dump kernel and the header's location

    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Reviewed-by: James Morse <james.morse@arm.com>
    Tested-and-reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song b7e9b568c2 libfdt: include fdt_addresses.c
Upstream: c273a2bd8a
Link: 887436bdb7

commit c273a2bd8a
Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
Date:   Mon Dec 9 12:03:44 2019 +0900

    libfdt: include fdt_addresses.c

    In the implementation of kexec_file_loaded-based kdump for arm64,
    fdt_appendprop_addrrange() will be needed.

    So include fdt_addresses.c in making libfdt.

    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Rob Herring <robh+dt@kernel.org>
    Cc: Frank Rowand <frowand.list@gmail.com>
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:02 +08:00
Kairui Song 6fb78c4cc6 arm64: kdump: remove dependency on arm64_dma32_phys_limit
From: Yi Li <adamliyi@msn.com>
Link: 696027f109

The patch b2da6ad294
(arm64: kdump: reimplement crashkernel=X) depends on commit 1a8e1cef76
("arm64: use both ZONE_DMA and ZONE_DMA32").

Commit 1a8e1cef76 is not ported to the 5.4 kernel, so use arm64_dma_phys_limit instead.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song fd02a1b5bc kdump: update Documentation about crashkernel
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 023deaec32

For arm64, the behavior of crashkernel=X has been changed: it tries a
low allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA
is disabled) and falls back to a high allocation if that fails.

We can also use "crashkernel=X,high" to select a high region above
DMA zone, which also tries to allocate at least 256M low memory in
DMA zone automatically (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).

"crashkernel=Y,low" can be used to allocate specified size low memory.

So update the Documentation.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song 3fd41ff677 arm64: kdump: add memory for devices by DT property linux,usable-memory-range
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 2012a3b392

When reserving crashkernel in high memory, some low memory is reserved
for crash dump kernel devices and never mapped by the first kernel.
This memory range is advertised to crash dump kernel via DT property
under /chosen,
	linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>

We reused the DT property linux,usable-memory-range and made the low
memory region the second range "BASE2 SIZE2", which keeps compatibility
with existing user-space and older kdump kernels.

Crash dump kernel reads this property at boot time and calls memblock_add()
to add the low memory region after memblock_cap_memory_range() has been
called.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:01 +08:00
Kairui Song d001dccf2b x86, arm64: Add ARCH_WANT_RESERVE_CRASH_KERNEL config
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: c8013ee6cd

We make the functions reserve_crashkernel[_low]() generic for
x86 and arm64. Since the reserve_crashkernel[_low]() implementations
are quite similar on other architectures as well, we can have more
users of this later.

So have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL in arch/Kconfig and
select it from X86 and ARM64.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song bd482067c3 arm64: kdump: reimplement crashkernel=X
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 70e586365f

There are following issues in arm64 kdump:
1. We use crashkernel=X to reserve the crash kernel below 4G, which
fails when there is not enough low memory.
2. If the crash kernel is reserved above 4G, the crash dump kernel
fails to boot because there is no low memory available for
allocation.
3. Since commit 1a8e1cef76 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
devices in the crash dump kernel that need ZONE_DMA will fail to
allocate.

To solve these issues, change the behavior of crashkernel=X and
introduce crashkernel=X,[high,low]. crashkernel=X tries a low
allocation in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is
disabled) and falls back to a high allocation if that fails.
We can also use "crashkernel=X,high" to select a region above DMA zone,
which also tries to allocate at least 256M in DMA zone automatically
(or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
"crashkernel=Y,low" can be used to allocate specified size low memory.

As another minor change, there may now be two regions reserved for the
crash dump kernel; to distinguish the low region from the high one
without affecting existing kexec-tools, rename the low region
"Crash kernel (low)".

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song f30355b620 arm64: kdump: introduce some macros for crash kernel reservation
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: 667118f8c1

Introduce the macro CRASH_ALIGN for alignment, CRASH_ADDR_LOW_MAX for
the upper bound of low crash memory, and CRASH_ADDR_HIGH_MAX for the
upper bound of high crash memory, and use these macros instead.

Besides, to keep consistent with x86, use CRASH_ALIGN as the lower
bound of the crash kernel reservation.
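
Illustrative shape of the macros (values are an assumption; per the
earlier patch in this series, the low bound on this 5.4 backport uses
arm64_dma_phys_limit):

    #define CRASH_ALIGN             SZ_2M
    #define CRASH_ADDR_LOW_MAX      arm64_dma_phys_limit
    #define CRASH_ADDR_HIGH_MAX     MEMBLOCK_ALLOC_ACCESSIBLE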

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Tested-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:51:00 +08:00
Kairui Song 21ff8ff8f3 x86/elf: Move vmcore_elf_check_arch_cross to arch/x86/include/asm/elf.h
From: Chen Zhou <chenzhou10@huawei.com>
Link: https://lkml.org/lkml/2021/1/30/53
Link: b332ab8970

Move macro vmcore_elf_check_arch_cross from arch/x86/include/asm/kexec.h
to arch/x86/include/asm/elf.h to fix the following compiling warning:

In file included from arch/x86/kernel/setup.c:39:0:
./arch/x86/include/asm/kexec.h:77:0: warning: "vmcore_elf_check_arch_cross" redefined
 # define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)

In file included from arch/x86/kernel/setup.c:9:0:
./include/linux/crash_dump.h:39:0: note: this is the location of the previous definition
 #define vmcore_elf_check_arch_cross(x) 0

The root cause is that vmcore_elf_check_arch_cross under CONFIG_CRASH_CORE
depends on CONFIG_KEXEC_CORE. Commit 532b66d2279d ("x86: kdump: move
reserve_crashkernel[_low]() into crash_core.c") triggered the issue.

As suggested by Mike, simply move vmcore_elf_check_arch_cross from
arch/x86/include/asm/kexec.h to arch/x86/include/asm/elf.h to fix
the warning.
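
That is, the macro now lives in elf.h (definition exactly as quoted in
the warning above):

    /* arch/x86/include/asm/elf.h */
    #define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)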

Fixes: 532b66d2279d ("x86: kdump: move reserve_crashkernel[_low]() into crash_core.c")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-06-11 20:50:59 +08:00