Commit Graph

873783 Commits

Author SHA1 Message Date
Jinliang Zheng 9d1d946a72 hygon: turn CONFIG_MICROCODE_HYGON on
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 19:00:32 +08:00
fuhao 1ff035c011 x86/resctrl: Add Hygon QoS support
Add support for Hygon QoS feature.

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 19:00:32 +08:00
Pu Wen d7b445324e anolis: perf/x86/uncore: Add L3 PMU support for Hygon family 18h model 6h
ANBZ: #5455

Adjust the L3 PMU slicemask and threadmask for Hygon family 18h
model 6h processor.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2536
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 19:00:32 +08:00
fuhao e45c93c358 ALSA: hda: Fix single byte writing issue for Hygon family 18h model 5h
On Hygon family 18h model 5h controller, some registers such as
GCTL, SD_CTL and SD_CTL_3B should be accessed in dword, or the
writing will fail.

Conflicts:
	sound/pci/hda/hda_intel.c

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 19:00:07 +08:00
fuhao eaad4d0de2 ALSA: hda: Add support for Hygon family 18h model 5h HD-Audio
Add the new PCI ID 0x1d94 0x14a9 for Hygon family 18h model 5h
HDA controller.

Conflicts:
	sound/pci/hda/hda_intel.c

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 18:56:59 +08:00
fuhao 768858da65 EDAC/amd64: Adjust UMC channel for Hygon family 18h model 6h
Hygon family 18h model 6h has 2 cs mapped to 1 umc, so adjust for it.

Fixes: 6cf6141e
 ("EDAC/amd64: Add support for Hygon family 18h model 6h")

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
fuhao af62cfaf8b x86/amd_nb: Get DF ID from F5 device for Hygon family 18h model 6h
The DF ID should be get from DF F5 device for Hygon family 18h
model 6h processor.

Fixes: 3e980dd8
 ("x86/amd_nb: Add support for Hygon family 18h model 6h")

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
fuhao 6aaeaee473 EDAC/amd64: Fix intlv_num_chan for Hygon family 18h model 4h
Make the modification in commit 0ac06d63 to intlv_num_chan only
for Hygon family 18h model 4h.

Fixes: 0ac06d63
 ("EDAC/amd64: Adjust address translation for Hygon family 18h model 4h")

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
fuhao f687e8b76a EDAC/amd64: Revert hi_addr_offset for Hygon family 18h model 4h
The HiAddrOffset is always the top 4 bits of normalized address,
so revert the modification in commit 0ac06d63 to the original
implementation.

Fixes: 0ac06d63
 ("EDAC/amd64: Adjust address translation for Hygon family 18h model 4h")

Signed-off-by: fuhao <fuh@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
Pu Wen b799eb7121 EDAC/amd64: Add support for Hygon family 18h model 6h
Add F0/F6 device IDs for Hygon family 18h model 6h processor.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
Pu Wen 2bf609d3dd x86/amd_nb: Add support for Hygon family 18h model 6h
Hygon family 18h model 6h processor has the same DF F1 device ID
as M05H_DF_F1.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
Pu Wen 17704da187 x86/cpu: Get CPU topology for Hygon family 18h model 6h
Add support to derive CPU topology for Hygon family 18h model 6h
processor.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:30 +08:00
Pu Wen 735e1788da hwmon/k10temp: Add support for Hygon family 18h model 5h
Add 18H_M05H DF F3 device ID to get the temperature for Hygon
family 18h model 5h processor.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 24e1f7cdda EDAC/amd64: Add support for Hygon family 18h model 5h
Hygon family 18h model 5h processor has the same F0/F6 device IDs as
F17h_M30h, and the syndrome size is the same as model 4h processor.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 2c598bc5d1 x86/amd_nb: Add support for Hygon family 18h model 5h
Add root and DF F1/F3/F4 device IDs for Hygon family 18h model
5h processors. But some model 5h processors have the legacy(M04H)
DF devices, so add a if conditional to read the df1 register.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 9c869204b2 x86/cpu: Get CPU topology and LLC ID for Hygon family 18h model 5h
Add support to derive CPU topology for Hygon family 18h model 5h
processor, and calculate LLC ID for it from the number of threads
sharing the cache.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 6b45f25208 i2c-piix4: Remove the IMC detecting for Hygon SMBus
Remove IMC detecting path for Hygon processors.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 35c1b7081e hwmon/k10temp: Add support for Hygon family 18h model 4h
The DF F3 device ID used to get the temperature for Hygon family 18h
model 4h processor is the same as 17H_M30H, but with different offsets,
which may span two distributed ranges. The second offset range can be
considered as private for Hygon, so use struct hygon_private to describe
it.

Add a pointer priv in k10temp_data to point to the private data.
Add functions k10temp_get_ccd_support_2nd() and hygon_read_temp()
to support reading the second offset range.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 9680d833c1 EDAC/mce_amd: Use struct cpuinfo_x86.logical_die_id for Hygon NodeId
The cpuinfo_x86.cpu_die_id is get from CPUID or MSR in the commit
028c221ed1 ("x86/CPU/AMD: Save AMD NodeId as cpu_die_id"). But the
value may not be continuous for Hygon model 4h~6h processors.

Use cpuinfo_x86.logical_die_id will always format continuous die
(or node) IDs, because it will convert the physical die ID to logical
die ID.

So use topology_logical_die_id() instead of topology_die_id() to
decode UMC ECC errors for Hygon processors.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 2c453f4b4e EDAC/amd64: Adjust address translation for Hygon family 18h model 4h
Add Hygon family 18h model 4h processor support for DramOffset and
HiAddrOffset, and get the socket interleaving number from DramBase-
Address(D18F0x110).

Update intlv_num_chan and num_intlv_bits support for Hygon family 18h
model 4h processor.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 36d1aba7a1 EDAC/amd64: Add support for Hygon family 18h model 4h
Hygon family 18h model 4h processor has the same F0/F6 device IDs
as F17h_M30h.

Add support to determine which DDR memory types it supports, but
determine syndrome sizes from different bits of the DRAM ECC control
reg.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 98cb40237c EDAC/amd64: Get UMC channel from the 6th nibble for Hygon
On Hygon family 18h platforms, we look at the 6th nibble(bit 20~23)
in the instance_id to derive the channel number.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:29 +08:00
Pu Wen 038d8bb6ce iommu/hygon: Add support for Hygon family 18h model 4h IOAPIC
The SB IOAPIC is on the device 0xb from Hygon family 18h model 4h.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Pu Wen 2336e8989a x86/amd_nb: Add northbridge support for Hygon family 18h model 4h
Add dedicated functions to initialize the northbridge for Hygon family
18h model 4h processors.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Pu Wen 88dbf4a46d x86/amd_nb: Add Hygon family 18h model 4h PCI IDs
Add the PCI device IDs for Hygon family 18h model 4h processors.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Pu Wen 871c9d1964 x86/microcode/hygon: Add microcode loading support for Hygon processors
Add support for loading Hygon microcode, which is compatible with AMD one.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Pu Wen 602852c638 x86/cpu/hygon: Modify the CPU topology deriving method for Hygon
Hygon processors before model 4h have not use the CPUID leaf 0xB, so use
commit e0ceeae708 ("x86/CPU/hygon: Fix phys_proc_id calculation logic
for multi-die processors") to derive the socket ID when running on host.
If kernel running on guest, use the hypervisor's default.

For model 4h, Hygon processors use CPUID leaf 0xB to identify SMT and
CORE level types, so use function detect_extended_topology() to derive
the core ID, socket ID and APIC ID. But it still set __max_die_per_package
to nodes_per_socket because it lacks the DIE level type.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Muralidhara M K c6a94a3c1e x86/MCE/AMD: Use an u64 for bank_map
commit 4c1cdec319 upstream.

Thee maximum number of MCA banks is 64 (MAX_NR_BANKS), see

  a0bc32b3ca ("x86/mce: Increase maximum number of banks to 64").

However, the bank_map which contains a bitfield of which banks to
initialize is of type unsigned int and that overflows when those bit
numbers are >= 32, leading to UBSAN complaining correctly:

  UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38
  shift exponent 32 is too large for 32-bit type 'int'

Change the bank_map to a u64 and use the proper BIT_ULL() macro when
modifying bits in there.

  [ bp: Rewrite commit message. ]

Fixes: a0bc32b3ca ("x86/mce: Increase maximum number of banks to 64")
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127151601.1068324-1-muralimk@amd.com
[fix conflict during backport]
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Mario Limonciello e9cf096764 hwmon: (k10temp) Don't show Tdie for all Zen/Zen2/Zen3 CPU/APU
commit 02a2484cf8 upstream.

Tdie is an offset calculation that should only be shown when temp_offset
is actually put into a table.  This is useless to show for all CPU/APU.
Show it only when necessary.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Pu Wen 6c20e20ff5 x86/cstate: Allow ACPI C1 FFH MWAIT use on Hygon systems
commit 280b68a3b3 upstream.

Hygon systems support the MONITOR/MWAIT instructions and these can be
used for ACPI C1 in the same way as on AMD and Intel systems.

The BIOS declares a C1 state in _CST to use FFH and CPUID_Fn00000005_EDX
is non-zero on Hygon systems.

Allow ffh_cstate_init() to succeed on Hygon systems to default using FFH
MWAIT instead of HALT for ACPI C1.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210528081417.31474-1-puwen@hygon.cn
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Pu Wen 0684f6615b x86/cpu/hygon: Set __max_die_per_package on Hygon
commit 59eca2fa19 upstream.

Set the maximum DIE per package variable on Hygon using the
nodes_per_socket value in order to do per-DIE manipulations for drivers
such as powercap.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210302020217.1827-1-puwen@hygon.cn
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Yazen Ghannam a4192109d3 x86/topology: Set cpu_die_id only if DIE_TYPE found
commit cb09a37972 upstream.

CPUID Leaf 0x1F defines a DIE_TYPE level (nb: ECX[8:15] level type == 0x5),
but CPUID Leaf 0xB does not. However, detect_extended_topology() will
set struct cpuinfo_x86.cpu_die_id regardless of whether a valid Die ID
was found.

Only set cpu_die_id if a DIE_TYPE level is found. CPU topology code may
use another value for cpu_die_id, e.g. the AMD NodeId on AMD-based
systems. Code ordering should be maintained so that the CPUID Leaf 0x1F
Die ID value will take precedence on systems that may use another value.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-5-Yazen.Ghannam@amd.com
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:28 +08:00
Akshay Gupta f3d38fa0fa x86/mce: Increase maximum number of banks to 64
commit a0bc32b3ca upstream.

...because future AMD systems will support up to 64 MCA banks per CPU.

MAX_NR_BANKS is used to allocate a number of data structures, and it is
used as a ceiling for values read from MCG_CAP[Count]. Therefore, this
change will have no functional effect on existing systems with 32 or
fewer MCA banks per CPU.

However, this will increase the size of the following structures:

Global bitmaps:
- core.c / mce_banks_ce_disabled
- core.c / all_banks
- core.c / valid_banks
- core.c / toclear
- Total: 32 new bits * 4 bitmaps = 16 new bytes

Per-CPU bitmaps:
- core.c / mce_poll_banks
- intel.c / mce_banks_owned
- Total: 32 new bits * 2 bitmaps = 8 new bytes

The bitmaps are arrays of longs. So this change will only affect 32-bit
execution, since there will be one additional long used. There will be
no additional memory use on 64-bit execution, because the size of long
is 64 bits.

Global structs:
- amd.c / struct smca_bank smca_banks[]: 16 bytes per bank
- core.c / struct mce_bank_dev mce_bank_devs[]: 56 bytes per bank
- Total: 32 new banks * (16 + 56) bytes = 2304 new bytes

Per-CPU structs:
- core.c / struct mce_bank mce_banks_array[]: 16 bytes per bank
- Total: 32 new banks * 16 bytes = 512 new bytes

32-bit
Total global size increase: 2320 bytes
Total per-CPU size increase: 520 bytes

64-bit
Total global size increase: 2304 bytes
Total per-CPU size increase: 512 bytes

This additional memory should still fit within the existing .data
section of the kernel binary. However, in the case where it doesn't
fit, an additional page (4kB) of memory will be added to the binary to
accommodate the extra data which will be the maximum size increase of
vmlinux.

Signed-off-by: Akshay Gupta <Akshay.Gupta@amd.com>
[ Adjust commit message and code comment. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200828192412.320052-1-Yazen.Ghannam@amd.com
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:27 +08:00
Pu Wen 6be72939b0 i2c: designware: Add device HID for Hygon I2C controller
commit 384b02d6b8 upstream.

Add device HID HYGO0010 to match the Hygon ACPI Vendor ID (HYGO) that
was registered in http://www.uefi.org/acpi_id_list, and the I2C
controller on Hygon paltform will use the HID.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Wolfram Sang <wsa@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[fix conflict during backport]
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:27 +08:00
Jiasen Lin b47eb972d4 NTB: Add Hygon Device ID
commit 9b5b99a89f upstream.

Signed-off-by: Jiasen Lin <linjiasen@hygon.cn>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:27 +08:00
Jinliang Zheng 8fbb888f6d EDAC/amd64: Add AMD family 17h model 60h PCI IDs - conflict resolve
commit b6bea24d41 upstream.

Conflict resolve when introduce commit aa9873f9fa1e ("EDAC/amd64:
Add AMD family 17h model 60h PCI IDs").

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:48:27 +08:00
Jinliang Zheng 42724daea4 configs: turn CONFIG_PROC_CHILDREN on
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: mengensun <mengensun@tencent.com>
2024-11-14 17:46:47 +08:00
Jianping Liu 4b200fd1d1 config: enable CONFIG_VFIO_NOIOMMU
Qemu not support iommu, PCG want using vfio ko with
enable_unsafe_noiommu_mode=1 to improve the speed.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
2024-11-14 17:46:40 +08:00
Jianping Liu e5685c444b config: set CONFIG_NGBE=m
To support wangxun NIC, CONFIG_NGBE=m.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-14 17:46:24 +08:00
Yongliang Gao f4188f5fe4 config: remove CONFIG_VIRTIO_BLK_SCSI
VIRTIO_BLK_F_SCSI support is retired this commit:
commit 8fbea7d49c27 ("virtio-blk: remove VIRTIO_BLK_F_SCSI support")
we update config file to remove CONFIG_VIRTIO_BLK_SCSI

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
2024-11-14 17:46:11 +08:00
Yongliang Gao 74f7566cbf config: remove CONFIG_NET_CLS_RSVP and CONFIG_NET_CLS_RSVP6
rsvp classifier is retired this commit:
commit 2bc2a81619d8 ("net/sched: Retire rsvp classifier")
we update config file to remove CONFIG_NET_CLS_RSVP and
CONFIG_NET_CLS_RSVP6

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
2024-11-14 17:45:45 +08:00
Jianping Liu fa1f2753b2 x86/mce: revert "Add NMIs setup in machine_check func"
The function of do_machine_check has called nmi_enter before mce_panic,
so it needn't to call nmi_enter at the outside of do_machine_check.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-11-14 17:32:06 +08:00
Jianping Liu 251e419790 dist: delete kernel-modules-public rpm
Some customer want install OS from USB card or BMC. OS treat BMC as a
USB storage device. So, do not put usb storage ko in kernel
modules-public-removable-media rpm, delete kernel-modules-public rpm.

More details:
--story=1020422414120102612
--bug=1020426283131556911

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-11-12 19:47:08 +08:00
caelli bb0729d7db dist: fix uname version
Currently, uname -r outputs 5.4.241-24-0017.xx, rpm package
is kernel-5.4.241-24.0017.xx, which is not match, fix this
by changing uname outputs to 5.4.241-24.0017.xx.

Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: yuehongwu <yuehongwu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-12 19:46:38 +08:00
yuehongwu 6782d5a027 dist: delete tlinux4 tag in dist/Makefile
Signed-off-by: yuehongwu <yuehongwu@tencent.com>
Reviewed-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-12 19:46:16 +08:00
Jianping Liu c100c43b51 Makefile: set EXTRAVERSION from 1 to 30
To distinguish between OC version and private version, set kernel
EXTRAVERSION from 1 to 30.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-11-12 17:11:05 +08:00
chinaljp030 1d6573295e
!261 Backport the support for cluster scheduler level
Merge pull request !261 from XueSinian/linux-5.4/devel-rm-flag-SD_FLAG-cluster
2024-11-11 09:28:56 +00:00
Wang ShaoBo a190985d36 arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology()
mainline inclusion
commit 4cc4cc28ec upstream.

----------------------------------------------------------------------

When testing cpu online and offline, warning happened like this:

[  146.746743] WARNING: CPU: 92 PID: 974 at kernel/sched/topology.c:2215 build_sched_domains+0x81c/0x11b0
[  146.749988] CPU: 92 PID: 974 Comm: kworker/92:2 Not tainted 5.15.0 #9
[  146.750402] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.79 08/21/2021
[  146.751213] Workqueue: events cpuset_hotplug_workfn
[  146.751629] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  146.752048] pc : build_sched_domains+0x81c/0x11b0
[  146.752461] lr : build_sched_domains+0x414/0x11b0
[  146.752860] sp : ffff800040a83a80
[  146.753247] x29: ffff800040a83a80 x28: ffff20801f13a980 x27: ffff20800448ae00
[  146.753644] x26: ffff800012a858e8 x25: ffff800012ea48c0 x24: 0000000000000000
[  146.754039] x23: ffff800010ab7d60 x22: ffff800012f03758 x21: 000000000000005f
[  146.754427] x20: 000000000000005c x19: ffff004080012840 x18: ffffffffffffffff
[  146.754814] x17: 3661613030303230 x16: 30303078303a3239 x15: ffff800011f92b48
[  146.755197] x14: ffff20be3f95cef6 x13: 2e6e69616d6f642d x12: 6465686373204c4c
[  146.755578] x11: ffff20bf7fc83a00 x10: 0000000000000040 x9 : 0000000000000000
[  146.755957] x8 : 0000000000000002 x7 : ffffffffe0000000 x6 : 0000000000000002
[  146.756334] x5 : 0000000090000000 x4 : 00000000f0000000 x3 : 0000000000000001
[  146.756705] x2 : 0000000000000080 x1 : ffff800012f03860 x0 : 0000000000000001
[  146.757070] Call trace:
[  146.757421]  build_sched_domains+0x81c/0x11b0
[  146.757771]  partition_sched_domains_locked+0x57c/0x978
[  146.758118]  rebuild_sched_domains_locked+0x44c/0x7f0
[  146.758460]  rebuild_sched_domains+0x2c/0x48
[  146.758791]  cpuset_hotplug_workfn+0x3fc/0x888
[  146.759114]  process_one_work+0x1f4/0x480
[  146.759429]  worker_thread+0x48/0x460
[  146.759734]  kthread+0x158/0x168
[  146.760030]  ret_from_fork+0x10/0x20
[  146.760318] ---[ end trace 82c44aad6900e81a ]---

For some architectures like risc-v and arm64 which use common code
clear_cpu_topology() in shutting down CPUx, When CONFIG_SCHED_CLUSTER
is set, cluster_sibling in cpu_topology of each sibling adjacent
to CPUx is missed clearing, this causes checking failed in
topology_span_sane() and rebuilding topology failure at end when CPU online.

Different sibling's cluster_sibling in cpu_topology[] when CPU92 offline
(CPU 92, 93, 94, 95 are in one cluster):

Before revision:
CPU                 [92]      [93]      [94]      [95]
cluster_sibling     [92]     [92-95]   [92-95]   [92-95]

After revision:
CPU                 [92]      [93]      [94]      [95]
cluster_sibling     [92]     [93-95]   [93-95]   [93-95]

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Barry Song <song.bao.hua@hisilicon.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20211110095856.469360-1-bobo.shaobowang@huawei.com

Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
2024-11-11 08:21:03 +08:00
Guan Jing 467d9b6f8f sched/fair: Fix kabi borken in sched_domain_shared
commit 222d84a0de0d2dbd75b2c73f469d74868955f3b5 openeuler.

--------------------------------

The sched_domain_shared structure is only used as pointer, and other
drivers don't use it directly.

Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
2024-11-11 08:20:56 +08:00
Chen Yu 66aa97cb7b sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
mainline inclusion
from mainline-v6.0-rc1
commit 70fb5ccf2e upstream.

--------------------------------

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS
	patch.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or
	recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU(success_rate% = 100%), it is
doubtful whether it is worth it to search the whole LLC domain.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.

In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when a LLC sched group is overloaded.

Introduce the quadratic function:

y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:

nr_scan = llc_weight * y'

Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.

This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamp, the util_avg would decay insanely fast when
the CPU is idle. Commit addca28512 ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.

[Test result]

netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of CPU number respectively. Hackbench and schbench were launched
by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)

Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There is -87.9% less CPU scans after patched, which indicates lower overhead.
Besides, with this patch applied, there is -13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement - Because this patch allows
the waking task to remain on the previous CPU, rather than grabbing other CPUs'
lock.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)

*Consider the Standard Deviation, this -7.36% regression might not be valid.

Also, a OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
But according to Mel, it is unlikely the CPU7 will be idle all the time
because CPU7 could pull some tasks via CPU_NEWLY_IDLE.

lkp(kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this
might introduce more context switch, especially involuntary preemption, which
impacts a busy stress-ng. This regression has shown that, not all benchmarks
in every scenario benefit from idle CPU scan limit, and it needs further
investigation.

Besides, there is slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs. Because the cost to search for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large.
The smaller the CPU number this LLC domain has, the larger nr_scan will be
expanded. This needs further investigation.

There is also ongoing work[2] from Abel to filter out the busy CPUs during
wakeup, to further speed up the idle CPU scan. And it could be a following-up
optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com

Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
2024-11-10 06:59:46 +08:00