Commit Graph

873722 Commits

Author SHA1 Message Date
Jianping Liu ac8052d038 dist,sepc: supprot kernel-debug in core and modules and devel rpm
When CONFIG="generic-release", %{rpm_name} is kernel, when
CONFIG="generic-debug", %{rpm_name} is kernel-debug.

Provides: kernel-debug-core in kernel-debug-core rpm
Provides: kernel-debug-modules in kernel-debug-modules rpm
Provides: kernel-debug-devel in kernel-debug-devel rpm

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-09-27 11:03:34 +08:00
Jianping Liu 3ffed28eb5 dist,Makefile: generic-debug config only build kernel rpm
We intend to archive kernle-debug rpm in yum. Release kernel will
build perf/tools/bpf-tools rpm, to avoid kernle-debug build the same
rpm, disable them.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-09-25 17:18:16 +08:00
Alex Shi 4c67e8518e compiler: fix instrumentation_begin redefine issue
[tapd]
https://tapd.woa.com/20422414/prong/stories/view/1020422414114939518

The contents (instrumentation_begin/instrumentation_end) already defined
in include/linux/instrumentation.h

Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-09-25 17:17:50 +08:00
Jianping Liu a8e4f391c2 dist: release 5.4.119-20.0009.34
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-09-23 13:08:36 +08:00
Toke Høiland-Jørgensen b902278a4a bpf: Always return target ifindex in bpf_fib_lookup
commit d1c362e1dd upstream.

The bpf_fib_lookup() helper performs a neighbour lookup for the destination
IP and returns BPF_FIB_LKUP_NO_NEIGH if this fails, with the expectation
that the BPF program will pass the packet up the stack in this case.
However, with the addition of bpf_redirect_neigh() that can be used instead
to perform the neighbour lookup, at the cost of a bit of duplicated work.

For that we still need the target ifindex, and since bpf_fib_lookup()
already has that at the time it performs the neighbour lookup, there is
really no reason why it can't just return it in any case. So let's just
always return the ifindex if the FIB lookup itself succeeds.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Link: https://lore.kernel.org/bpf/20201009184234.134214-1-toke@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-09-23 13:03:46 +08:00
Jianping Liu b11f5368c2 dist: release 5.4.119-20.0009.33
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-09-19 20:00:47 +08:00
Jianping Liu 23786a77d2 config,arm64: open some infiniband config
To support ZhongXing NIC RDMA, open CONFIGs as below:
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ON_DEMAND_PAGING=y
2024-09-19 19:54:38 +08:00
Jason Xing 8690c0362b sssnic: add one dependency in Kconfig
Getting rid of prefix 'CONFIG_' can solve the issue. In the
previous patch, I missed one place.

Most importantly, I added dependency on CONFIG_PCI_ATS since
This drivers relys on CONFIG_PCI_ATS, so we need to adjust in
Kconfig files.

Fixes: ee280c4189a13 ("sssnic: support this new driver")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
2024-06-27 10:57:00 +08:00
刘诗 67295d85cf drm: support virtualbox display
insmod oc iso by virtualbox, need adjust accuracy.

Signed-off-by: aurelianliu <aurelianliu@tencent.com>
2024-06-13 15:03:27 +08:00
Jianping Liu e33c4e5ff8 dist: release 5.4.119-20.0009.32
Upstream: no

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 21:03:06 +08:00
Jianping Liu d0ebc6b8e3 config: sync config to origin 5.4.119-20.0009.32
Sync config.default_kasan and config.performance to the same with
5.4.119-20.0009.32. Note, config.default_kasan and config.performance
are useless now.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 21:01:05 +08:00
Fuhai Wang 47f0d31be8 netfilter: nf_tables: reject QUEUE/DROP verdict parameters
commit f342de4e2f33e0e39165d8639387aa6c19dff660 upstream.

This reverts commit e0abdadcc6.

core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP
verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar,
or 0.

Due to the reverted commit, its possible to provide a positive
value, e.g. NF_ACCEPT (1), which results in use-after-free.

Its not clear to me why this commit was made.

NF_QUEUE is not used by nftables; "queue" rules in nftables
will result in use of "nft_queue" expression.

If we later need to allow specifiying errno values from userspace
(do not know why), this has to call NF_DROP_GETERR and check that
"err <= 0" holds true.

Fixes: e0abdadcc6 ("netfilter: nf_tables: accept QUEUE/DROP verdict parameters")
Cc: stable@vger.kernel.org
Reported-by: Notselwyn <notselwyn@pwning.tech>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>
Signed-off-by: caelli <caelli@tencent.com>
2024-06-11 20:52:46 +08:00
Vidya Sagar 0baa807ed7 PCI/MSI: Set device flag indicating only 32-bit MSI support
commit 2053230af1 upstream.

The MSI-X Capability requires devices to support 64-bit Message Addresses,
but the MSI Capability can support either 32- or 64-bit addresses.

Previously, we set dev->no_64bit_msi for a few broken devices that
advertise 64-bit MSI support but don't correctly support it.

In addition, check the MSI "64-bit Address Capable" bit for all devices and
set dev->no_64bit_msi for devices that don't advertise 64-bit support.
This allows msi_verify_entries() to catch arch code defects that assign
64-bit addresses when they're not supported.

The warning is helpful to find defects like the one fixed by
https://lore.kernel.org/r/20201117165312.25847-1-vidyas@nvidia.com

[bhelgaas: set no_64bit_msi in pci_msi_init(), commit log]
Link: https://lore.kernel.org/r/20201124105035.24573-1-vidyas@nvidia.com
Link: https://lore.kernel.org/r/20201203185110.1583077-4-helgaas@kernel.org
Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:52:46 +08:00
Bjorn Helgaas fa40951618 PCI/MSI: Move MSI/MSI-X init to msi.c
commit cbc40d5c33 upstream.

Conflicts:
        drivers/pci/probe.c
	conflict with the comment's position

Move pci_msi_setup_pci_dev(), which disables MSI and MSI-X interrupts, from
probe.c to msi.c so it's with all the other MSI code and more consistent
with other capability initialization.  This means we must compile msi.c
always, even without CONFIG_PCI_MSI, so wrap the rest of msi.c in an #ifdef
and adjust the Makefile accordingly.  No functional change intended.

Link: https://lore.kernel.org/r/20201203185110.1583077-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:52:45 +08:00
Chen Siyu f6f1ca6db5 update can driver for phytium D2000 Soc
fix the problem for setting bitrate on phytium D2000 Soc

Signed-off-by: Chen Siyu <chensiyu1321@phytium.com.cn>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:52:45 +08:00
wangyong 70f373423d Merge CVE-2021-3760 bugfix patch, and the source of the patch
is as follows:
	1b1499a nfc: nci: fix the UAF of rf_conn_info object

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:52:45 +08:00
wjl00563 6d6ee45625 powerpc: export arch_trigger_cpumask_backtrace
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:52:44 +08:00
Duanqiang Wen 1b321e2778 config: add txgbe mod default option
in X86_64 or arm64 arch,
set CONFIG_TXGBE=m.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
2024-06-11 20:52:44 +08:00
Duanqiang Wen b72144b5bd net: wangxun: txgbe: support wangxun 10GbE driver
add support for Wangxun 10GbE driver, source files and
functions are the same as wangxun out of box drivevr
release version txgbe-1.3.5.1.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
2024-06-11 20:52:44 +08:00
Duanqiang Wen 50bf5d4ebd config: add ngbe mod default option
in X86_64 or arm64 arch,
set CONFIG_WANGXUN=y, and CONFIG_NGBE=m.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
2024-06-11 20:52:43 +08:00
Duanqiang Wen 467b4bd334 net: wangxun: ngbe: support wangxun 1GbE driver
add support for wangxun 1GbE driver, source files and functions
are the same as wangxun oob ngbe-1.2.5.3.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
2024-06-11 20:52:43 +08:00
Jason Xing 2ae9cf86d2 sssnic: support this new driver
All files are extracted from 3snic-eth-3s9xx-driver-sssnic-1.0.6.4-1-src.tar.gz.

Here is some key information:
1)please do not remove those #ifdef for compatability something like this
because it could hinder your steps.
2) replace four files as wrote in scripts/release.sh file:
cp -p -f $MK_DIR/replace/makefile_hw $HW_DIR/Makefile
cp -p -f $MK_DIR/replace/makefile_nic $NIC_DIR/Makefile
cp -p -f $MK_DIR/replace/sss_linux_kernel.h $CUR_DIR/include/kernel/sss_linux_kernel.h
cp -p -f $MK_DIR/replace/sss_hwdev_link.c $CUR_DIR/hw/sss_hwdev_link.c
The reason here is very simple: compatability.
3) Add vlan config dependency in Kconfig and do not add more specific configs into
some config files.
4) get rid of "Werror" in Makefile.
5) If someone is willing to update to a new version, please keep the makefile and config
untouched which I rewrote for the compatability.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-06-11 20:52:42 +08:00
Jianping Liu bf0d1166c8 dist: provide kernel version info in kernel*core*.rpm
Other software, such as anaconda, need kernel*core*.rpm provide kernel
version info.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-06-11 20:52:42 +08:00
Alex Shi 2c9d8a93d6 config/x86: add usb net driver support
Enable them as modules.

Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:42 +08:00
Alex Shi 217e598d36 Version: 5.4.119-20.0009
This is new version of 5.4.119-20.0009.

Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:41 +08:00
Amir Goldstein 8c16d4b7aa fsnotify: invalidate dcache before IN_DELETE event
commit a37d9a17f0 upstream.

Apparently, there are some applications that use IN_DELETE event as an
invalidation mechanism and expect that if they try to open a file with
the name reported with the delete event, that it should not contain the
content of the deleted file.

Commit 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.

This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.

To fix the regression, create a new hook fsnotify_delete() that takes
the unlinked inode as an argument and use a helper d_delete_notify() to
pin the inode, so we can pass it to fsnotify_delete() after d_delete().

Backporting hint: this regression is from v5.3. Although patch will
apply with only trivial conflicts to v5.4 and v5.10, it won't build,
because fsnotify_delete() implementation is different in each of those
versions (see fsnotify_link()).

A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
filesystem that do not need to call d_delete().

Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:41 +08:00
Kees Cook 75e3d56004 exec: Force single empty string when argv is empty
commit dcd46d897a upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to that just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
[vegard: fixed conflicts due to missing
 886d7de631da71e30909980fdbf318f7caade262^- and
 3950e975431bc914f7e81b8f2a2dbdf2064acb0f^-]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sixtywang <sixtywang@tencent.com>
2024-06-11 20:52:41 +08:00
Amir Goldstein b329e7371a inotify: show inotify mask flags in proc fdinfo
commit a32e697cda upstream.

The inotify mask flags IN_ONESHOT and IN_EXCL_UNLINK are not "internal
to kernel" and should be exposed in procfs fdinfo so CRIU can restore
them.

Fixes: 6933599697 ("inotify: hide internal kernel bits from fdinfo")
Link: https://lore.kernel.org/r/20220422120327.3459282-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:40 +08:00
Jiakun Shuai 6c49ee52b1 OC-Phytium config for OpenCloudOS 8
oc-phytium.config is presets for the kernel driver function support
of OpenCloudOS 8 on the Phytium desktop and embedded platforms.

Signed-off-by: Jiakun Shuai <shuaijiakun1288@phytium.com.cn>
2024-06-11 20:52:40 +08:00
Jiakun Shuai ffb71a516d Default disabled Phytium drivers on OpenCloudOS 8.8
This commit documents the Phytium kernel driver module that is
disabled by default on OpenCloudOS 8.8. Users can configure and
enable these modules as needed.

The reason why the module is disabled by default: It may cause conflicts
on platforms other than Phytium, or it is disabled by default due
to single and rare usage scenarios, in order to improve the efficiency
of kernel building.

Signed-off-by: Jiakun Shuai <shuaijiakun1288@phytium.com.cn>
2024-06-11 20:52:40 +08:00
Alex Shi c0ddf78f1f dist: release 5.4.119-20.0009.28
Upstream: no

Signed-off-by: Alex Shi <alexsshi@tencent.com>
2024-06-11 20:52:39 +08:00
Christoph Hellwig b868ff6759 virtio-blk: remove VIRTIO_BLK_F_SCSI support
[upstream commit: 782e067dba]

Since the need for a special flag to support SCSI passthrough on a
block device was added in May 2017 the SCSI passthrough support in
virtio-blk has been disabled.  It has always been a bad idea
(just ask the original author..) and we have virtio-scsi for proper
passthrough.  The feature also never made it into the virtio 1.0
or later specifications.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: XiaoLei Zhu <leonzzhu@tencent.com>
2024-06-11 20:52:37 +08:00
Peter Zijlstra 66c8ef4a22 cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
commit 32d4fd5751 upstream.

Commit c227233ad6 ("intel_idle: enable interrupts before C1 on
Xeons") wrecked intel_idle in two ways:

 - must not have tracing in idle functions
 - must return with IRQs disabled

Additionally, it added a branch for no good reason.

Fixes: c227233ad6 ("intel_idle: enable interrupts before C1 on Xeons")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ rjw: Moved the intel_idle() kerneldoc comment next to the function ]
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:29 +08:00
Artem Bityutskiy 6f7b0b5f2b intel_idle: make SPR C1 and C1E be independent
commit 1548fac47a upstream.

This patch partially reverts the changes made by the following commit:

da0e58c038 intel_idle: add 'preferred_cstates' module argument

As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
C1E states were mutually exclusive, so that users could only have either C1 and
C6, or C1E and C6.

However, Intel firmware engineers managed to remove this limitation and make C1
and C1E to be completely independent, just like on previous Xeon platforms.

Therefore, this patch:
 * Removes commentary describing the old, and now non-existing SPR C1E
   limitation.
 * Marks SPR C1E as available by default.
 * Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
   C1E will be available regardless of 'preferred_cstates' value.

We expect that all SPR systems are shipping with new firmware, which includes
the C1/C1E improvement.

Cc: v5.18+ <stable@vger.kernel.org> # v5.18+
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:29 +08:00
Artem Bityutskiy 3d77bca89a intel_idle: Fix the 'preferred_cstates' module parameter
commit 39c184a6a9 upstream.

Problem description.

When user boots kernel up with the 'intel_idle.preferred_cstates=4' option,
we enable C1E and disable C1 states on Sapphire Rapids Xeon (SPR). In order
for C1E to work on SPR, we have to enable the C1E promotion bit on all
CPUs.  However, we enable it only on one CPU.

Fix description.

The 'intel_idle' driver already has the infrastructure for disabling C1E
promotion on every CPU. This patch uses the same infrastructure for
enabling C1E promotion on every CPU. It changes the boolean
'disable_promotion_to_c1e' variable to a tri-state 'c1e_promotion'
variable.

Tested on a 2-socket SPR system. I verified the following combinations:

 * C1E promotion enabled and disabled in BIOS.
 * Booted with and without the 'intel_idle.preferred_cstates=4' kernel
   argument.

In all 4 cases C1E promotion was correctly set on all CPUs.

Also tested on an old Broadwell system, just to make sure it does not cause
a regression. C1E promotion was correctly disabled on that system, both C1
and C1E were exposed (as expected).

Fixes: da0e58c038 ("intel_idle: add 'preferred_cstates' module argument")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Minor changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy 0f46fd086e intel_idle: Fix SPR C6 optimization
commit 7eac3bd38d upstream.

The Sapphire Rapids (SPR) C6 optimization was added to the end of the
'spr_idle_state_table_update()' function. However, the function has a
'return' which may happen before the optimization has a chance to run.
And this may prevent the optimization from happening.

This is an unlikely scenario, but possible if user boots with, say,
the 'intel_idle.preferred_cstates=6' kernel boot option.

This patch fixes the issue by eliminating the problematic 'return'
statement.

Fixes: 3a9cf77b60 ("intel_idle: add core C6 optimization for SPR")
Suggested-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Minor changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy b39482cff5 intel_idle: add core C6 optimization for SPR
commit 3a9cf77b60 upstream.

Add a Sapphire Rapids Xeon C6 optimization, similar to what we have for Sky Lake
Xeon: if package C6 is disabled, adjust C6 exit latency and target residency to
match core C6 values, instead of using the default package C6 values.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:28 +08:00
Artem Bityutskiy 53e20e738f intel_idle: add 'preferred_cstates' module argument
commit da0e58c038 upstream.

On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
exclusive - only one of them can be enabled. By default, 'intel_idle' driver
enables C1 and disables C1E. However, some users prefer to use C1E instead of
C1, because it saves more energy.

This patch adds a new module parameter ('preferred_cstates') for enabling C1E
and disabling C1. Here is the idea behind it.

1. This option has effect only for "mutually exclusive" C-states like C1 and
   C1E on SPR.
2. It does not have any effect on independent C-states, which do not require
   other C-states to be disabled (most states on most platforms as of today).
3. For mutually exclusive C-states, the 'intel_idle' driver always has a
   reasonable default, such as enabling C1 on SPR by default. On other
   platforms, the default may be different.
4. Users can override the default using the 'preferred_cstates' parameter.
5. The parameter accepts the preferred C-states bit-mask, similarly to the
   existing 'states_off' parameter.
6. This parameter is not limited to C1/C1E, and leaves room for supporting
   other mutually exclusive C-states, if they come in the future.

Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
to disable C1 and enable C1E, users should boot with the following kernel
argument: intel_idle.preferred_cstates=4

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Tony Luck 223b89f1b6 x86/cpu: Add Sapphire Rapids CPU model number
commit be25d1b5ea upstream.

Latest edition (039) of "Intel Architecture Instruction Set Extensions
and Future Features Programming Reference" includes three new CPU model
numbers. Linux already has the two Ice Lake server ones. Add the new
model number for Sapphire Rapids.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200603173352.15506-1-tony.luck@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Artem Bityutskiy f79fb80ce5 intel_idle: add SPR support
commit 9edf3c0ffe upstream.

Add Sapphire Rapids Xeon support.

Up until very recently, the C1 and C1E C-states were independent, but this
has changed in some new chips, including Sapphire Rapids Xeon (SPR). In these
chips the C1 and C1E states cannot be enabled at the same time. The "C1E
promotion" bit in 'MSR_IA32_POWER_CTL' also has its semantics changed a bit.

Here are the C1, C1E, and "C1E promotion" bit rules on Xeons before SPR.

1. If C1E promotion bit is disabled.
   a. C1  requests end up with C1  C-state.
   b. C1E requests end up with C1E C-state.
2. If C1E promotion bit is enabled.
   a. C1  requests end up with C1E C-state.
   b. C1E requests end up with C1E C-state.

Here are the C1, C1E, and "C1E promotion" bit rules on Sapphire Rapids Xeon.
1. If C1E promotion bit is disabled.
   a. C1  requests end up with C1 C-state.
   b. C1E requests end up with C1 C-state.
2. If C1E promotion bit is enabled.
   a. C1  requests end up with C1E C-state.
   b. C1E requests end up with C1E C-state.

Before SPR Xeon, the 'intel_idle' driver was disabling C1E promotion and was
exposing C1 and C1E as independent C-states. But on SPR, C1 and C1E cannot be
enabled at the same time.

This patch adds both C1 and C1E states. However, C1E is marked as with the
"CPUIDLE_FLAG_UNUSABLE" flag, which means that in won't be registered by
default. The C1E promotion bit will be cleared, which means that by default
only C1 and C6 will be registered on SPR.

The next patch will add an option for enabling C1E and disabling C1 on SPR.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:27 +08:00
Artem Bityutskiy 686588a7bb intel_idle: enable interrupts before C1 on Xeons
commit c227233ad6 upstream.

Enable local interrupts before requesting C1 on the last two generations
of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured
with the 'wult' tool.

The '->enter()' function of the driver enters C-states with local
interrupts disabled by executing the 'monitor' and 'mwait' pair of
instructions. If an interrupt happens, the CPU exits the C-state and
continues executing instructions after 'mwait'. It does not jump to
the interrupt handler, because local interrupts are disabled. The
cpuidle subsystem enables interrupts a bit later, after doing some
housekeeping.

With this patch, we enable local interrupts before requesting C1. In
this case, if the CPU wakes up because of an interrupt, it will jump
to the interrupt handler right away. The cpuidle housekeeping will be
done after the pending interrupt(s) are handled.

Enabling interrupts before entering a C-state has measurable impact
for faster C-states, like C1. Deeper, but slower C-states like C6 do
not really benefit from this sort of change, because their latency is
a lot higher comparing to the delay added by cpuidle housekeeping.

This change was also tested with cyclictest and dbench. In case of Ice
Lake, the average cyclictest latency decreased by 5.1%, and the average
'dbench' throughput increased by about 0.8%. Both tests were run for 4
hours with only C1 enabled (all other idle states, including 'POLL',
were disabled). CPU frequency was pinned to HFM, and uncore frequency
was pinned to the maximum value. The other platforms had similar
single-digit percentage improvements.

It is worth noting that this patch affects 'cpuidle' statistics a tiny
bit.  Before this patch, C1 residency did not include the interrupt
handling time, but with this patch, it will include it. This is similar
to what happens in case of the 'POLL' state, which also runs with
interrupts enabled.

Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Artem Bityutskiy dcaa41dcb4 intel_idle: add Iclelake-D support
commit 22141d5f41 upstream.

This patch adds Icelake Xeon D support to the intel_idle driver.

Since Icelake D and Icelake SP C-state characteristics the same,
we use Icelake SP C-states table for Icelake D as well.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Tom Rix ce6df7b720 intel_idle: remove definition of DEBUG
commit 651bc5816c upstream.

Defining DEBUG should only be done in development.
So remove DEBUG.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:26 +08:00
Peter Zijlstra 91d50bd49f intel_idle: Build fix
commit 4d916140bf upstream.

Because CONFIG_ soup.

Fixes: 6e1d2bc675 ("intel_idle: Fix intel_idle() vs tracing")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201130115402.GO3040@hirez.programming.kicks-ass.net
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Peter Zijlstra 69a0944244 intel_idle: Fix intel_idle() vs tracing
commit 6e1d2bc675 upstream.

cpuidle->enter() callbacks should not call into tracing because RCU
has already been disabled. Instead of doing the broadcast thing
itself, simply advertise to the cpuidle core that those states stop
the timer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20201123143510.GR3021@hirez.programming.kicks-ass.net
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Chen Yu 24b0b982e5 intel_idle: Fix max_cstate for processor models without C-state tables
commit 4e0ba5577d upstream.

Currently intel_idle driver gets the c-state information from ACPI
_CST if the processor model is not recognized by it. However the
c-state in _CST starts with index 1 which is different from the
index in intel_idle driver's internal c-state table.

While intel_idle_max_cstate_reached() was previously introduced to
deal with intel_idle driver's internal c-state table, re-using
this function directly on _CST is incorrect.

Fix this by subtracting 1 from the index when checking max_cstate
in the _CST case.

For example, append intel_idle.max_cstate=1 in boot command line,
Before the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
POLL
After the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI

Fixes: 18734958e9 ("intel_idle: Use ACPI _CST for processor models without C-state tables")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:25 +08:00
Mel Gorman dc516a70f4 intel_idle: Ignore _CST if control cannot be taken from the platform
commit 75af76d0a3 upstream.

e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems") avoids
enabling c-states that have been disabled by the platform with the
exception of C1E.

Unfortunately, BIOS implementations are not always consistent in terms
of how capabilities are advertised and control cannot always be handed
over. If control cannot be handed over then intel_idle reports that "ACPI
_CST not found or not usable" but does not clear acpi_state_table.count
meaning the information is still partially used.

This patch ignores ACPI information if CST control cannot be requested from
the platform. This was only observed on a number of Haswell platforms that
had identical CPUs but not identical BIOS versions.  While this problem
may be rare overall, 24 separate test cases bisected to this specific
commit across 4 separate test machines and is worth addressing. If the
situation occurs, the kernel behaves as it did before commit e6d4f08a67
and uses any c-states that are discovered.

The affected test cases were all ones that involved a small number of
processes -- exec microbenchmark, pipe microbenchmark, git test suite,
netperf, tbench with one client and system call microbenchmark. Each
case benefits from being able to use turboboost which is prevented if the
lower c-states are unavailable. This may mask real regressions specific
to older hardware so it is worth addressing.

C-state status before and after the patch

5.9.0-vanilla            POLL     latency:0      disabled:0 default:enabled
5.9.0-vanilla            C1       latency:2      disabled:0 default:enabled
5.9.0-vanilla            C1E      latency:10     disabled:0 default:enabled
5.9.0-vanilla            C3       latency:33     disabled:1 default:disabled
5.9.0-vanilla            C6       latency:133    disabled:1 default:disabled
5.9.0-ignore-cst-v1r1    POLL     latency:0      disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C1       latency:2      disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C1E      latency:10     disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C3       latency:33     disabled:0 default:enabled
5.9.0-ignore-cst-v1r1    C6       latency:133    disabled:0 default:enabled

Patch enables C3/C6.

Netperf UDP_STREAM

netperf-udp
                                      5.5.0                  5.9.0
                                    vanilla        ignore-cst-v1r1
Hmean     send-64         193.41 (   0.00%)      226.54 *  17.13%*
Hmean     send-128        392.16 (   0.00%)      450.54 *  14.89%*
Hmean     send-256        769.94 (   0.00%)      881.85 *  14.53%*
Hmean     send-1024      2994.21 (   0.00%)     3468.95 *  15.85%*
Hmean     send-2048      5725.60 (   0.00%)     6628.99 *  15.78%*
Hmean     send-3312      8468.36 (   0.00%)    10288.02 *  21.49%*
Hmean     send-4096     10135.46 (   0.00%)    12387.57 *  22.22%*
Hmean     send-8192     17142.07 (   0.00%)    19748.11 *  15.20%*
Hmean     send-16384    28539.71 (   0.00%)    30084.45 *   5.41%*

Fixes: e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Chen Zhuo 5cdab9b63d cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
commit bf9282dc26 upstream.

This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Rafael J. Wysocki fd63805ee8 intel_idle: Add __initdata annotations to init time variables
commit 7f843dd712 upstream.

Annotate static variables cpuidle_state_table and mwait_substates
with __initdata, because they are only used during the initialization
of the driver.

Also notice that static variable icpu could be annotated analogously
and the structure pointed to by it could be __initconst, but two of
its fields are accessed via icpu in intel_idle_cpu_init() and
auto_demotion_disable(), so introduce two new static variables,
auto_demotion_disable_flags and disable_promotion_to_c1e, to hold
the values of these fields, set them during the initialization and
use them in those functions instead of accessing the source data
structure via icpu.

That allows icpu to be annotated with __initdata, so do that,
and it will also allow some __initconst annotations to be added
subsequently.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:24 +08:00
Rafael J. Wysocki 8362f4dbb9 intel_idle: Relocate definitions of cpuidle callbacks
commit 30a996fbb3 upstream.

Move the definitions of intel_idle() and intel_idle_s2idle() before
the definitions of cpuidle_state structures referring to them to
avoid having to use additional declarations of them (and drop those
declarations).

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
2024-06-11 20:51:23 +08:00