OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jianping Liu	139963f9a9	Merge linux 6.6.26 Conflicts: arch/x86/include/asm/cpufeature.h arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/disabled-features.h arch/x86/include/asm/required-features.h arch/x86/kernel/mpparse.c scripts/mod/modpost.c	2024-04-13 17:34:05 +08:00
Jianping Liu	d8ee0e238a	Merge linux 6.6.24 Conflicts: arch/x86/kernel/mpparse.c arch/x86/kvm/reverse_cpuid.h	2024-04-13 14:08:38 +08:00
Jianping Liu	6453a3f30a	Merge linux 6.6.23 Conflicts: drivers/crypto/intel/qat/qat_4xxx/adf_4xxx_hw_data.c Keep drivers/crypto/intel/qat/ code to the same with intel commits in ock.	2024-04-13 13:16:30 +08:00
Jianping Liu	35fb263f3a	Merge linux 6.6.22	2024-04-13 12:44:02 +08:00
Jianping Liu	d34457a4ba	Merge linux 6.6.21	2024-04-13 12:42:06 +08:00
Jianping Liu	e707420134	Support zhaoxin, Merge pull request !117 from LeoLiu-oc/next-6.6-85-address_space	2024-04-12 20:46:41 +08:00
leoliu-oc	c1b0dbcc39	KVM: x86: Introduce support for Zhaoxin ZXPAUSE instruction This patch introduces support for the ZXPAUSE instruction, a new addition akin to Intel's TPAUSE. Two primary distinctions set apart ZXPAUSE from TPAUSE: 1. ZXPAUSE utilizes a delta tsc, determined from the lesser value between (MSR_ZX_PAUSE_CONTROL[31:2] << 2) and the EDX:EAX input to the ZXPAUSE instruction, subtracted from the current tsc value. In contrast, TPAUSE employs a target tsc, computed from the lesser value between (MSR_IA32_UMWAIT_CONTROL[31:2] << 2) and the EDX:EAX input to the TPAUSE instruction. 2. As of now, ZXPAUSE exclusively supports the C0.1 optimization state, whereas TPAUSE potentially extends support to both C0.1 and C0.2. Successful integration of this patch hinges on QEMU's backing for ZXPAUSE, a contribution we're currently forwarding to QEMU. It also requires the preceding patch in this patchset, which offers Linux kernel support for ZXPAUSE. The choice of the name "vmx->msr_ia32_umwait_control" is deliberate. In patches for other Linux versions (e.g., 5.5), a "vmx->msr_ia32_umwait_control" already exists. By sharing this variable name with Intel, it ensures compatibility. The difference is merely software-based and poses no real-world conflicts. Currently, if the Guest writes to the ZXPAUSE/TPAUSE CONTROL MSR, we simply bypass the WRMSR instruction. If the Guest attempts to use ZXPAUSE/TPAUSE to transition the vCPU into an optimized state, it will succeed, with the duration of the optimized state being the value passed in EDX:EAX. Of course, this state can be interrupted by external interrupts and other events specified in the specification. Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>	2024-04-12 20:42:02 +08:00
leoliu-oc	da20c0c659	x86/delay: add support for Zhaoxin ZXPAUSE instruction ZXPAUSE instructs the processor to enter an implementation-dependent optimized state. The instruction execution wakes up when the time-stamp counter reaches or exceeds the implicit EDX:EAX 64-bit input value. The instruction execution also wakes up due to the expiration of the operating system time-limit or by an external interrupt. ZXPAUSE is available on processors with X86_FEATURE_ZXPAUSE. ZXPAUSE allows the processor to enter a light-weight power/performance optimized state (C0.1 state) for a period specified by the instruction or until the system time limit. MSR_ZX_PAUSE_CONTROL MSR register allows the OS to enable/disable C0.2 on the processor and to set the maximum time the processor can reside in C0.1 or C0.2. By default C0.2 is disabled. A sysfs interface to adjust the time and the C0.2 enablement is provided in a follow up change. Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>	2024-04-12 20:42:02 +08:00
Davide Caratti	43eca11b7c	mptcp: don't account accept() of non-MPC client as fallback to TCP commit 7a1b3490f47e88ec4cbde65f1a77a0f4bc972282 upstream. Current MPTCP servers increment MPTcpExtMPCapableFallbackACK when they accept non-MPC connections. As reported by Christoph, this is "surprising" because the counter might become greater than MPTcpExtMPCapableSYNRX. MPTcpExtMPCapableFallbackACK counter's name suggests it should only be incremented when a connection was seen using MPTCP options, then a fallback to TCP has been done. Let's do that by incrementing it when the subflow context of an inbound MPC connection attempt is dropped. Also, update mptcp_connect.sh kselftest, to ensure that the above MIB does not increment in case a pure TCP client connects to a MPTCP server. Fixes: `fc518953bc` ("mptcp: add and use MIB counter infrastructure") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/449 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240329-upstream-net-20240329-fallback-mib-v1-1-324a8981da48@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:36:06 +02:00
Matthieu Baerts (NGI0)	5b5ff82491	selftests: mptcp: connect: fix shellcheck warnings commit e3aae1098f109f0bd33c971deff1926f4e4441d0 upstream. shellcheck recently helped to prevent issues. It is then good to fix the other harmless issues in order to spot "real" ones later. Here, two categories of warnings are now ignored: - SC2317: Command appears to be unreachable. The cleanup() function is invoked indirectly via the EXIT trap. - SC2086: Double quote to prevent globbing and word splitting. This is recommended, but the current usage is correct and there is no need to do all these modifications to be compliant with this rule. For the modifications: - SC2034: ksft_skip appears unused. - SC2181: Check exit code directly with e.g. 'if mycmd;', not indirectly with $?. - SC2004: $/${} is unnecessary on arithmetic variables. - SC2155: Declare and assign separately to avoid masking return values. - SC2166: Prefer [ p ] && [ q ] as [ p -a q ] is not well defined. - SC2059: Don't use variables in the printf format string. Use printf '..%s..' "$foo". Now this script is shellcheck (0.9.0) compliant. We can easily spot new issues. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240306-upstream-net-next-20240304-selftests-mptcp-shared-code-shellcheck-v2-8-bc79e6e5e6a0@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:36:06 +02:00
Edward Liaw	782baf52e7	selftests/mm: include strings.h for ffsl commit 176517c9310281d00dd3210ab4cc4d3cdc26b17e upstream. Got a compilation error on Android for ffsl after 91b80cc5b39f ("selftests: mm: fix map_hugetlb failure on 64K page size systems") included vm_util.h. Link: https://lkml.kernel.org/r/20240329185814.16304-1-edliaw@google.com Fixes: `af605d26a8` ("selftests/mm: merge util.h into vm_util.h") Signed-off-by: Edward Liaw <edliaw@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:36:04 +02:00
Jakub Kicinski	690e877ca2	selftests: reuseaddr_conflict: add missing new line at the end of the output commit 31974122cfdeaf56abc18d8ab740d580d9833e90 upstream. The netdev CI runs in a VM and captures serial, so stdout and stderr get combined. Because there's a missing new line in stderr the test ends up corrupting KTAP: # Successok 1 selftests: net: reuseaddr_conflict which should have been: # Success ok 1 selftests: net: reuseaddr_conflict Fixes: `422d8dc6fd` ("selftest: add a reuseaddr test") Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240329160559.249476-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:35:53 +02:00
Antoine Tenart	39864092cf	selftests: net: gro fwd: update vxlan GRO test expectations commit 0fb101be97ca27850c5ecdbd1269423ce4d1f607 upstream. UDP tunnel packets can't be GRO in-between their endpoints as this causes different issues. The UDP GRO fwd vxlan tests were relying on this and their expectations have to be fixed. We keep both vxlan tests and expected no GRO from happening. The vxlan UDP GRO bench test was removed as it's not providing any valuable information now. Fixes: `a062260a9d` ("selftests: net: add UDP GRO forwarding self-tests") Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:35:52 +02:00
Geliang Tang	117eed2997	selftests: mptcp: join: fix dev in check_endpoint commit 40061817d95bce6dd5634a61a65cd5922e6ccc92 upstream. There's a bug in pm_nl_check_endpoint(), 'dev' didn't be parsed correctly. If calling it in the 2nd test of endpoint_tests() too, it fails with an error like this: creation [FAIL] expected '10.0.2.2 id 2 subflow dev dev' \ found '10.0.2.2 id 2 subflow dev ns2eth2' The reason is '$2' should be set to 'dev', not '$1'. This patch fixes it. Fixes: `69c6ce7b6e` ("selftests: mptcp: add implicit endpoint test case") Cc: stable@vger.kernel.org Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240329-upstream-net-20240329-fallback-mib-v1-2-324a8981da48@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:35:49 +02:00
Borislav Petkov (AMD)	d2be2f872f	x86/CPU/AMD: Add X86_FEATURE_ZEN1 [ Upstream commit 232afb557835d6f6859c73bf610bad308c96b131 ] Add a synthetic feature flag specifically for first generation Zen machines. There's need to have a generic flag for all Zen generations so make X86_FEATURE_ZEN be that flag. Fixes: 30fa92832f40 ("x86/CPU/AMD: Add ZenX generations flags") Suggested-by: Brian Gerst <brgerst@gmail.com> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/dc3835e3-0731-4230-bbb9-336bbe3d042b@amd.com Stable-dep-of: c7b2edd8377b ("perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-10 16:35:47 +02:00
Ido Schimmel	7fb8b3de7f	selftests: vxlan_mdb: Fix failures with old libnet [ Upstream commit f1425529c33def8b46faae4400dd9e2bbaf16a05 ] Locally generated IP multicast packets (such as the ones used in the test) do not perform routing and simply egress the bound device. However, as explained in commit `8bcfb4ae4d` ("selftests: forwarding: Fix failing tests with old libnet"), old versions of libnet (used by mausezahn) do not use the "SO_BINDTODEVICE" socket option. Specifically, the library started using the option for IPv6 sockets in version 1.1.6 and for IPv4 sockets in version 1.2. This explains why on Ubuntu - which uses version 1.1.6 - the IPv4 overlay tests are failing whereas the IPv6 ones are passing. Fix by specifying the source and destination MAC of the packets which will cause mausezahn to use a packet socket instead of an IP socket. Fixes: `62199e3f16` ("selftests: net: Add VXLAN MDB test") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Closes: https://lore.kernel.org/netdev/5bb50349-196d-4892-8ed2-f37543aa863f@alu.unizg.hr/ Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20240325075030.2379513-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-10 16:35:42 +02:00
Jakub Kicinski	5c05bdd95f	tools: ynl: fix setting presence bits in simple nests [ Upstream commit f6c8f5e8694c7a78c94e408b628afa6255cc428a ] When we set members of simple nested structures in requests we need to set "presence" bits for all the nesting layers below. This has nothing to do with the presence type of the last layer. Fixes: `be5bea1cc0` ("net: add basic C code generators for Netlink") Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240321020214.1250202-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-10 16:35:41 +02:00
Natanael Copa	3a9569441b	tools/resolve_btfids: fix build with musl libc commit 62248b22d01e96a4d669cde0d7005bd51ebf9e76 upstream. Include the header that defines u32. This fixes build of 6.6.23 and 6.1.83 kernels for Alpine Linux, which uses musl libc. I assume that GNU libc indirecly pulls in linux/types.h. Fixes: 9707ac4fe2f5 ("tools/resolve_btfids: Refactor set sorting with types from btf_ids.h") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218647 Cc: stable@vger.kernel.org Signed-off-by: Natanael Copa <ncopa@alpinelinux.org> Tested-by: Greg Thelen <gthelen@google.com> Link: https://lore.kernel.org/r/20240328110103.28734-1-ncopa@alpinelinux.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-03 15:29:03 +02:00
Edward Liaw	07cf57eba5	selftests/mm: fix ARM related issue with fork after pthread_create commit 8c864371b2a15a23ce35aa7e2bd241baaad6fbe8 upstream. Following issue was observed while running the uffd-unit-tests selftest on ARM devices. On x86_64 no issues were detected: pthread_create followed by fork caused deadlock in certain cases wherein fork required some work to be completed by the created thread. Used synchronization to ensure that created thread's start function has started before invoking fork. [edliaw@google.com: refactored to use atomic_bool] Link: https://lkml.kernel.org/r/20240325194100.775052-1-edliaw@google.com Fixes: `760aee0b71` ("selftests/mm: add tests for RO pinning vs fork()") Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Signed-off-by: Edward Liaw <edliaw@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-03 15:28:55 +02:00
Edward Liaw	fe295de2d5	selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM commit 105840ebd76d8dbc1a7d734748ae320076f3201e upstream. The sigbus-wp test requires the UFFD_FEATURE_WP_HUGETLBFS_SHMEM flag for shmem and hugetlb targets. Otherwise it is not backwards compatible with kernels <5.19 and fails with EINVAL. Link: https://lkml.kernel.org/r/20240321232023.2064975-1-edliaw@google.com Fixes: `73c1ea939b` ("selftests/mm: move uffd sig/events tests into uffd unit tests") Signed-off-by: Edward Liaw <edliaw@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Peter Xu <peterx@redhat.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-03 15:28:55 +02:00
Kan Liang	11b4dc6494	perf top: Use evsel's cpus to replace user_requested_cpus commit 5fa695e7da4975e8d21ce49f3718d6cf00ecb75e upstream. perf top errors out on a hybrid machine $perf top Error: The cycles:P event is not supported. The perf top expects that the "cycles" is collected on all CPUs in the system. But for hybrid there is no single "cycles" event which can cover all CPUs. Perf has to split it into two cycles events, e.g., cpu_core/cycles/ and cpu_atom/cycles/. Each event has its own CPU mask. If a event is opened on the unsupported CPU. The open fails. That's the reason of the above error out. Perf should only open the cycles event on the corresponding CPU. The commit `ef91871c96` ("perf evlist: Propagate user CPU maps intersecting core PMU maps") intersect the requested CPU map with the CPU map of the PMU. Use the evsel's cpus to replace user_requested_cpus. The evlist's threads are also propagated to the evsel's threads in __perf_evlist__propagate_maps(). For a system-wide event, perf appends a dummy event and assign it to the evsel's threads. For a per-thread event, the evlist's thread_map is assigned to the evsel's threads. The same as the other tools, e.g., perf record, using the evsel's threads when opening an event. Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Hector Martin <marcan@marcan.st> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Closes: https://lore.kernel.org/linux-perf-users/ZXNnDrGKXbEELMXV@kernel.org/ Link: https://lore.kernel.org/r/20231214144612.1092028-1-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-03 15:28:53 +02:00
Vitaly Chikunov	78142322a1	selftests/mm: Fix build with _FORTIFY_SOURCE [ Upstream commit 8b65ef5ad4862904e476a8f3d4e4418c950ddb90 ] Add missing flags argument to open(2) call with O_CREAT. Some tests fail to compile if _FORTIFY_SOURCE is defined (to any valid value) (together with -O), resulting in similar error messages such as: In file included from /usr/include/fcntl.h:342, from gup_test.c:1: In function 'open', inlined from 'main' at gup_test.c:206:10: /usr/include/bits/fcntl2.h:50:11: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments 50 \| __open_missing_mode (); \| ^~~~~~~~~~~~~~~~~~~~~~ _FORTIFY_SOURCE is enabled by default in some distributions, so the tests are not built by default and are skipped. open(2) man-page warns about missing flags argument: "if it is not supplied, some arbitrary bytes from the stack will be applied as the file mode." Link: https://lkml.kernel.org/r/20240318023445.3192922-1-vt@altlinux.org Fixes: `aeb85ed4f4` ("tools/testing/selftests/vm/gup_benchmark.c: allow user specified file") Fixes: `fbe37501b2` ("mm: huge_memory: debugfs for file-backed THP split") Fixes: `c942f5bd17` ("selftests: soft-dirty: add test for mprotect") Signed-off-by: Vitaly Chikunov <vt@altlinux.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-03 15:28:53 +02:00
Muhammad Usama Anjum	ccf2d9d2ae	selftests/mm: gup_test: conform test to TAP format output [ Upstream commit cb6e7cae18868422a23d62670110c61fd1b15029 ] Conform the layout, informational and status messages to TAP. No functional change is intended other than the layout of output messages. Link: https://lkml.kernel.org/r/20240102053807.2114200-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 8b65ef5ad486 ("selftests/mm: Fix build with _FORTIFY_SOURCE") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-03 15:28:52 +02:00
Geliang Tang	bae5b98dcf	selftests: mptcp: diag: return KSFT_FAIL not test_cnt commit 45bcc0346561daa3f59e19a753cc7f3e08e8dff1 upstream. The test counter 'test_cnt' should not be returned in diag.sh, e.g. what if only the 4th test fail? Will do 'exit 4' which is 'exit ${KSFT_SKIP}', the whole test will be marked as skipped instead of 'failed'! So we should do ret=${KSFT_FAIL} instead. Fixes: `df62f2ec3d` ("selftests/mptcp: add diag interface tests") Cc: stable@vger.kernel.org Fixes: `42fb6cddec` ("selftests: mptcp: more stable diag tests") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-03 15:28:40 +02:00
Jason A. Donenfeld	49391e9f1e	wireguard: selftests: set RISCV_ISA_FALLBACK on riscv{32,64} [ Upstream commit e995f5dd9a9cef818af32ec60fc38d68614afd12 ] This option is needed to continue booting with QEMU. Recent changes that made this optional meant that it gets unset in the test harness, and so WireGuard CI has been broken. Fix this by simply setting this option. Cc: stable@vger.kernel.org Fixes: `496ea826d1` ("RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-03 15:28:37 +02:00
SeongJae Park	9a06d17abc	selftests/mqueue: Set timeout to 180 seconds [ Upstream commit 85506aca2eb4ea41223c91c5fe25125953c19b13 ] While mq_perf_tests runs with the default kselftest timeout limit, which is 45 seconds, the test takes about 60 seconds to complete on i3.metal AWS instances. Hence, the test always times out. Increase the timeout to 180 seconds. Fixes: `852c8cbf34` ("selftests/kselftest/runner.sh: Add 45 second timeout per test") Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-04-03 15:28:20 +02:00
Dan Schatzberg	6bacfda06c	rue/scx/sched_ext: Add scx_rusty, a rust userspace hybrid scheduler Upstream: no Rusty is a multi-domain BPF / userspace hybrid scheduler where the BPF part does simple round robin in each domain and the userspace part calculates the load factor of each domain and tells the BPF part how to load balance the domains. This scheduler demonstrates dividing scheduling logic between BPF and userspace and using rust to build the userspace part. An earlier variant of this scheduler was used to balance across six domains, each representing a chiplet in a six-chiplet AMD processor, and could match the performance of production setup using CFS. See the --help message for more details. v5: * Renamed to scx_rusty and improve build scripts. * Load metrics are now tracked in BPF using the running average implementation in tools/sched_ext/ravg[_impl].bpf.h and ravg.read.rs.h. Before, the userspace part was iterating all tasks to calculate load metrics and make LB decisions. Now, high level LB decisions are made by simply reading per-domain load averages and Picking migrating target tasks only accesses the load metrics for a fixed number of recently active tasks in the pushing domains. This greatly reduces CPU overhead and makes rusty a lot more scalable. v4: * tools/sched_ext/atropos renamed to tools/sched_ext/scx_atropos for consistency. * LoadBalancer sometimes couldn't converge on balanced state due to restrictions it put on each balancing operation. Fixed. * Topology information refactored into struct Topology and Tuner is added. Tuner runs in shorter cycles (100ms) than LoadBalancer and dynamically adjusts scheduling behaviors, currently, based on the per-domain utilization states. * ->select_cpu() has been revamped. Combined with other improvements, this allows atropos to outperform CFS in various sub-saturation scenarios when tested with fio over dm-crypt. * Many minor code cleanups and improvements. v3: * The userspace code is substantially restructured and rewritten. The binary is renamed to scx_atropos and can now figure out the domain topology automatically based on L3 cache configuration. The LB logic which was rather broken in the previous postings are revamped and should behave better. * Updated to support weighted vtime scheduling (can be turned off with --fifo-sched). Added a couple options (--slice_us, --kthreads-local) to modify scheduling behaviors. * Converted to use BPF inline iterators. v2: * Updated to use generic BPF cpumask helpers. Signed-off-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com>	2024-03-29 11:22:52 +08:00
David Vernet	10ff85e486	rue/scx/sched_ext: Add a basic, userland vruntime scheduler Upstream: no This patch adds a new scx_userland BPF scheduler that implements a fairly unsophisticated sorted-list vruntime scheduler in userland to demonstrate how most scheduling decisions can be delegated to userland. The scheduler doesn't implement load balancing, and treats all tasks as part of a single domain. v3: * Use SCX_BUG[_ON]() to simplify error handling. v2: * Converted to BPF inline iterators. Signed-off-by: David Vernet <dvernet@meta.com> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-03-29 11:22:37 +08:00
Tejun Heo	80c6f83f39	rue/scx/sched_ext: Documentation: scheduler: Document extensible scheduler class Upstream: no Add Documentation/scheduler/sched-ext.rst which gives a high-level overview and pointers to the examples. v4: * README improved, reformatted in markdown and renamed to README.md. v3: * Added tools/sched_ext/README. * Dropped _example prefix from scheduler names. v2: * Apply minor edits suggested by Bagas. Caveats section dropped as all of them are addressed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com>	2024-03-29 11:22:37 +08:00
Tejun Heo	b9bc414794	rue/scx/sched_ext: Add vtime-ordered priority queue to dispatch_q's Upstream: no Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq's are used for simple staging areas for tasks which are ready to execute, it'd make dsq's a lot more useful if they can implement custom ordering. This patch adds a vtime-ordered priority queue to dsq's. When the BPF scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it can specify the vtime tha the task should be inserted at and the task is inserted into the priority queue in the dsq which is ordered according to time_before64() comparison of the vtime values. When executing or consuming the dsq, the FIFO is always processed first and the priority queue is processed iff the FIFO is empty. The design decision was made to allow both FIFO and priority queue to be available at the same timeq for all dsq's for three reasons. First, the new priority queue is useful for the local dsq's too but they also need the FIFO when consuming tasks from other dsq's as the vtimes may not be comparable across them. Second, the interface surface is smaller this way - the only additional interface necessary is scx_bpf_dispsatch_vtime(). Third, the overhead isn't meaningfully different whether they're available at the same time or not. This makes it very easy for the BPF schedulers to implement proper vtime based scheduling within each dsq very easy and efficient at a negligible cost in terms of code complexity and overhead. scx_simple and scx_example_flatcg are updated to default to weighted vtime scheduling (the latter within each cgroup). FIFO scheduling can be selected with -f option. v3: * SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own p->scx.dsq_flags. The flag is protected with the dsq lock unlike other flags in p->scx.flags. This led to flag corruption in some cases. * Add comments explaining the interaction between using consumption of p->scx.slice to determine vtime progress and yielding. v2: * p->scx.dsq_vtime was not initialized on load or across cgroup migrations leading to some tasks being stalled for extended period of time depending on how saturated the machine is. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com>	2024-03-29 11:22:36 +08:00
Tejun Heo	fbd153f2cb	rue/scx/sched_ext: Implement core-sched support Upstream: no The core-sched support is composed of the following parts: * task_struct->scx.core_sched_at is added. This is a timestamp which can be used to order tasks. Depending on whether the BPF scheduler implements custom ordering, it tracks either global FIFO ordering of all tasks or local-DSQ ordering within the dispatched tasks on a CPU. * prio_less() is updated to call scx_prio_less() when comparing SCX tasks. scx_prio_less() calls ops.core_sched_before() if available or uses the core_sched_at timestamp. For global FIFO ordering, the BPF scheduler doesn't need to do anything. Otherwise, it should implement ops.core_sched_before() which reflects the ordering. * When core-sched is enabled, balance_scx() balances all SMT siblings so that they all have tasks dispatched if necessary before pick_task_scx() is called. pick_task_scx() picks between the current task and the first dispatched task on the local DSQ based on availability and the core_sched_at timestamps. Note that FIFO ordering is expected among the already dispatched tasks whether running or on the local DSQ, so this path always compares core_sched_at instead of calling into ops.core_sched_before(). qmap_core_sched_before() is added to scx_qmap. It scales the distances from the heads of the queues to compare the tasks across different priority queues and seems to behave as expected. v3: * Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi. v2: * Sched core added the const qualifiers to prio_less task arguments. Explicitly drop them for ops.core_sched_before() task arguments. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Josh Don <joshdon@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>	2024-03-29 11:22:36 +08:00
David Vernet	fe5374b887	rue/scx/sched_ext: Implement sched_ext_ops.cpu_acquire/release() Upstream: no Scheduler classes are strictly ordered and when a higher priority class has tasks to run, the lower priority ones lose access to the CPU. Being able to monitor and act on these events are necessary for use cases includling strict core-scheduling and latency management. This patch adds two operations ops.cpu_acquire() and .cpu_release(). The former is invoked when a CPU becomes available to the BPF scheduler and the opposite for the latter. This patch also implements scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger requeueing of all tasks in the local dsq of the CPU so that the tasks can be reassigned to other available CPUs. scx_pair is updated to use .cpu_acquire/release() along with %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU is preempted by a higher priority scheduler class. scx_qmap is updated to use .cpu_acquire/release() to empty the local dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers that want to have a tight control over latency. v3: * Drop the const qualifier from scx_cpu_release_args.task. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. v2: * Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local() from ops.cpu_release() nested inside ops.init() and other sleepable operations. Signed-off-by: David Vernet <dvernet@meta.com> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:35 +08:00
Tejun Heo	c775f74a4a	rue/scx/sched_ext: Add a cgroup scheduler which uses flattened hierarchy Upstream: no This patch adds scx_flatcg example scheduler which implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer by compounding the active weight share at each level. This flattening of hierarchy can bring a substantial performance gain when the cgroup hierarchy is nested multiple levels. in a simple benchmark using wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two apache instances competing with 2:1 weight ratio nested four level deep. However, the gain comes at the cost of not being able to properly handle thundering herd of cgroups. For example, if many cgroups which are nested behind a low priority parent cgroup wake up around the same time, they may be able to consume more CPU cycles than they are entitled to. In many use cases, this isn't a real concern especially given the performance gain. Also, there are ways to mitigate the problem further by e.g. introducing an extra scheduling layer on cgroup delegation boundaries. v2: * Use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:29 +08:00
Tejun Heo	eeaadf5a53	rue/scx/sched_ext: Add a cgroup-based core-scheduling scheduler Upstream: no This patch adds scx_pair example scheduler which implements a variant of core scheduling where a hyperthread pair only run tasks from the same cgroup. The BPF scheduler achieves this by putting tasks into per-cgroup queues, time-slicing the cgroup to run for each pair first, and then scheduling within the cgroup. See the header comment in scx_pair.bpf.c for more details. Note that scx_pair's cgroup-boundary guarantee breaks down for tasks running in higher priority scheduler classes. This will be addressed by a followup patch which implements a mechanism to track CPU preemption. v4: * Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]() to simplify error handling. v2: * Improved stride parameter input verification. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:29 +08:00
Tejun Heo	cb7eaf75f3	rue/scx/sched_ext: Add cgroup support Upstream: no Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. Because different BPF schedulers may implement different subsets of CPU control features, allow BPF schedulers to pick which cgroup interface files to enable using SCX_OPS_CGROUP_KNOB_* flags. For now, only the weight knobs are supported but adding more should be straightforward. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v4: * Example schedulers moved into their own patches. * Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: * Make scx_example_pair switch all tasks by default. * Convert to BPF inline iterators. * scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. * scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: * Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com>	2024-03-29 11:22:28 +08:00
Tejun Heo	19c6483f64	rue/scx/sched_ext: Implement tickless support Upstream: no Allow BPF schedulers to indicate tickless operation by setting p->scx.slice to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into tickless operation. scx_central is updated to use tickless operations for all tasks and instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT and task state tracking added by the previous patches. Currently, there is no way to pin the timer on the central CPU, so it may end up on one of the worker CPUs; however, outside of that, the worker CPUs can go tickless both while running sched_ext tasks and idling. With schbench running, scx_central shows: root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 142024 656 664 449 Local timer interrupts LOC: 161663 663 665 449 Local timer interrupts Without it: root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 188778 3142 3793 3993 Local timer interrupts LOC: 198993 5314 6323 6438 Local timer interrupts While scx_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting. v3: * Pin the central scheduler's timer on the central_cpu using BPF_F_TIMER_CPU_PIN. v2: * Convert to BPF inline iterators. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:28 +08:00
Tejun Heo	78e845ec45	rue/scx/sched_ext: Make watchdog handle ops.dispatch() looping stall Upstream: no The dispatch path retries if the local DSQ is still empty after ops.dispatch() either dispatched or consumed a task. This is both out of necessity and for convenience. It has to retry because the dispatch path might lose the tasks to dequeue while the rq lock is released while trying to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch() implementation easier as it only needs to make some forward progress each iteration. However, this makes it possible for ops.dispatch() to stall CPUs by repeatedly dispatching ineligible tasks. If all CPUs are stalled that way, the watchdog or sysrq handler can't run and the system can't be saved. Let's address the issue by breaking out of the dispatch loop after 32 iterations. It is unlikely but not impossible for ops.dispatch() to legitimately go over the iteration limit. We want to come back to the dispatch path in such cases as not doing so risks stalling the CPU by idling with runnable tasks pending. As the previous task is still current in balance_scx(), resched_curr() doesn't do anything - it will just get cleared. Let's instead use scx_kick_bpf() which will trigger reschedule after switching to the next task which will likely be the idle task. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com>	2024-03-29 11:22:27 +08:00
Tejun Heo	fa0819556c	rue/scx/sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU Upstream: no This patch adds a new example scheduler, scx_central, which demonstrates central scheduling where one CPU is responsible for making all scheduling decisions in the system using scx_bpf_kick_cpu(). The central CPU makes scheduling decisions for all CPUs in the system, queues tasks on the appropriate local dsq's and preempts the worker CPUs. The worker CPUs in turn preempt the central CPU when it needs tasks to run. Currently, every CPU depends on its own tick to expire the current task. A follow-up patch implementing tickless support for sched_ext will allow the worker CPUs to go full tickless so that they can run completely undisturbed. v2: * Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Julia Lawall <julia.lawall@inria.fr>	2024-03-29 11:22:27 +08:00
Tejun Heo	cae0d0aef6	rue/scx/sched_ext: Implement scx_bpf_kick_cpu() and task preemption support Upstream: no It's often useful to wake up and/or trigger reschedule on other CPUs. This patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to kick the target CPU into the scheduling path. As a sched_ext task relinquishes its CPU only after its slice is depleted, this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the slice of the target CPU's current task to guarantee that sched_ext's scheduling path runs on the CPU. v4: * Move example scheduler to its own patch. v3: * Make scx_example_central switch all tasks by default. * Convert to BPF inline iterators. v2: * Julia Lawall reported that scx_example_central can overflow the dispatch buffer and malfunction. As scheduling for other CPUs can't be handled by the automatic retry mechanism, fix by implementing an explicit overflow and retry handling. * Updated to use generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:26 +08:00
Tejun Heo	810c41ba27	rue/scx/sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext Upstream: no Currently, to use sched_ext, each task has to be put into sched_ext using sched_setscheduler(2). However, some BPF schedulers and use cases might prefer to service all eligible tasks. This patch adds a new kfunc helper, scx_bpf_switch_all(), that BPF schedulers can call from ops.init() to switch all SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE tasks into sched_ext. This has the benefit that the scheduler swaps are transparent to the users and applications. As we know that CFS is not being used when scx_bpf_switch_all() is used, we can also disable hot path entry points with static_branches. Both the simple and qmap example schedulers are updated to switch all tasks by default to ease testing. '-p' option is added which enables the original behavior of switching only tasks which are explicitly on SCHED_EXT. v2: In the example schedulers, switch all tasks by default. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Suggested-by: Barret Rhoden <brho@google.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:22 +08:00
Tejun Heo	721f169ade	rue/scx/sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT Upstream: no BPF schedulers might not want to schedule certain tasks - e.g. kernel threads. This patch adds p->scx.disallow which can be set by BPF schedulers in such cases. The field can be changed anytime and setting it in ops.prep_enable() guarantees that the task can never be scheduled by sched_ext. scx_qmap is updated with the -d option to disallow a specific PID: # echo $$ 1092 # egrep '(policy)\|(ext\.enabled)' /proc/self/sched policy : 0 ext.enabled : 0 # ./set-scx 1092 # egrep '(policy)\|(ext\.enabled)' /proc/self/sched policy : 7 ext.enabled : 0 Run "scx_qmap -d 1092" in another terminal. # grep rejected /sys/kernel/debug/sched/ext nr_rejected : 1 # egrep '(policy)\|(ext\.enabled)' /proc/self/sched policy : 0 ext.enabled : 0 # ./set-scx 1092 setparam failed for 1092 (Permission denied) * v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to accommodate 32bit archs. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Suggested-by: Barret Rhoden <brho@google.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:21 +08:00
David Vernet	259e131da4	rue/scx/sched_ext: Implement runnable task stall watchdog Upstream: no The most common and critical way that a BPF scheduler can misbehave is by failing to run runnable tasks for too long. This patch implements a watchdog. * All tasks record when they become runnable. * A watchdog work periodically scans all runnable tasks. If any task has stayed runnable for too long, the BPF scheduler is aborted. * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the BPF scheduler is aborted. Because the watchdog only scans the tasks which are currently runnable and usually very infrequently, the overhead should be negligible. scx_qmap is updated so that it can be told to stall user and/or kernel tasks. A detected task stall looks like the following: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s) scx_check_timeout_workfn+0x10e/0x1b0 process_one_work+0x287/0x560 worker_thread+0x234/0x420 kthread+0xe9/0x100 ret_from_fork+0x1f/0x30 A detected watchdog stall: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (watchdog failed to check in for 5.001s) scheduler_tick+0x2eb/0x340 update_process_times+0x7a/0x90 tick_sched_timer+0xd8/0x130 __hrtimer_run_queues+0x178/0x3b0 hrtimer_interrupt+0xfc/0x390 __sysvec_apic_timer_interrupt+0xb7/0x2b0 sysvec_apic_timer_interrupt+0x90/0xb0 asm_sysvec_apic_timer_interrupt+0x1b/0x20 default_idle+0x14/0x20 arch_cpu_idle+0xf/0x20 default_idle_call+0x50/0x90 do_idle+0xe8/0x240 cpu_startup_entry+0x1d/0x20 kernel_init+0x0/0x190 start_kernel+0x0/0x392 start_kernel+0x324/0x392 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x104/0x109 secondary_startup_64_no_verify+0xce/0xdb Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to inline scx_notify_sched_tick(). v2: Julia Lawall noticed that the watchdog code was mixing msecs and jiffies. Fix by using jiffies for everything. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Julia Lawall <julia.lawall@inria.fr>	2024-03-29 11:22:21 +08:00
Tejun Heo	4b19dc9296	rue/scx/sched_ext: Add scx_simple and scx_example_qmap example schedulers Upstream: no Add two simple example BPF schedulers - simple and qmap. * simple: In terms of scheduling, it behaves identical to not having any operation implemented at all. The two operations it implements are only to improve visibility and exit handling. On certain homogeneous configurations, this actually can perform pretty well. * qmap: A fixed five level priority scheduler to demonstrate queueing PIDs on BPF maps for scheduling. While not very practical, this is useful as a simple example and will be used to demonstrate different features. v5: * Improve Makefile. Build artifects are now collected into a separate dir which change be changed. Install and help targets are added and clean actually cleans everything. * MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR() and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss. * Add scx_common.h which provides common utilities to user code such as SCX_BUG[_ON]() and RESIZE_ARRAY(). * Use SCX_BUG[_ON]() to simplify error handling. v4: * Dropped _example prefix from scheduler names. v3: * Rename scx_example_dummy to scx_example_simple and restructure a bit to ease later additions. Comment updates. * Added declarations for BPF inline iterators. In the future, hopefully, these will be consolidated into a generic BPF header so that they don't need to be replicated here. v2: * Updated with the generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-03-29 11:22:20 +08:00
David Vernet	991fb56f6c	selftests/bpf: Test pinning bpf timer to a core Upstream commit 0d7ae06860753bb30b3731302b994da071120d00 Now that we support pinning a BPF timer to the current core, we should test it with some selftests. This patch adds two new testcases to the timer suite, which verifies that a BPF timer both with and without BPF_F_TIMER_ABS, can be pinned to the calling core with BPF_F_TIMER_CPU_PIN. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-3-void@manifault.com	2024-03-27 18:09:18 +08:00
David Vernet	c12dbc1bdd	bpf: Add ability to pin bpf timer to calling CPU Upstream commit d6247ecb6c1e17d7a33317090627f5bfe563cbb2 BPF supports creating high resolution timers using bpf_timer_* helper functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which specifies that the timeout should be interpreted as absolute time. It would also be useful to be able to pin that timer to a core. For example, if you wanted to make a subset of cores run without timer interrupts, and only have the timer be invoked on a single core. This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag. When specified, the HRTIMER_MODE_PINNED flag is passed to hrtimer_start(). A subsequent patch will update selftests to validate. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hongbo Li <herberthbli@tencent.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com	2024-03-27 18:09:14 +08:00
Ido Schimmel	b6dfcdbc80	selftests: forwarding: Fix ping failure due to short timeout [ Upstream commit e4137851d4863a9bdc6aabc613bcb46c06d91e64 ] The tests send 100 pings in 0.1 second intervals and force a timeout of 11 seconds, which is borderline (especially on debug kernels), resulting in random failures in netdev CI [1]. Fix by increasing the timeout to 20 seconds. It should not prolong the test unless something is wrong, in which case the test will rightfully fail. [1] # selftests: net/forwarding: vxlan_bridge_1d_port_8472_ipv6.sh # INFO: Running tests with UDP port 8472 # TEST: ping: local->local [ OK ] # TEST: ping: local->remote 1 [FAIL] # Ping failed [...] Fixes: `b07e9957f2` ("selftests: forwarding: Add VxLAN tests with a VLAN-unaware bridge for IPv6") Fixes: `728b35259e` ("selftests: forwarding: Add VxLAN tests with a VLAN-aware bridge for IPv6") Reported-by: Paolo Abeni <pabeni@redhat.com> Closes: https://lore.kernel.org/netdev/24a7051fdcd1f156c3704bca39e4b3c41dfc7c4b.camel@redhat.com/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240320065717.4145325-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:20:13 -04:00
Christophe JAILLET	dfde5becad	perf pmu: Fix a potential memory leak in perf_pmu__lookup() [ Upstream commit ef5de1613d7d92bdc975e6beb34bb0fa94f34078 ] The commit in Fixes has reordered some code, but missed an error handling path. 'goto err' now, in order to avoid a memory leak in case of error. Fixes: `f63a536f03` ("perf pmu: Merge JSON events with sysfs at load time") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Ian Rogers <irogers@google.com> Cc: kernel-janitors@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/9538b2b634894c33168dfe9d848d4df31fd4d801.1693085544.git.christophe.jaillet@wanadoo.fr Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:19:52 -04:00
Mark Rutland	f88698d645	perf print-events: make is_event_supported() more robust [ Upstream commit 25412c0364f7110faa6053c73e3fd47ca956b8c3 ] Currently the perf tool doesn't detect support for extended event types on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE hardware events into per-PMU events. This is due to the detection of extended event types not handling mandatory filters required by the M1/M2 PMU driver. PMU drivers and the core perf_events code can require that perf_event_attr::exclude_* filters are configured in a specific way and may reject certain configurations of filters, for example: (a) Many PMUs lack support for any event filtering, and require all perf_event_attr::exclude_* bits to be clear. This includes Alpha's CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in ARMv7, (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core requires that perf_event_attr::exclude_kernel is set. (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is set as the hardware PMU does not count while a guest is running (but might be extended in future to do so). In is_event_supported(), we try to account for cases (a) and (b), first attempting to open an event without any filters, and if this fails, retrying with perf_event_attr::exclude_kernel set. We do not account for case (c), or any other filters that drivers could theoretically require to be set. Thus is_event_supported() will fail to detect support for any events targeting an Apple M1/M2 PMU, even where events would be supported with perf_event_attr:::exclude_guest set. Since commit: `82fe2e45cd` ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type") ... we use is_event_supported() to detect support for extended types, with the PMU ID encoded into the perf_event_attr::type. As above, on an Apple M1/M2 system this will always fail to detect that the event is supported, and consequently we fail to detect support for extended types even when these are supported, as they have been since commit: `5c81672865` ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability") Due to this, the perf tool will not automatically expand plain PERF_TYPE_HARDWARE events into per-PMU events, even when all the necessary kernel support is present. This patch updates is_event_supported() to additionally try opening events with perf_event_attr::exclude_guest set, allowing support for events to be detected on Apple M1/M2 systems. I believe that this is sufficient for all contemporary CPU PMU drivers, though in future it may be necessary to check for other combinations of filter bits. I've deliberately changed the check to not expect a specific error code for missing filters, as today ;the kernel may return a number of different error codes for missing filters (e.g. -EACCESS, -EINVAL, or -EOPNOTSUPP) depending on why and where the filter configuration is rejected, and retrying for any error is more robust. Note that this does not remove the need for commit: a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON") ... which is still necessary so that named-pmu/event/ events work on kernels without extended type support, even if the event name happens to be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for the M1/M2 PMU's 'cycles' and 'instructions' events). Fixes: `82fe2e45cd` ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Cc: Hector Martin <marcan@marcan.st> Cc: James Clark <james.clark@arm.com> Cc: John Garry <john.g.garry@oracle.com> Cc: Leo Yan <leo.yan@linux.dev> Cc: Mike Leach <mike.leach@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240126145605.1005472-1-mark.rutland@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:19:51 -04:00
Ian Rogers	0bbe598b58	perf metric: Don't remove scale from counts [ Upstream commit 6d6be5eb45b423a37d746d3ee0fd0c78f76ead9f ] Counts were switched from the scaled saved value form to the aggregated count to avoid double accounting. When this happened the removing of scaling for a count should have been removed, however, it wasn't and this wasn't observed as it normally doesn't matter because a counter's scale is 1. A problem was observed with RAPL events that are scaled. Fixes: `37cc8ad77c` ("perf metric: Directly use counts rather than saved_value") Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: James Clark <james.clark@arm.com> Cc: Kaige Ye <ye@kaige.org> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240209204947.3873294-5-irogers@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:19:48 -04:00
Ian Rogers	40ae9bb1dd	perf stat: Avoid metric-only segv [ Upstream commit 2543947c77e0e224bda86b4e7220c2f6714da463 ] Cycles is recognized as part of a hard coded metric in stat-shadow.c, it may call print_metric_only with a NULL fmt string leading to a segfault. Handle the NULL fmt explicitly. Fixes: `088519f318` ("perf stat: Move the display functions to stat-display.c") Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: James Clark <james.clark@arm.com> Cc: Kaige Ye <ye@kaige.org> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240209204947.3873294-4-irogers@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:19:48 -04:00

1 2 3 4 5 ...

37945 Commits