OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Valentin Schneider	475ea6c602	sched: Don't defer CPU pick to migration_cpu_stop() Will reported that the 'XXX __migrate_task() can fail' in migration_cpu_stop() can happen, and it is sort of a big deal. Looking at it some more, one will note there is a glaring hole in the deferred CPU selection: (w/ CONFIG_CPUSET=n, so that the affinity mask passed via taskset doesn't get AND'd with cpu_online_mask) $ taskset -pc 0-2 $PID # offline CPUs 3-4 $ taskset -pc 3-5 $PID `\ $PID may stay on 0-2 due to the cpumask_any_distribute() picking an offline CPU and __migrate_task() refusing to do anything due to cpu_is_allowed(). set_cpus_allowed_ptr() goes to some length to pick a dest_cpu that matches the right constraints vs affinity and the online/active state of the CPUs. Reuse that instead of discarding it in the affine_move_task() case. Fixes: `6d337eab04` ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210526205751.842360-2-valentin.schneider@arm.com	2021-06-01 16:00:11 +02:00
Odin Ugedal	08f7c2f4d0	sched/fair: Fix ascii art by relpacing tabs When using something other than 8 spaces per tab, this ascii art makes not sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al	2021-06-01 16:00:11 +02:00
Peter Zijlstra	15faafc6b4	sched,init: Fix DEBUG_PREEMPT vs early boot Extend `8fb12156b8` ("init: Pin init task to the boot CPU, initially") to cover the new PF_NO_SETAFFINITY requirement. While there, move wait_for_completion(&kthreadd_done) into kernel_init() to make it absolutely clear it is the very first thing done by the init thread. Fixes: `570a752b7a` ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net	2021-06-01 16:00:11 +02:00
Peter Zijlstra	7b419f47fa	sched: Add CONFIG_SCHED_CORE help text Hugh noted that the SCHED_CORE Kconfig option could do with a help text. Requested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Link: https://lkml.kernel.org/r/YKyhtwhEgvtUDOyl@hirez.programming.kicks-ass.net	2021-06-01 16:00:10 +02:00
Frederic Weisbecker	a8ea6fc9b0	sched: Stop PF_NO_SETAFFINITY from being inherited by various init system threads Commit: `00b89fe019` ("sched: Make the idle task quack like a per-CPU kthread") ... added PF_KTHREAD \| PF_NO_SETAFFINITY to the idle kernel threads. Unfortunately these properties are inherited to the init/0 children through kernel_thread() calls: init/1 and kthreadd. There are several side effects to that: 1) kthreadd affinity can not be reset anymore from userspace. Also PF_NO_SETAFFINITY propagates to all kthreadd children, including the unbound kthreads Therefore it's not possible anymore to overwrite the affinity of any of them. Here is an example of warning reported by rcutorture: WARNING: CPU: 0 PID: 116 at kernel/rcu/tree_nocb.h:1306 rcu_bind_current_to_nocb+0x31/0x40 Call Trace: rcu_torture_fwd_prog+0x62/0x730 kthread+0x122/0x140 ret_from_fork+0x22/0x30 2) init/1 does an exec() in the end which clears both PF_KTHREAD and PF_NO_SETAFFINITY so we are fine once kernel_init() escapes to userspace. But until then, no initcall or init code can successfully call sched_setaffinity() to init/1. Also PF_KTHREAD looks legit on init/1 before it calls exec() but we better be careful with unknown introduced side effects. One way to solve the PF_NO_SETAFFINITY issue is to not inherit this flag on copy_process() at all. The cases where it matters are: * fork_idle(): explicitly set the flag already. * fork() syscalls: userspace tasks that shouldn't be concerned by that. * create_io_thread(): the callers explicitly attribute the flag to the newly created tasks. * kernel_thread(): - Fix the issues on init/1 and kthreadd - Fix the issues on kthreadd children. - Usermode helper created by an unbound workqueue. This shouldn't matter. In the worst case it gives more control to userspace on setting affinity to these short living tasks although this can be tuned with inherited unbound workqueues affinity already. Fixes: `00b89fe019` ("sched: Make the idle task quack like a per-CPU kthread") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20210525235849.441842-1-frederic@kernel.org	2021-05-26 08:58:53 +02:00
Masahiro Yamada	1699949d33	sched: Fix a stale comment in pick_next_task() fair_sched_class->next no longer exists since commit: `a87e749e8f` ("sched: Remove struct sched_class::next field"). Now the sched_class order is specified by the linker script. Rewrite the comment in a more generic way. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210519063709.323162-1-masahiroy@kernel.org	2021-05-19 13:03:21 +02:00
Qais Yousef	93b7385870	sched/uclamp: Fix locking around cpu_util_update_eff() cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the uclamp_mutex or rcu_read_lock() like other call sites, which is a mistake. The uclamp_mutex is required to protect against concurrent reads and writes that could update the cgroup hierarchy. The rcu_read_lock() is required to traverse the cgroup data structures in cpu_util_update_eff(). Surround the caller with the required locks and add some asserts to better document the dependency in cpu_util_update_eff(). Fixes: `7226017ad3` ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com	2021-05-19 10:53:02 +02:00
Qais Yousef	0c18f2ecfc	sched/uclamp: Fix wrong implementation of cpu.uclamp.min cpu.uclamp.min is a protection as described in cgroup-v2 Resource Distribution Model Documentation/admin-guide/cgroup-v2.rst which means we try our best to preserve the minimum performance point of tasks in this group. See full description of cpu.uclamp.min in the cgroup-v2.rst. But the current implementation makes it a limit, which is not what was intended. For example: tg->cpu.uclamp.min = 20% p0->uclamp[UCLAMP_MIN] = 0 p1->uclamp[UCLAMP_MIN] = 50% Previous Behavior (limit): p0->effective_uclamp = 0 p1->effective_uclamp = 20% New Behavior (Protection): p0->effective_uclamp = 20% p1->effective_uclamp = 50% Which is inline with how protections should work. With this change the cgroup and per-task behaviors are the same, as expected. Additionally, we remove the confusing relationship between cgroup and !user_defined flag. We don't want for example RT tasks that are boosted by default to max to change their boost value when they attach to a cgroup. If a cgroup wants to limit the max performance point of tasks attached to it, then cpu.uclamp.max must be set accordingly. Or if they want to set different boost value based on cgroup, then sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max and set the right cpu.uclamp.min for each group to let the RT tasks obtain the desired boost value when attached to that group. As it stands the dependency on !user_defined flag adds an extra layer of complexity that is not required now cpu.uclamp.min behaves properly as a protection. The propagation model of effective cpu.uclamp.min in child cgroups as implemented by cpu_util_update_eff() is still correct. The parent protection sets an upper limit of what the child cgroups will effectively get. Fixes: `3eac870a32` (sched/uclamp: Use TG's clamps to restrict TASK's clamps) Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com	2021-05-19 10:53:02 +02:00
Yejune Deng	570a752b7a	lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed is_percpu_thread() more elegantly handles SMP vs UP, and further checks the presence of PF_NO_SETAFFINITY. This lets us catch cases where check_preemption_disabled() can race with a concurrent sched_setaffinity(). Signed-off-by: Yejune Deng <yejune.deng@gmail.com> [Amended changelog] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510151024.2448573-3-valentin.schneider@arm.com	2021-05-19 10:51:40 +02:00
Valentin Schneider	00b89fe019	sched: Make the idle task quack like a per-CPU kthread For all intents and purposes, the idle task is a per-CPU kthread. It isn't created via the same route as other pcpu kthreads however, and as a result it is missing a few bells and whistles: it fails kthread_is_per_cpu() and it doesn't have PF_NO_SETAFFINITY set. Fix the former by giving the idle task a kthread struct along with the KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle() call be called more than once on the same idle task. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510151024.2448573-2-valentin.schneider@arm.com	2021-05-18 12:53:53 +02:00
Mel Gorman	fcb5017045	delayacct: Document task_delayacct sysctl Update sysctl/kernel.rst. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210512114035.GH3672@suse.de	2021-05-18 12:53:53 +02:00
Peter Zijlstra	90a0ff4ec9	sched,stats: Further simplify sched_info There's no point doing delta==0 updates. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2021-05-18 12:53:53 +02:00
Paul Gortmaker	915a2bc3c6	sched/isolation: Reconcile rcu_nocbs= and nohz_full= We have a mismatch between RCU and isolation -- in relation to what is considered the maximum valid CPU number. This matters because nohz_full= and rcu_nocbs= are joined at the hip; in fact the former will enforce the latter. So we don't want a CPU mask to be valid for one and denied for the other. The difference 1st appeared as of v4.15; further details are below. As it is confusing to anyone who isn't looking at the code regularly, a reminder is in order; three values exist here: CONFIG_NR_CPUS - compiled in maximum cap on number of CPUs supported. nr_cpu_ids - possible # of CPUs (typically reflects what ACPI says) cpus_present - actual number of present/detected/installed CPUs. For this example, I'll refer to NR_CPUS=64 from "make defconfig" and nr_cpu_ids=6 for ACPI reporting on a board that could run a six core, and present=4 for a quad that is physically in the socket. From dmesg: smpboot: Allowing 6 CPUs, 2 hotplug CPUs setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:6 nr_node_ids:1 rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=6. smp: Brought up 1 node, 4 CPUs And from userspace, see: paul@trash:/sys/devices/system/cpu$ cat present 0-3 paul@trash:/sys/devices/system/cpu$ cat possible 0-5 paul@trash:/sys/devices/system/cpu$ cat kernel_max 63 Everything is fine if we boot 5x5 for rcu/nohz: Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-5 rcu_nocbs=2-5 root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. ..even though there is no CPU 4 or 5. Both RCU and nohz_full are OK. Now we push that > 6 but less than NR_CPU and with 15x15 we get: Command line: BOOT_IMAGE=/boot/bzImage rcu_nocbs=2-15 nohz_full=2-15 root=/dev/sda1 ro rcu: Note: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs. rcu: Offload RCU callbacks from CPUs: 2-5. These are both functionally equivalent, as we are only changing flags on phantom CPUs that don't exist, but note the kernel interpretation changes. And worse, it only changes for one of the two - which is the problem. RCU doesn't care if you want to restrict the flags on phantom CPUs but clearly nohz_full does after this change from v4.15. edb9382175c3: ("sched/isolation: Move isolcpus= handling to the housekeeping code") - if (cpulist_parse(str, non_housekeeping_mask) < 0) { - pr_warn("Housekeeping: Incorrect nohz_full cpumask\n"); + err = cpulist_parse(str, non_housekeeping_mask); + if (err < 0 \|\| cpumask_last(non_housekeeping_mask) >= nr_cpu_ids) { + pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n"); To be clear, the sanity check on "possible" (nr_cpu_ids) is new here. The goal was reasonable ; not wanting housekeeping to land on a not-possible CPU, but note two things: 1) this is an exclusion list, not an inclusion list; we are tracking non_housekeeping CPUs; not ones who are explicitly assigned housekeeping 2) we went one further in `9219565aa8` ("sched/isolation: Require a present CPU in housekeeping mask") - ensuring that housekeeping was sanity checking against present and not just possible CPUs. To be clear, this means the check added in v4.15 is doubly redundant. And more importantly, overly strict/restrictive. We care now, because the bitmap boot arg parsing now knows that a value of "N" is NR_CPUS; the size of the bitmap, but the bitmap code doesn't know anything about the subtleties of our max/possible/present CPU specifics as outlined above. So drop the check added in v4.15 (`edb9382175`) and make RCU and nohz_full both in alignment again on NR_CPUS so "N" works for both, and then they can fall back to nr_cpu_ids internally just as before. Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-N rcu_nocbs=2-N root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. As shown above, with this change, RCU and nohz_full are in sync, even with the use of the "N" placeholder. Same result is achieved with "15". Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20210419042659.1134916-1-paul.gortmaker@windriver.com	2021-05-13 14:12:47 +02:00
Alexey Dobriyan	e6fe3f422b	sched: Make multiple runqueue task counters 32-bit Make: struct dl_rq::dl_nr_migratory struct dl_rq::dl_nr_running struct rt_rq::rt_nr_boosted struct rt_rq::rt_nr_migratory struct rt_rq::rt_nr_total struct rq::nr_uninterruptible 32-bit. If total number of tasks can't exceed 2**32 (and less due to futex pid limits), then per-runqueue counters can't as well. This patchset has been sponsored by REX Prefix Eradication Society. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-4-adobriyan@gmail.com	2021-05-12 21:34:17 +02:00
Alexey Dobriyan	8fc2858e57	sched: Make nr_iowait_cpu() return 32-bit value Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com	2021-05-12 21:34:16 +02:00
Alexey Dobriyan	9745516841	sched: Make nr_iowait() return 32-bit value Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-2-adobriyan@gmail.com	2021-05-12 21:34:15 +02:00
Alexey Dobriyan	01aee8fd7f	sched: Make nr_running() return 32-bit value Creating 2**32 tasks is impossible due to futex pid limits and wasteful anyway. Nobody has done it. Bring nr_running() into 32-bit world to save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-1-adobriyan@gmail.com	2021-05-12 21:34:14 +02:00
Ingo Molnar	cc00c19888	sched: Fix leftover comment typos A few more snuck in. Also capitalize 'CPU' while at it. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-05-12 19:54:49 +02:00
Valentin Schneider	f1a0a376ca	sched/core: Initialize the idle task with preemption disabled As pointed out by commit `de9b8f5dcb` ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all possible CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com	2021-05-12 13:01:45 +02:00
Chris Hyser	9f26990074	kselftest: Add test for core sched prctl interface Provides a selftest and examples of using the interface. [peterz: updated to not use sched_debug] Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123309.100860030@infradead.org	2021-05-12 11:43:32 +02:00
Chris Hyser	7ac592aa35	sched: prctl() core-scheduling interface This patch provides support for setting and copying core scheduling 'task cookies' between threads (PID), processes (TGID), and process groups (PGID). The value of core scheduling isn't that tasks don't share a core, 'nosmt' can do that. The value lies in exploiting all the sharing opportunities that exist to recover possible lost performance and that requires a degree of flexibility in the API. From a security perspective (and there are others), the thread, process and process group distinction is an existent hierarchal categorization of tasks that reflects many of the security concerns about 'data sharing'. For example, protecting against cache-snooping by a thread that can just read the memory directly isn't all that useful. With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a mechanism to create and share cookies. CREATE/SHARE_TO specify a target pid with enum pidtype used to specify the scope of the targeted tasks. For example, PIDTYPE_TGID will share the cookie with the process and all of it's threads as typically desired in a security scenario. API: prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie) prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL) where 'tgtpid/srcpid == 0' implies the current process and pidtype is kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}. For return values, EINVAL, ENOMEM are what they say. ESRCH means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access to tgtpid/srcpid. ENODEV indicates your machines lacks SMT. [peterz: complete rewrite] Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org	2021-05-12 11:43:31 +02:00
Peter Zijlstra	85dd3f6120	sched: Inherit task cookie on fork() Note that sched_core_fork() is called from under tasklist_lock, and not from sched_fork() earlier. This avoids a few races later. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.980003687@infradead.org	2021-05-12 11:43:31 +02:00
Peter Zijlstra	6e33cad0af	sched: Trivial core scheduling cookie management In order to not have to use pid_struct, create a new, smaller, structure to manage task cookies for core scheduling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org	2021-05-12 11:43:31 +02:00
Aubrey Li	97886d9dcd	sched: Migration changes for core scheduling - Don't migrate if there is a cookie mismatch Load balance tries to move task from busiest CPU to the destination CPU. When core scheduling is enabled, if the task's cookie does not match with the destination CPU's core cookie, this task may be skipped by this CPU. This mitigates the forced idle time on the destination CPU. - Select cookie matched idle CPU In the fast path of task wakeup, select the first cookie matched idle CPU instead of the first idle CPU. - Find cookie matched idlest CPU In the slow path of task wakeup, find the idlest CPU whose core cookie matches with task's cookie Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.860083871@infradead.org	2021-05-12 11:43:30 +02:00
Peter Zijlstra	d2dfa17bc7	sched: Trivial forced-newidle balancer When a sibling is forced-idle to match the core-cookie; search for matching tasks to fill the core. rcu_read_unlock() can incur an infrequent deadlock in sched_core_balance(). Fix this by using the RCU-sched flavor instead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.800048269@infradead.org	2021-05-12 11:43:30 +02:00
Joel Fernandes (Google)	c6047c2e3a	sched/fair: Snapshot the min_vruntime of CPUs on force idle During force-idle, we end up doing cross-cpu comparison of vruntimes during pick_next_task. If we simply compare (vruntime-min_vruntime) across CPUs, and if the CPUs only have 1 task each, we will always end up comparing 0 with 0 and pick just one of the tasks all the time. This starves the task that was not picked. To fix this, take a snapshot of the min_vruntime when entering force idle and use it for comparison. This min_vruntime snapshot will only be used for cross-CPU vruntime comparison, and nothing else. A note about the min_vruntime snapshot and force idling: During selection: When we're not fi, we need to update snapshot. when we're fi and we were not fi, we must update snapshot. When we're fi and we were already fi, we must not update snapshot. Which gives: fib fi update 0 0 1 0 1 1 1 0 1 1 1 0 Where: fi: force-idled now fib: force-idled before So the min_vruntime snapshot needs to be updated when: !(fib && fi). Also, the cfs_prio_less() function needs to be aware of whether the core is in force idle or not, since it will be use this information to know whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass this information along via pick_task() -> prio_less(). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.738542617@infradead.org	2021-05-12 11:43:29 +02:00
Joel Fernandes (Google)	7afbba119f	sched: Fix priority inversion of cookied task with sibling The rationale is as follows. In the core-wide pick logic, even if need_sync == false, we need to go look at other CPUs (non-local CPUs) to see if they could be running RT. Say the RQs in a particular core look like this: Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task. rq0 rq1 CFS1 (tagged) RT1 (no tag) CFS2 (tagged) Say schedule() runs on rq0. Now, it will enter the above loop and pick_task(RT) will return NULL for 'p'. It will enter the above if() block and see that need_sync == false and will skip RT entirely. The end result of the selection will be (say prio(CFS1) > prio(CFS2)): rq0 rq1 CFS1 IDLE When it should have selected: rq0 rq1 IDLE RT Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.678425748@infradead.org	2021-05-12 11:43:29 +02:00
Vineeth Pillai	8039e96fcc	sched/fair: Fix forced idle sibling starvation corner case If there is only one long running local task and the sibling is forced idle, it might not get a chance to run until a schedule event happens on any cpu in the core. So we check for this condition during a tick to see if a sibling is starved and then give it a chance to schedule. Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.617407840@infradead.org	2021-05-12 11:43:29 +02:00
Peter Zijlstra	539f65125d	sched: Add core wide task selection and scheduling Instead of only selecting a local task, select a task for all SMT siblings for every reschedule on the core (irrespective which logical CPU does the reschedule). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.557559654@infradead.org	2021-05-12 11:43:28 +02:00
Peter Zijlstra	8a311c740b	sched: Basic tracking of matching tasks Introduce task_struct::core_cookie as an opaque identifier for core scheduling. When enabled; core scheduling will only allow matching task to be on the core; where idle matches everything. When task_struct::core_cookie is set (and core scheduling is enabled) these tasks are indexed in a second RB-tree, first on cookie value then on scheduling function, such that matching task selection always finds the most elegible match. NOTE: shudder at the overhead... NOTE: sigh, a 3rd copy of the scheduling function; the alternative is per class tracking of cookies and that just duplicates a lot of stuff for no raisin (the 2nd copy lives in the rt-mutex PI code). [Joel: folded fixes] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.496975854@infradead.org	2021-05-12 11:43:28 +02:00
Peter Zijlstra	21f56ffe44	sched: Introduce sched_class::pick_task() Because sched_class::pick_next_task() also implies sched_class::set_next_task() (and possibly put_prev_task() and newidle_balance) it is not state invariant. This makes it unsuitable for remote task selection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Vineeth: folded fixes] Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org	2021-05-12 11:43:28 +02:00
Peter Zijlstra	875feb41fd	sched: Allow sched_core_put() from atomic context Stuff the meat of sched_core_put() into a work such that we can use sched_core_put() from atomic context. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.377455632@infradead.org	2021-05-12 11:43:27 +02:00
Peter Zijlstra	9ef7e7e33b	sched: Optimize rq_lockp() usage rq_lockp() includes a static_branch(), which is asm-goto, which is asm volatile which defeats regular CSE. This means that: if (!static_branch(&foo)) return simple; if (static_branch(&foo) && cond) return complex; Doesn't fold and we get horrible code. Introduce __rq_lockp() without the static_branch() on. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.316696988@infradead.org	2021-05-12 11:43:27 +02:00
Peter Zijlstra	9edeaea1bc	sched: Core-wide rq->lock Introduce the basic infrastructure to have a core wide rq->lock. This relies on the rq->__lock order being in increasing CPU number (inside a core). It is also constrained to SMT8 per lockdep (and SMT256 per preempt_count). Luckily SMT8 is the max supported SMT count for Linux (Mips, Sparc and Power are known to have this). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/YJUNfzSgptjX7tG6@hirez.programming.kicks-ass.net	2021-05-12 11:43:27 +02:00
Peter Zijlstra	d66f1b06b5	sched: Prepare for Core-wide rq->lock When switching on core-sched, CPUs need to agree which lock to use for their RQ. The new rule will be that rq->core_enabled will be toggled while holding all rq->__locks that belong to a core. This means we need to double check the rq->core_enabled value after each lock acquire and retry if it changed. This also has implications for those sites that take multiple RQ locks, they need to be careful that the second lock doesn't end up being the first lock. Verify the lock pointer after acquiring the first lock, because if they're on the same core, holding any of the rq->__lock instances will pin the core state. While there, change the rq->__lock order to CPU number, instead of rq address, this greatly simplifies the next patch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/YJUNY0dmrJMD/BIm@hirez.programming.kicks-ass.net	2021-05-12 11:43:26 +02:00
Peter Zijlstra	5cb9eaa3d2	sched: Wrap rq::lock access In preparation of playing games with rq->lock, abstract the thing using an accessor. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org	2021-05-12 11:43:26 +02:00
Peter Zijlstra	39d371b7c0	sched: Provide raw_spin_rq_lock() helpers In prepration for playing games with rq->lock, add some rq_lock wrappers. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.075967879@infradead.org	2021-05-12 11:43:26 +02:00
Peter Zijlstra	9099a14708	sched/fair: Add a few assertions Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.015639083@infradead.org	2021-05-12 11:43:26 +02:00
Peter Zijlstra	0cd7c741f0	delayacct: Add sysctl to enable at runtime Just like sched_schedstats, allow runtime enabling (and disabling) of delayacct. This is useful if one forgot to add the delayacct boot time option. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YJkhebGJAywaZowX@hirez.programming.kicks-ass.net	2021-05-12 11:43:25 +02:00
Peter Zijlstra	e4042ad492	delayacct: Default disabled Assuming this stuff isn't actually used much; disable it by default and avoid allocating and tracking the task_delay_info structure. taskstats is changed to still report the regular sched and sched_info and only skip the missing task_delay_info fields instead of not reporting anything. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210505111525.308018373@infradead.org	2021-05-12 11:43:25 +02:00
Peter Zijlstra	eee4d9fee2	delayacct: Add static_branch in scheduler hooks Cheaper when delayacct is disabled. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210505111525.248028369@infradead.org	2021-05-12 11:43:25 +02:00
Peter Zijlstra	63b3f96e1a	kvm: Select SCHED_INFO instead of TASK_DELAY_ACCT AFAICT KVM only relies on SCHED_INFO. Nothing uses the p->delays data that belongs to TASK_DELAY_ACCT. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Link: https://lkml.kernel.org/r/20210505111525.187225172@infradead.org	2021-05-12 11:43:24 +02:00
Peter Zijlstra	c5895d3f06	sched: Simplify sched_info_on() The situation around sched_info is somewhat complicated, it is used by sched_stats and delayacct and, indirectly, kvm. If SCHEDSTATS=Y (but disabled by default) sched_info_on() is unconditionally true -- this is the case for all distro kernel configs I checked. If for some reason SCHEDSTATS=N, but TASK_DELAY_ACCT=Y, then sched_info_on() can return false when delayacct is disabled, presumably because there would be no other users left; except kvm is. Instead of complicating matters further by accurately accounting sched_stat and kvm state, simply unconditionally enable when SCHED_INFO=Y, matching the common distro case. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210505111525.121458839@infradead.org	2021-05-12 11:43:24 +02:00
Peter Zijlstra	4e29fb7098	sched: Rename sched_info_{queued,dequeued} For consistency, rename {queued,dequeued} to {enqueue,dequeue}. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Link: https://lkml.kernel.org/r/20210505111525.061402904@infradead.org	2021-05-12 11:43:24 +02:00
Peter Zijlstra	4b7a08a0b6	delayacct: Use sched_clock() Like all scheduler statistics, use sched_clock() based time. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Link: https://lkml.kernel.org/r/20210505111525.001031466@infradead.org	2021-05-12 11:43:23 +02:00
Pierre Gondois	619e090c8e	sched/fair: Fix negative energy delta in find_energy_efficient_cpu() find_energy_efficient_cpu() (feec()) searches the best energy CPU to place a task on. To do so, compute_energy() estimates the energy impact of placing the task on a CPU, based on CPU and task utilization signals. Utilization signals can be concurrently updated while evaluating a performance domain (pd). In some cases, this leads to having a 'negative delta', i.e. placing the task in the pd is seen as an energy gain. Thus, any further energy comparison is biased. In case of a 'negative delta', return prev_cpu since: 1. a 'negative delta' happens in less than 0.5% of feec() calls, on a Juno with 6 CPUs (4 little, 2 big) 2. it is unlikely to have two consecutive 'negative delta' for a task, so if the first call fails, feec() will correctly place the task in the next feec() call 3. EAS current behavior tends to select prev_cpu if the task doesn't raise the OPP of its current pd. prev_cpu is EAS's generic decision 4. prev_cpu should be preferred to returning an error code. In the latter case, select_idle_sibling() would do the placement, selecting a big (and not energy efficient) CPU. As 3., the task would potentially reside on the big CPU for a long time Reported-by: Xuewen Yan <xuewen.yan@unisoc.com> Suggested-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Link: https://lkml.kernel.org/r/20210504090743.9688-3-Pierre.Gondois@arm.com	2021-05-12 11:43:23 +02:00
Pierre Gondois	8d4c97c105	sched/fair: Only compute base_energy_pd if necessary find_energy_efficient_cpu() searches the best energy CPU to place a task on. To do so, the energy of each performance domain (pd) is computed w/ and w/o the task placed on it. The energy of a pd w/o the task (base_energy_pd) is computed prior knowing whether a CPU is available in the pd. Move the base_energy_pd computation after looping through the CPUs of a pd and only compute it if at least one CPU is available. Suggested-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Link: https://lkml.kernel.org/r/20210504090743.9688-2-Pierre.Gondois@arm.com	2021-05-12 11:43:23 +02:00
Rik van Riel	e5e678e4fe	sched,fair: Skip newidle_balance if a wakeup is pending The try_to_wake_up function has an optimization where it can queue a task for wakeup on its previous CPU, if the task is still in the middle of going to sleep inside schedule(). Once schedule() re-enables IRQs, the task will be woken up with an IPI, and placed back on the runqueue. If we have such a wakeup pending, there is no need to search other CPUs for runnable tasks. Just skip (or bail out early from) newidle balancing, and run the just woken up task. For a memcache like workload test, this reduces total CPU use by about 2%, proportionally split between user and system time, and p99 and p95 application response time by 10% on average. The schedstats run_delay number shows a similar improvement. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210422130236.0bb353df@imladris.surriel.com	2021-05-12 11:43:23 +02:00
Oleg Nesterov	2b8ca1a907	sched/core: Remove the pointless BUG_ON(!task) from wake_up_q() container_of() can never return NULL - so don't check for it pointlessly. [ mingo: Twiddled the changelog. ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210510161522.GA32644@redhat.com	2021-05-12 11:03:54 +02:00
Gautham R. Shenoy	02dbb7246c	sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu() In commit: `9fe1f127b9` ("sched/fair: Merge select_idle_core/cpu()") in select_idle_cpu(), we check if an idle core is present in the LLC of the target CPU via the flag "has_idle_cores". We look for the idle core in select_idle_cores(). If select_idle_cores() isn't able to find an idle core/CPU, we need to unset the has_idle_cores flag in the LLC of the target to prevent other CPUs from going down this route. However, the current code is unsetting it in the LLC of the current CPU instead of the target CPU. This patch fixes this issue. Fixes: `9fe1f127b9` ("sched/fair: Merge select_idle_core/cpu()") Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/1620746169-13996-1-git-send-email-ego@linux.vnet.ibm.com	2021-05-12 10:41:28 +02:00

1 2 3 4 5 ...

1013578 Commits All Branches Search

1013578 Commits

All Branches