Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler are getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
This commit is contained in:
Linus Torvalds 2019-09-16 17:25:49 -07:00
commit 7e67a85999
60 changed files with 1274 additions and 595 deletions


@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup. Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
 	Shows pressure stall information for CPU. See
 	Documentation/accounting/psi.rst for details.
 
+  cpu.uclamp.min
+	A read-write single value file which exists on non-root cgroups.
+	The default is "0", i.e. no utilization boosting.
+
+	The requested minimum utilization (protection) as a percentage
+	rational number, e.g. 12.34 for 12.34%.
+
+	This interface allows reading and setting minimum utilization clamp
+	values similar to the sched_setattr(2). This minimum utilization
+	value is used to clamp the task specific minimum utilization clamp.
+
+	The requested minimum utilization (protection) is always capped by
+	the current value for the maximum utilization (limit), i.e.
+	`cpu.uclamp.max`.
+
+  cpu.uclamp.max
+	A read-write single value file which exists on non-root cgroups.
+	The default is "max", i.e. no utilization capping.
+
+	The requested maximum utilization (limit) as a percentage rational
+	number, e.g. 98.76 for 98.76%.
+
+	This interface allows reading and setting maximum utilization clamp
+	values similar to the sched_setattr(2). This maximum utilization
+	value is used to clamp the task specific maximum utilization clamp.
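As a rough sketch of how the percentage interface above interacts with the clamps: the helper below converts the `cpu.uclamp.*` percentage strings and applies the "min is capped by max" rule. The mapping onto a 0..1024 utilization scale (as used by the sched_setattr(2) clamps) is an assumption for illustration, not the literal kernel code.

```python
# Illustrative sketch only: the 0..1024 internal scale is an assumption here.

def pct_to_util(pct_str: str) -> int:
    """Convert a cpu.uclamp.* value ("max" or e.g. "12.34" for 12.34%)
    to an approximate utilization value on an assumed 0..1024 scale."""
    if pct_str == "max":
        return 1024
    pct = float(pct_str)
    if not 0.0 <= pct <= 100.0:
        raise ValueError("percentage must be within [0, 100]")
    return round(pct * 1024 / 100)

def effective_min(requested_min: str, current_max: str) -> int:
    """cpu.uclamp.min (protection) is always capped by cpu.uclamp.max (limit)."""
    return min(pct_to_util(requested_min), pct_to_util(current_max))

print(effective_min("12.34", "max"))   # boost honoured: 126
print(effective_min("50", "25"))       # capped by the max clamp: 256
```

The second call shows the capping rule from the text: a 50% minimum request under a 25% maximum clamp is limited to the maximum.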
 Memory
 ------


@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned, any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
 
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -35,12 +36,12 @@ The default values are::
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
 bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
 The minimum quota allowed for the quota or period is 1ms. There is also an
 upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
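The quota/period relationship and its limits can be sketched numerically. The helper below is purely illustrative (it is not part of the kernel interface); it computes a quota value granting a given number of CPUs' worth of runtime per period, enforcing the 1ms minimum and 1s period bound stated above.

```python
# Illustrative helper, not kernel code: derive cpu.cfs_quota_us values.

MIN_QUOTA_US = 1000        # minimum for quota or period is 1ms
MAX_PERIOD_US = 1000000    # upper bound on the period length is 1s

def cfs_quota_us(n_cpus: float, period_us: int = 100000) -> int:
    """Quota granting n_cpus worth of runtime per cpu.cfs_period_us."""
    if not MIN_QUOTA_US <= period_us <= MAX_PERIOD_US:
        raise ValueError("cpu.cfs_period_us must be within [1ms, 1s]")
    quota = int(n_cpus * period_us)
    if quota < MIN_QUOTA_US:
        raise ValueError("resulting quota is below the 1ms minimum")
    return quota

# 1 CPU worth of runtime with a 250ms period:
print(cfs_quota_us(1, 250000))    # 250000
# Half a CPU with a 100ms period:
print(cfs_quota_us(0.5))          # 50000
```

The returned value would be written to cpu.cfs_quota_us alongside the chosen cpu.cfs_period_us.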
@@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 
 For efficiency run-time is transferred between the global pool and CPU local
 "silos" in a batch fashion. This greatly reduces global accounting pressure
 on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs::
@@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+
+Once a slice is assigned to a cpu it does not expire. However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period. This
+also limits the burst ability to no more than 1ms per cpu. This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
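The worst-case burst described in the caveats above is easy to bound with a little arithmetic; the sketch below (illustrative numbers, not kernel code) adds the up-to-1ms of unexpired leftover slice per cpu on top of one period's quota.

```python
# Illustrative sketch of the non-expiring-slice burst bound described above.

MIN_CFS_RQ_RUNTIME_US = 1000   # at most 1ms of leftover slice lingers per cpu

def worst_case_burst_us(n_cpus: int) -> int:
    """Unexpired leftover slice the group could add on top of its quota,
    at most min_cfs_rq_runtime on each cpu it previously ran on."""
    return n_cpus * MIN_CFS_RQ_RUNTIME_US

def max_usage_in_one_period_us(quota_us: int, n_cpus: int) -> int:
    """One period's quota plus any unexpired leftovers from prior periods."""
    return quota_us + worst_case_burst_us(n_cpus)

# A 10ms-quota group that previously ran on 8 cpus may briefly see:
print(max_usage_in_one_period_us(10000, 8))   # 18000 (us in one period)
```

Averaged over multiple periods the group remains strictly limited to its quota; only a single period can exceed it by this bounded amount.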
 Examples
 --------
 
 1. Limit a group to 1 CPU worth of runtime::


@@ -12578,6 +12578,7 @@ PERFORMANCE EVENTS SUBSYSTEM
 M:	Peter Zijlstra <peterz@infradead.org>
 M:	Ingo Molnar <mingo@redhat.com>
 M:	Arnaldo Carvalho de Melo <acme@kernel.org>
+R:	Mark Rutland <mark.rutland@arm.com>
 R:	Alexander Shishkin <alexander.shishkin@linux.intel.com>
 R:	Jiri Olsa <jolsa@redhat.com>
 R:	Namhyung Kim <namhyung@kernel.org>
@@ -14175,6 +14176,12 @@ F:	drivers/watchdog/sc1200wdt.c
 SCHEDULER
 M:	Ingo Molnar <mingo@redhat.com>
 M:	Peter Zijlstra <peterz@infradead.org>
+M:	Juri Lelli <juri.lelli@redhat.com> (SCHED_DEADLINE)
+M:	Vincent Guittot <vincent.guittot@linaro.org> (SCHED_NORMAL)
+R:	Dietmar Eggemann <dietmar.eggemann@arm.com> (SCHED_NORMAL)
+R:	Steven Rostedt <rostedt@goodmis.org> (SCHED_FIFO/SCHED_RR)
+R:	Ben Segall <bsegall@google.com> (CONFIG_CFS_BANDWIDTH)
+R:	Mel Gorman <mgorman@suse.de> (CONFIG_NUMA_BALANCING)
 L:	linux-kernel@vger.kernel.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
 S:	Maintained


@@ -106,7 +106,7 @@ config STATIC_KEYS_SELFTEST
 config OPTPROBES
 	def_bool y
 	depends on KPROBES && HAVE_OPTPROBES
-	select TASKS_RCU if PREEMPT
+	select TASKS_RCU if PREEMPTION
 
 config KPROBES_ON_FTRACE
 	def_bool y


@@ -311,6 +311,7 @@ config ARCH_DISCONTIGMEM_DEFAULT
 config NUMA
 	bool "NUMA support"
 	depends on !FLATMEM
+	select SMP
 	help
 	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
 	  Access).  This option is for configuring high-end multiprocessor


@@ -63,7 +63,7 @@
  * enough to patch inline, increasing performance.
  */
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
@@ -1084,7 +1084,7 @@ restore_all:
 	INTERRUPT_RETURN
 
 restore_all_kernel:
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	.Lno_preempt
@@ -1364,7 +1364,7 @@ ENTRY(xen_hypervisor_callback)
 ENTRY(xen_do_upcall)
 1:	mov	%esp, %eax
 	call	xen_evtchn_do_upcall
-#ifndef CONFIG_PREEMPT
+#ifndef CONFIG_PREEMPTION
 	call	xen_maybe_preempt_hcall
 #endif
 	jmp	ret_from_intr


@@ -664,7 +664,7 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 
 /* Returning to kernel space */
 retint_kernel:
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	/* Interrupts are off */
 	/* Check if we need preemption */
 	btl	$9, EFLAGS(%rsp)	/* were interrupts off? */
@@ -1115,7 +1115,7 @@ ENTRY(xen_do_hypervisor_callback)	/* do_hypervisor_callback(struct *pt_regs) */
 	call	xen_evtchn_do_upcall
 	LEAVE_IRQ_STACK
 
-#ifndef CONFIG_PREEMPT
+#ifndef CONFIG_PREEMPTION
 	call	xen_maybe_preempt_hcall
 #endif
 	jmp	error_exit


@@ -34,7 +34,7 @@
 	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
 #endif
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	THUNK ___preempt_schedule, preempt_schedule
 	THUNK ___preempt_schedule_notrace, preempt_schedule_notrace
 	EXPORT_SYMBOL(___preempt_schedule)


@@ -46,7 +46,7 @@
 	THUNK lockdep_sys_exit_thunk,lockdep_sys_exit
 #endif
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	THUNK ___preempt_schedule, preempt_schedule
 	THUNK ___preempt_schedule_notrace, preempt_schedule_notrace
 	EXPORT_SYMBOL(___preempt_schedule)
@@ -55,7 +55,7 @@
 #if defined(CONFIG_TRACE_IRQFLAGS) \
  || defined(CONFIG_DEBUG_LOCK_ALLOC) \
- || defined(CONFIG_PREEMPT)
+ || defined(CONFIG_PREEMPTION)
 .L_restore:
 	popq %r11
 	popq %r10


@@ -102,7 +102,7 @@ static __always_inline bool should_resched(int preempt_offset)
 	return unlikely(raw_cpu_read_4(__preempt_count) == preempt_offset);
 }
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 extern asmlinkage void ___preempt_schedule(void);
 # define __preempt_schedule() \
 	asm volatile ("call ___preempt_schedule" : ASM_CALL_CONSTRAINT)


@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/random.h>
+#include <linux/topology.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cacheinfo.h>
@@ -889,6 +890,10 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+#ifdef CONFIG_NUMA
+	node_reclaim_distance = 32;
+#endif
+
 	/*
 	 * Fix erratum 1076: CPB feature bit not being set in CPUID.
 	 * Always set it, except when running under a hypervisor.


@@ -367,13 +367,18 @@ NOKPROBE_SYMBOL(oops_end);
 
 int __die(const char *str, struct pt_regs *regs, long err)
 {
+	const char *pr = "";
+
 	/* Save the regs of the first oops for the executive summary later. */
 	if (!die_counter)
 		exec_summary_regs = *regs;
 
+	if (IS_ENABLED(CONFIG_PREEMPTION))
+		pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";
+
 	printk(KERN_DEFAULT
 	       "%s: %04lx [#%d]%s%s%s%s%s\n", str, err & 0xffff, ++die_counter,
-	       IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "",
+	       pr,
 	       IS_ENABLED(CONFIG_SMP)     ? " SMP"     : "",
 	       debug_pagealloc_enabled()  ? " DEBUG_PAGEALLOC" : "",
 	       IS_ENABLED(CONFIG_KASAN)   ? " KASAN"   : "",


@@ -580,7 +580,7 @@ static void setup_singlestep(struct kprobe *p, struct pt_regs *regs,
 	if (setup_detour_execution(p, regs, reenter))
 		return;
 
-#if !defined(CONFIG_PREEMPT)
+#if !defined(CONFIG_PREEMPTION)
 	if (p->ainsn.boostable && !p->post_handler) {
 		/* Boost up -- we can execute copied instructions directly */
 		if (!reenter)


@@ -311,7 +311,7 @@ static void kvm_guest_cpu_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
 		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
 #endif
 		pa |= KVM_ASYNC_PF_ENABLED;


@@ -78,11 +78,11 @@ static __always_inline bool should_resched(int preempt_offset)
 			tif_need_resched());
 }
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 extern asmlinkage void preempt_schedule(void);
 #define __preempt_schedule() preempt_schedule()
 extern asmlinkage void preempt_schedule_notrace(void);
 #define __preempt_schedule_notrace() preempt_schedule_notrace()
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
 #endif /* __ASM_PREEMPT_H */


@@ -150,6 +150,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
+void cgroup_enable_task_cg_lists(void);
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);


@@ -40,14 +40,14 @@ static inline bool cpusets_enabled(void)
 
 static inline void cpuset_inc(void)
 {
-	static_branch_inc(&cpusets_pre_enable_key);
-	static_branch_inc(&cpusets_enabled_key);
+	static_branch_inc_cpuslocked(&cpusets_pre_enable_key);
+	static_branch_inc_cpuslocked(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
-	static_branch_dec(&cpusets_enabled_key);
-	static_branch_dec(&cpusets_pre_enable_key);
+	static_branch_dec_cpuslocked(&cpusets_enabled_key);
+	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
 extern int cpuset_init(void);
@@ -55,6 +55,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void cpuset_read_lock(void);
+extern void cpuset_read_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -176,6 +178,9 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void cpuset_read_lock(void) { }
+static inline void cpuset_read_unlock(void) { }
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {


@@ -182,7 +182,7 @@ do { \
 
 #define preemptible()	(preempt_count() == 0 && !irqs_disabled())
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 #define preempt_enable() \
 do { \
 	barrier(); \
@@ -203,7 +203,7 @@ do { \
 		__preempt_schedule(); \
 } while (0)
 
-#else /* !CONFIG_PREEMPT */
+#else /* !CONFIG_PREEMPTION */
 #define preempt_enable() \
 do { \
 	barrier(); \
@@ -217,7 +217,7 @@ do { \
 } while (0)
 #define preempt_check_resched() do { } while (0)
 
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
 #define preempt_disable_notrace() \
 do { \


@@ -585,7 +585,7 @@ do { \
  *
  * In non-preemptible RCU implementations (TREE_RCU and TINY_RCU),
  * it is illegal to block while in an RCU read-side critical section.
- * In preemptible RCU implementations (PREEMPT_RCU) in CONFIG_PREEMPT
+ * In preemptible RCU implementations (PREEMPT_RCU) in CONFIG_PREEMPTION
  * kernel builds, RCU read-side critical sections may be preempted,
  * but explicit blocking is illegal.  Finally, in preemptible RCU
  * implementations in real-time (with -rt patchset) kernel builds, RCU


@@ -53,7 +53,7 @@ void rcu_scheduler_starting(void);
 extern int rcu_scheduler_active __read_mostly;
 void rcu_end_inkernel_boot(void);
 bool rcu_is_watching(void);
-#ifndef CONFIG_PREEMPT
+#ifndef CONFIG_PREEMPTION
 void rcu_all_qs(void);
 #endif


@@ -295,6 +295,11 @@ enum uclamp_id {
 	UCLAMP_CNT
 };
 
+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
@@ -1767,7 +1772,7 @@ static inline int test_tsk_need_resched(struct task_struct *tsk)
 * value indicates whether a reschedule was done in fact.
 * cond_resched_lock() will drop the spinlock before scheduling,
 */
-#ifndef CONFIG_PREEMPT
+#ifndef CONFIG_PREEMPTION
 extern int _cond_resched(void);
 #else
 static inline int _cond_resched(void) { return 0; }
@@ -1796,12 +1801,12 @@ static inline void cond_resched_rcu(void)
 
 /*
 * Does a critical section need to be broken due to another
- * task waiting?: (technically does not depend on CONFIG_PREEMPT,
+ * task waiting?: (technically does not depend on CONFIG_PREEMPTION,
 * but a general need for low latency)
 */
 static inline int spin_needbreak(spinlock_t *lock)
 {
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	return spin_is_contended(lock);
 #else
 	return 0;


@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */


@@ -105,7 +105,11 @@ extern void sched_exec(void);
 #define sched_exec()   {}
 #endif
 
-#define get_task_struct(tsk) do { refcount_inc(&(tsk)->usage); } while(0)
+static inline struct task_struct *get_task_struct(struct task_struct *t)
+{
+	refcount_inc(&t->usage);
+	return t;
+}
 
 extern void __put_task_struct(struct task_struct *t);


@@ -150,6 +150,10 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void partition_sched_domains_locked(int ndoms_new,
+					   cpumask_var_t doms_new[],
+					   struct sched_domain_attr *dattr_new);
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
@@ -194,6 +198,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 struct sched_domain_attr;
 
+static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+			       struct sched_domain_attr *dattr_new)
+{
+}
+
 static inline void
 partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 			struct sched_domain_attr *dattr_new)


@@ -214,7 +214,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 
 /*
  * Define the various spin_lock methods.  Note we define these
- * regardless of whether CONFIG_SMP or CONFIG_PREEMPT are set. The
+ * regardless of whether CONFIG_SMP or CONFIG_PREEMPTION are set. The
  * various methods are defined as nops in the case they are not
  * required.
  */


@@ -96,7 +96,7 @@ static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
- * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
+ * even on CONFIG_PREEMPTION, because lockdep assumes that interrupts are
 * not re-enabled during lock-acquire (which the preempt-spin-ops do):
 */
 #if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)


@@ -60,6 +60,20 @@ int arch_update_cpu_topology(void);
 */
 #define RECLAIM_DISTANCE 30
 #endif
+
+/*
+ * The following tunable allows platforms to override the default node
+ * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are
+ * sufficiently fast that the default value actually hurts
+ * performance.
+ *
+ * AMD EPYC machines use this because even though the 2-hop distance
+ * is 32 (3.2x slower than a local memory access) performance actually
+ * *improves* if allowed to reclaim memory and load balance tasks
+ * between NUMA nodes 2-hops apart.
+ */
+extern int __read_mostly node_reclaim_distance;
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif

View File

@@ -86,7 +86,7 @@ void _torture_stop_kthread(char *m, struct task_struct **tp);
 #define torture_stop_kthread(n, tp) \
 	_torture_stop_kthread("Stopping " #n " task", &(tp))
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 #define torture_preempt_schedule() preempt_schedule()
 #else
 #define torture_preempt_schedule()

View File

@@ -931,6 +931,28 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UCLAMP_TASK_GROUP
+	bool "Utilization clamping per group of tasks"
+	depends on CGROUP_SCHED
+	depends on UCLAMP_TASK
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows to clamp the maximum frequency a task
+	  can use, while the min bandwidth allows to define a minimum
+	  frequency a task will always use.
+
+	  When task group based utilization clamping is enabled, an eventually
+	  specified task-specific clamp value is constrained by the cgroup
+	  specified clamp value. Both minimum and maximum task clamping cannot
+	  be bigger than the corresponding clamping defined at task group level.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help

View File

@@ -174,7 +174,7 @@ struct task_struct init_task
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	.ret_stack	= NULL,
 #endif
-#if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPT)
+#if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPTION)
 	.trace_recursion = 0,
 #endif
 #ifdef CONFIG_LIVEPATCH

View File

@@ -433,7 +433,7 @@ noinline void __ref rest_init(void)
 	/*
 	 * Enable might_sleep() and smp_processor_id() checks.
-	 * They cannot be enabled earlier because with CONFIG_PREEMPT=y
+	 * They cannot be enabled earlier because with CONFIG_PREEMPTION=y
 	 * kernel_thread() would trigger might_sleep() splats. With
 	 * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
 	 * already, but it's stuck on the kthreadd_done completion.

View File

@@ -1891,7 +1891,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
  */
 static bool use_task_css_set_links __read_mostly;
 
-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;

View File

@@ -45,6 +45,7 @@
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
 #include <linux/seq_file.h>
@@ -332,7 +333,18 @@ static struct cpuset top_cpuset = {
  * guidelines for accessing subsystem state in kernel/cgroup.c
  */
 
-static DEFINE_MUTEX(cpuset_mutex);
+DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+
+void cpuset_read_lock(void)
+{
+	percpu_down_read(&cpuset_rwsem);
+}
+
+void cpuset_read_unlock(void)
+{
+	percpu_up_read(&cpuset_rwsem);
+}
+
 static DEFINE_SPINLOCK(callback_lock);
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
@@ -894,6 +906,67 @@ done:
 	return ndoms;
 }
+
+static void update_tasks_root_domain(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, 0, &it);
+
+	while ((task = css_task_iter_next(&it)))
+		dl_add_task_root_domain(task);
+
+	css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+	struct cpuset *cs = NULL;
+	struct cgroup_subsys_state *pos_css;
+
+	percpu_rwsem_assert_held(&cpuset_rwsem);
+	lockdep_assert_cpus_held();
+	lockdep_assert_held(&sched_domains_mutex);
+
+	cgroup_enable_task_cg_lists();
+
+	rcu_read_lock();
+
+	/*
+	 * Clear default root domain DL accounting, it will be computed again
+	 * if a task belongs to it.
+	 */
+	dl_clear_root_domain(&def_root_domain);
+
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+		if (cpumask_empty(cs->effective_cpus)) {
+			pos_css = css_rightmost_descendant(pos_css);
+			continue;
+		}
+
+		css_get(&cs->css);
+
+		rcu_read_unlock();
+
+		update_tasks_root_domain(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+	rebuild_root_domains();
+	mutex_unlock(&sched_domains_mutex);
+}
 
 /*
  * Rebuild scheduler domains.
  *
@@ -911,8 +984,8 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	int ndoms;
 
-	lockdep_assert_held(&cpuset_mutex);
-	get_online_cpus();
+	lockdep_assert_cpus_held();
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 
 	/*
 	 * We have raced with CPU hotplug. Don't do anything to avoid
@@ -921,19 +994,17 @@ static void rebuild_sched_domains_locked(void)
 	 */
 	if (!top_cpuset.nr_subparts_cpus &&
 	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	if (top_cpuset.nr_subparts_cpus &&
 	    !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-		goto out;
+		return;
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
 
 	/* Have scheduler rebuild the domains */
-	partition_sched_domains(ndoms, doms, attr);
-out:
-	put_online_cpus();
+	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 }
 #else /* !CONFIG_SMP */
 static void rebuild_sched_domains_locked(void)
@@ -943,9 +1014,11 @@ static void rebuild_sched_domains_locked(void)
 
 void rebuild_sched_domains(void)
 {
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 	rebuild_sched_domains_locked();
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 /**
@@ -1051,7 +1124,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	bool part_error = false;	/* Partition error? */
 
-	lockdep_assert_held(&cpuset_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 
 	/*
 	 * The parent must be a partition root.
@@ -2039,7 +2112,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* allow moving tasks into an empty cpuset if on default hierarchy */
 	ret = -ENOSPC;
@@ -2063,7 +2136,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cs->attach_in_progress++;
 	ret = 0;
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 	return ret;
 }
 
@@ -2073,9 +2146,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_first(tset, &css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	css_cs(css)->attach_in_progress--;
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2098,7 +2171,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* prepare for attach */
 	if (cs == &top_cpuset)
@@ -2152,7 +2225,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	if (!cs->attach_in_progress)
 		wake_up(&cpuset_attach_wq);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2183,7 +2256,8 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = 0;
 
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
 		goto out_unlock;
@@ -2219,7 +2293,8 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2230,7 +2305,8 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	cpuset_filetype_t type = cft->private;
 	int retval = -ENODEV;
 
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2243,7 +2319,8 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return retval;
 }
 
@@ -2282,7 +2359,8 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	kernfs_break_active_protection(of->kn);
 	flush_work(&cpuset_hotplug_work);
 
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2306,7 +2384,8 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	free_cpuset(trialcs);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
 	flush_workqueue(cpuset_migrate_mm_wq);
@@ -2437,13 +2516,15 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		return -EINVAL;
 
 	css_get(&cs->css);
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
 	retval = update_prstate(cs, val);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	css_put(&cs->css);
 	return retval ?: nbytes;
 }
@@ -2649,7 +2730,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (!parent)
 		return 0;
 
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 
 	set_bit(CS_ONLINE, &cs->flags);
 	if (is_spread_page(parent))
@@ -2700,7 +2782,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
 out_unlock:
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 	return 0;
 }
 
@@ -2719,7 +2802,8 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
-	mutex_lock(&cpuset_mutex);
+	get_online_cpus();
+	percpu_down_write(&cpuset_rwsem);
 
 	if (is_partition_root(cs))
 		update_prstate(cs, 0);
@@ -2738,7 +2822,8 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
+	put_online_cpus();
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
@@ -2750,7 +2835,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 	spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
@@ -2763,7 +2848,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 	}
 
 	spin_unlock_irq(&callback_lock);
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /*
@@ -2805,6 +2890,8 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 
 int __init cpuset_init(void)
 {
+	BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
+
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@@ -2876,7 +2963,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
 		   nodes_empty(cs->mems_allowed);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/*
 	 * Move tasks to the nearest ancestor with execution resources,
@@ -2886,7 +2973,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	if (is_empty)
 		remove_tasks_in_empty_cpuset(cs);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 }
 
 static void
@@ -2936,14 +3023,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/*
 	 * We have raced with task attaching. We wait until attaching
 	 * is finished, so we won't attach a task to an empty cpuset.
 	 */
 	if (cs->attach_in_progress) {
-		mutex_unlock(&cpuset_mutex);
+		percpu_up_write(&cpuset_rwsem);
 		goto retry;
 	}
 
@@ -3011,7 +3098,7 @@ update_tasks:
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 }
 
 /**
@@ -3041,7 +3128,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
 		ptmp = &tmp;
 
-	mutex_lock(&cpuset_mutex);
+	percpu_down_write(&cpuset_rwsem);
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3091,7 +3178,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		update_tasks_nodemask(&top_cpuset);
 	}
 
-	mutex_unlock(&cpuset_mutex);
+	percpu_up_write(&cpuset_rwsem);
 
 	/* if cpus or mems changed, we need to propagate to descendants */
 	if (cpus_updated || mems_updated) {

View File

@@ -4174,10 +4174,8 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
 		return NULL;
 
 	__perf_event_init_context(ctx);
-	if (task) {
-		ctx->task = task;
-		get_task_struct(task);
-	}
+	if (task)
+		ctx->task = get_task_struct(task);
 	ctx->pmu = pmu;
 
 	return ctx;
@@ -10440,8 +10438,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		 * and we cannot use the ctx information because we need the
 		 * pmu before we get a ctx.
 		 */
-		get_task_struct(task);
-		event->hw.target = task;
+		event->hw.target = get_task_struct(task);
 	}
 
 	event->clock = &local_clock;

View File

@@ -1255,8 +1255,7 @@ setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 	 * the thread dies to avoid that the interrupt code
 	 * references an already freed task_struct.
 	 */
-	get_task_struct(t);
-	new->thread = t;
+	new->thread = get_task_struct(t);
 	/*
 	 * Tell the thread to set its affinity. This is
 	 * important for shared interrupt handlers as we do

View File

@@ -1907,7 +1907,7 @@ int register_kretprobe(struct kretprobe *rp)
 	/* Pre-allocate memory for max kretprobe instances */
 	if (rp->maxactive <= 0) {
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 		rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus());
 #else
 		rp->maxactive = num_possible_cpus();

View File

@@ -628,8 +628,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	}
 
 	/* [10] Grab the next task, i.e. owner of @lock */
-	task = rt_mutex_owner(lock);
-	get_task_struct(task);
+	task = get_task_struct(rt_mutex_owner(lock));
 	raw_spin_lock(&task->pi_lock);
 
 	/*
@@ -709,8 +708,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	}
 
 	/* [10] Grab the next task, i.e. the owner of @lock */
-	task = rt_mutex_owner(lock);
-	get_task_struct(task);
+	task = get_task_struct(rt_mutex_owner(lock));
 	raw_spin_lock(&task->pi_lock);
 
 	/* [11] requeue the pi waiters if necessary */

View File

@@ -7,7 +7,7 @@ menu "RCU Subsystem"
 
 config TREE_RCU
 	bool
-	default y if !PREEMPT && SMP
+	default y if !PREEMPTION && SMP
 	help
 	  This option selects the RCU implementation that is
 	  designed for very large SMP system with hundreds or
@@ -16,7 +16,7 @@ config TREE_RCU
 
 config PREEMPT_RCU
 	bool
-	default y if PREEMPT
+	default y if PREEMPTION
 	help
 	  This option selects the RCU implementation that is
 	  designed for very large SMP systems with hundreds or
@@ -28,7 +28,7 @@ config PREEMPT_RCU
 
 config TINY_RCU
 	bool
-	default y if !PREEMPT && !SMP
+	default y if !PREEMPTION && !SMP
 	help
 	  This option selects the RCU implementation that is
 	  designed for UP systems from which real-time response
@@ -70,7 +70,7 @@ config TREE_SRCU
 	  This option selects the full-fledged version of SRCU.
 
 config TASKS_RCU
-	def_bool PREEMPT
+	def_bool PREEMPTION
 	select SRCU
 	help
 	  This option enables a task-based RCU implementation that uses

View File

@@ -1912,7 +1912,7 @@ rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
 	struct rcu_node *rnp_p;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
-	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT)) ||
+	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPTION)) ||
 	    WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)) ||
 	    rnp->qsmask != 0) {
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
@@ -2266,7 +2266,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 		mask = 0;
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		if (rnp->qsmask == 0) {
-			if (!IS_ENABLED(CONFIG_PREEMPT) ||
+			if (!IS_ENABLED(CONFIG_PREEMPTION) ||
 			    rcu_preempt_blocked_readers_cgp(rnp)) {
 				/*
 				 * No point in scanning bits because they
@@ -2681,7 +2681,7 @@ static int rcu_blocking_is_gp(void)
 {
 	int ret;
 
-	if (IS_ENABLED(CONFIG_PREEMPT))
+	if (IS_ENABLED(CONFIG_PREEMPTION))
 		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
 	might_sleep();  /* Check for RCU read-side critical section. */
 	preempt_disable();
@@ -3297,13 +3297,13 @@ static int __init rcu_spawn_gp_kthread(void)
 	t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
 	if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
 		return 0;
+	rnp = rcu_get_root();
+	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	rcu_state.gp_kthread = t;
 	if (kthread_prio) {
 		sp.sched_priority = kthread_prio;
 		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
 	}
-	rnp = rcu_get_root();
-	raw_spin_lock_irqsave_rcu_node(rnp, flags);
-	rcu_state.gp_kthread = t;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	wake_up_process(t);
 	rcu_spawn_nocb_kthreads();

View File

@@ -163,7 +163,7 @@ static void rcu_iw_handler(struct irq_work *iwp)
 //
 // Printing RCU CPU stall warnings
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 
 /*
  * Dump detailed information for all tasks blocking the current RCU
@@ -215,7 +215,7 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
 	return ndetected;
 }
 
-#else /* #ifdef CONFIG_PREEMPT */
+#else /* #ifdef CONFIG_PREEMPTION */
 
 /*
  * Because preemptible RCU does not exist, we never have to check for
@@ -233,7 +233,7 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
 {
 	return 0;
 }
-#endif /* #else #ifdef CONFIG_PREEMPT */
+#endif /* #else #ifdef CONFIG_PREEMPTION */
 
 /*
  * Dump stacks of all tasks running on stalled CPUs. First try using

View File

@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -798,7 +810,7 @@ static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
 	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
 }
 
-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
 {
 	if (clamp_id == UCLAMP_MIN)
 		return 0;
@@ -814,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 }
 
 static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
 		  unsigned int clamp_value)
 {
 	/*
@@ -830,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
 	return uclamp_none(UCLAMP_MIN);
 }
 
-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 				     unsigned int clamp_value)
 {
 	/* Reset max-clamp retention only on idle exit */
@@ -841,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
 }
 
 static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 				 unsigned int clamp_value)
 {
 	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
 	int bucket_id = UCLAMP_BUCKETS - 1;
@@ -861,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
+{
+	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	struct uclamp_se uc_max;
+
+	/*
+	 * Tasks in autogroups or root task group will be
+	 * restricted by system defaults.
+	 */
+	if (task_group_is_autogroup(task_group(p)))
+		return uc_req;
+	if (task_group(p) == &root_task_group)
+		return uc_req;
+
+	uc_max = task_group(p)->uclamp[clamp_id];
+	if (uc_req.value > uc_max.value || !uc_req.user_defined)
+		return uc_max;
+#endif
+
+	return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
 	struct uclamp_se uc_max = uclamp_default[clamp_id];
 
 	/* System default restrictions always apply */
@@ -880,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 	return uc_req;
 }
 
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_eff;
@@ -904,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
  * for each bucket when all its RUNNABLE tasks require the same clamp.
  */
 static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-				    unsigned int clamp_id)
+				    enum uclamp_id clamp_id)
 {
 	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -942,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
  * enforce the expected state and warn.
  */
 static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-				    unsigned int clamp_id)
+				    enum uclamp_id clamp_id)
 {
 	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -981,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
@@ -996,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
 
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
@@ -1005,15 +1043,82 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 		uclamp_rq_dec_id(rq, p, clamp_id);
 }
static inline void
uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
{
struct rq_flags rf;
struct rq *rq;
/*
* Lock the task and the rq where the task is (or was) queued.
*
* We might lock the (previous) rq of a !RUNNABLE task, but that's the
* price to pay to safely serialize util_{min,max} updates with
* enqueues, dequeues and migration operations.
* This is the same locking schema used by __set_cpus_allowed_ptr().
*/
rq = task_rq_lock(p, &rf);
/*
* Setting the clamp bucket is serialized by task_rq_lock().
* If the task is not yet RUNNABLE and its task_struct is not
* affecting a valid clamp bucket, the next time it's enqueued,
* it will already see the updated clamp bucket value.
*/
if (!p->uclamp[clamp_id].active) {
uclamp_rq_dec_id(rq, p, clamp_id);
uclamp_rq_inc_id(rq, p, clamp_id);
}
task_rq_unlock(rq, p, &rf);
}
static inline void
uclamp_update_active_tasks(struct cgroup_subsys_state *css,
unsigned int clamps)
{
enum uclamp_id clamp_id;
struct css_task_iter it;
struct task_struct *p;
css_task_iter_start(css, 0, &it);
while ((p = css_task_iter_next(&it))) {
for_each_clamp_id(clamp_id) {
if ((0x1 << clamp_id) & clamps)
uclamp_update_active(p, clamp_id);
}
}
css_task_iter_end(&it);
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css);
static void uclamp_update_root_tg(void)
{
struct task_group *tg = &root_task_group;
uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
sysctl_sched_uclamp_util_min, false);
uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
sysctl_sched_uclamp_util_max, false);
rcu_read_lock();
cpu_util_update_eff(&root_task_group.css);
rcu_read_unlock();
}
#else
static void uclamp_update_root_tg(void) { }
#endif
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
				void __user *buffer, size_t *lenp,
				loff_t *ppos)
{
+	bool update_root_tg = false;
	int old_min, old_max;
-	static DEFINE_MUTEX(mutex);
	int result;

-	mutex_lock(&mutex);
+	mutex_lock(&uclamp_mutex);
	old_min = sysctl_sched_uclamp_util_min;
	old_max = sysctl_sched_uclamp_util_max;
@@ -1032,23 +1137,30 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
	if (old_min != sysctl_sched_uclamp_util_min) {
		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
			      sysctl_sched_uclamp_util_min, false);
+		update_root_tg = true;
	}
	if (old_max != sysctl_sched_uclamp_util_max) {
		uclamp_se_set(&uclamp_default[UCLAMP_MAX],
			      sysctl_sched_uclamp_util_max, false);
+		update_root_tg = true;
	}

+	if (update_root_tg)
+		uclamp_update_root_tg();
+
	/*
-	 * Updating all the RUNNABLE task is expensive, keep it simple and do
-	 * just a lazy update at each next enqueue time.
+	 * We update all RUNNABLE tasks only when task groups are in use.
+	 * Otherwise, keep it simple and do just a lazy update at each next
+	 * task enqueue time.
	 */

	goto done;

undo:
	sysctl_sched_uclamp_util_min = old_min;
	sysctl_sched_uclamp_util_max = old_max;
done:
-	mutex_unlock(&mutex);
+	mutex_unlock(&uclamp_mutex);

	return result;
}
@@ -1075,7 +1187,7 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
				  const struct sched_attr *attr)
{
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;

	/*
	 * On scheduling class change, reset to default clamps for tasks
@@ -1112,7 +1224,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
static void uclamp_fork(struct task_struct *p)
{
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;

	for_each_clamp_id(clamp_id)
		p->uclamp[clamp_id].active = false;
@@ -1134,9 +1246,11 @@ static void uclamp_fork(struct task_struct *p)
static void __init init_uclamp(void)
{
	struct uclamp_se uc_max = {};
-	unsigned int clamp_id;
+	enum uclamp_id clamp_id;
	int cpu;

+	mutex_init(&uclamp_mutex);
+
	for_each_possible_cpu(cpu) {
		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
		cpu_rq(cpu)->uclamp_flags = 0;
@@ -1149,8 +1263,13 @@ static void __init init_uclamp(void)

	/* System defaults allow max clamp values for both indexes */
	uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
-	for_each_clamp_id(clamp_id)
+	for_each_clamp_id(clamp_id) {
		uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+		root_task_group.uclamp_req[clamp_id] = uc_max;
+		root_task_group.uclamp[clamp_id] = uc_max;
+#endif
+	}
}
#else /* CONFIG_UCLAMP_TASK */
@@ -1494,7 +1613,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
}
/*
@@ -3214,12 +3333,8 @@ static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
-	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);

-	mm = next->mm;
-	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
@@ -3228,22 +3343,37 @@ context_switch(struct rq *rq, struct task_struct *prev,
	arch_start_context_switch(prev);

	/*
-	 * If mm is non-NULL, we pass through switch_mm(). If mm is
-	 * NULL, we will pass through mmdrop() in finish_task_switch().
-	 * Both of these contain the full memory barrier required by
-	 * membarrier after storing to rq->curr, before returning to
-	 * user-space.
+	 * kernel -> kernel   lazy + transfer active
+	 *   user -> kernel   lazy + mmgrab() active
+	 *
+	 * kernel ->   user   switch + mmdrop() active
+	 *   user ->   user   switch
	 */
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
-
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+	if (!next->mm) {                                // to kernel
+		enter_lazy_tlb(prev->active_mm, next);
+
+		next->active_mm = prev->active_mm;
+		if (prev->mm)                           // from user
+			mmgrab(prev->active_mm);
+		else
+			prev->active_mm = NULL;
+	} else {                                        // to user
+		/*
+		 * sys_membarrier() requires an smp_mb() between setting
+		 * rq->curr and returning to userspace.
+		 *
+		 * The below provides this either through switch_mm(), or in
+		 * case 'prev->active_mm == next->mm' through
+		 * finish_task_switch()'s mmdrop().
+		 */
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+		if (!prev->mm) {                        // from kernel
+			/* will mmdrop() in finish_task_switch(). */
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
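The four mm transitions in the rewritten context_switch() hunk above can be checked as a reference-counting exercise. This is a toy model under our own names (`toy_mm`, `toy_task` are not kernel types): a NULL `mm` marks a kernel thread, `active_mm` is the borrowed mm, and `grabs` counts mmgrab() minus mmdrop(). Unlike the kernel, the to-user drop happens inline here rather than deferred to finish_task_switch().

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of context_switch()'s mm bookkeeping; names are ours. */
struct toy_mm { int grabs; };
struct toy_task { struct toy_mm *mm, *active_mm; };

static void toy_switch(struct toy_task *prev, struct toy_task *next)
{
	if (!next->mm) {			/* to kernel: run lazily on prev's mm */
		next->active_mm = prev->active_mm;
		if (prev->mm)			/* from user: take a reference */
			prev->active_mm->grabs++;
		else				/* from kernel: transfer the reference */
			prev->active_mm = NULL;
	} else {				/* to user: real mm switch */
		if (!prev->mm) {		/* from kernel: drop the borrowed ref
						 * (kernel defers this via rq->prev_mm) */
			prev->active_mm->grabs--;
			prev->active_mm = NULL;
		}
	}
}
```

Running user → kernel → user round trips through this model shows the grab/drop pairs balance, which is the invariant the real code preserves.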
@@ -3622,7 +3752,7 @@ static inline void sched_tick_start(int cpu) { }
static inline void sched_tick_stop(int cpu) { }
#endif

-#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
+#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
				defined(CONFIG_TRACE_PREEMPT_TOGGLE))
/*
 * If the value passed in is equal to the current preempt count
@@ -3780,7 +3910,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
		p = fair_sched_class.pick_next_task(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
-			goto again;
+			goto restart;

		/* Assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
@@ -3789,14 +3919,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
		return p;
	}
-again:
+restart:
+	/*
+	 * Ensure that we put DL/RT tasks before the pick loop, such that they
+	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 */
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
	for_each_class(class) {
-		p = class->pick_next_task(rq, prev, rf);
-		if (p) {
-			if (unlikely(p == RETRY_TASK))
-				goto again;
+		p = class->pick_next_task(rq, NULL, NULL);
+		if (p)
			return p;
-		}
	}
	/* The idle class should always have a runnable task: */
@@ -3823,7 +3958,7 @@ again:
 * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 * called on the nearest possible occasion:
 *
- *  - If the kernel is preemptible (CONFIG_PREEMPT=y):
+ *  - If the kernel is preemptible (CONFIG_PREEMPTION=y):
 *
 *    - in syscall or exception context, at the next outmost
 *      preempt_enable(). (this might be as soon as the wake_up()'s
@@ -3832,7 +3967,7 @@ again:
 *    - in IRQ context, return from interrupt-handler to
 *      preemptible context
 *
- *  - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
+ *  - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
 *    then at the next:
 *
 *     - cond_resched() call
@@ -4077,7 +4212,7 @@ static void __sched notrace preempt_schedule_common(void)
	} while (need_resched());
}

-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/*
 * this is the entry point to schedule() from in-kernel preemption
 * off of preempt_enable. Kernel preemptions off return from interrupt
@@ -4149,7 +4284,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
}
EXPORT_SYMBOL_GPL(preempt_schedule_notrace);

-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
/*
 * this is the entry point to schedule() from kernel preemption
@@ -4317,7 +4452,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
	if (queued)
		enqueue_task(rq, p, queue_flag);
	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);

	check_class_changed(rq, p, prev_class, oldprio);
out_unlock:
@@ -4384,7 +4519,7 @@ void set_user_nice(struct task_struct *p, long nice)
			resched_curr(rq);
	}
	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
out_unlock:
	task_rq_unlock(rq, p, &rf);
}
@@ -4701,6 +4836,9 @@ recheck:
			return retval;
	}

+	if (pi)
+		cpuset_read_lock();
+
	/*
	 * Make sure no PI-waiters arrive (or leave) while we are
	 * changing the priority of the task:
@@ -4715,8 +4853,8 @@ recheck:
	 * Changing the policy of the stop threads its a very bad idea:
	 */
	if (p == rq->stop) {
-		task_rq_unlock(rq, p, &rf);
-		return -EINVAL;
+		retval = -EINVAL;
+		goto unlock;
	}

	/*
@@ -4734,8 +4872,8 @@ recheck:
			goto change;

		p->sched_reset_on_fork = reset_on_fork;
-		task_rq_unlock(rq, p, &rf);
-		return 0;
+		retval = 0;
+		goto unlock;
	}
change:
@@ -4748,8 +4886,8 @@ change:
		if (rt_bandwidth_enabled() && rt_policy(policy) &&
				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
				!task_group_is_autogroup(task_group(p))) {
-			task_rq_unlock(rq, p, &rf);
-			return -EPERM;
+			retval = -EPERM;
+			goto unlock;
		}
#endif
#ifdef CONFIG_SMP
@@ -4764,8 +4902,8 @@ change:
			 */
			if (!cpumask_subset(span, p->cpus_ptr) ||
			    rq->rd->dl_bw.bw == 0) {
-				task_rq_unlock(rq, p, &rf);
-				return -EPERM;
+				retval = -EPERM;
+				goto unlock;
			}
		}
#endif
@@ -4775,6 +4913,8 @@ change:
	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
		policy = oldpolicy = -1;
		task_rq_unlock(rq, p, &rf);
+		if (pi)
+			cpuset_read_unlock();
		goto recheck;
	}
@@ -4784,8 +4924,8 @@ change:
	 * is available.
	 */
	if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
-		task_rq_unlock(rq, p, &rf);
-		return -EBUSY;
+		retval = -EBUSY;
+		goto unlock;
	}

	p->sched_reset_on_fork = reset_on_fork;
@@ -4827,7 +4967,7 @@ change:
		enqueue_task(rq, p, queue_flags);
	}
	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);

	check_class_changed(rq, p, prev_class, oldprio);
@@ -4835,14 +4975,22 @@ change:
	preempt_disable();
	task_rq_unlock(rq, p, &rf);

-	if (pi)
+	if (pi) {
+		cpuset_read_unlock();
		rt_mutex_adjust_pi(p);
+	}

	/* Run balance callbacks after we've adjusted the PI chain: */
	balance_callback(rq);
	preempt_enable();

	return 0;
+
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	if (pi)
+		cpuset_read_unlock();
+	return retval;
}
static int _sched_setscheduler(struct task_struct *p, int policy,
@@ -4926,10 +5074,15 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
	rcu_read_lock();
	retval = -ESRCH;
	p = find_process_by_pid(pid);
-	if (p != NULL)
-		retval = sched_setscheduler(p, policy, &lparam);
+	if (likely(p))
+		get_task_struct(p);
	rcu_read_unlock();

+	if (likely(p)) {
+		retval = sched_setscheduler(p, policy, &lparam);
+		put_task_struct(p);
+	}
+
	return retval;
}
@@ -5460,7 +5613,7 @@ SYSCALL_DEFINE0(sched_yield)
	return 0;
}

-#ifndef CONFIG_PREEMPT
+#ifndef CONFIG_PREEMPTION
int __sched _cond_resched(void)
{
	if (should_resched(0)) {
@@ -5477,7 +5630,7 @@ EXPORT_SYMBOL(_cond_resched);
 * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
 * call schedule, and on return reacquire the lock.
 *
- * This works OK both with and without CONFIG_PREEMPT. We do strange low-level
+ * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
 * operations here to prevent schedule() from being called twice (once via
 * spin_unlock(), once by hand).
 */
@@ -6016,7 +6169,7 @@ void sched_setnuma(struct task_struct *p, int nid)
	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
	task_rq_unlock(rq, p, &rf);
}
#endif /* CONFIG_NUMA_BALANCING */
@@ -6056,22 +6209,23 @@ static void calc_load_migrate(struct rq *rq)
		atomic_long_add(delta, &calc_load_tasks);
}

-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static struct task_struct *__pick_migrate_task(struct rq *rq)
{
+	const struct sched_class *class;
+	struct task_struct *next;
+
+	for_each_class(class) {
+		next = class->pick_next_task(rq, NULL, NULL);
+		if (next) {
+			next->sched_class->put_prev_task(rq, next, NULL);
+			return next;
+		}
+	}
+
+	/* The idle class should always have a runnable task */
+	BUG();
}

-static const struct sched_class fake_sched_class = {
-	.put_prev_task = put_prev_task_fake,
-};
-
-static struct task_struct fake_task = {
-	/*
-	 * Avoid pull_{rt,dl}_task()
-	 */
-	.prio = MAX_PRIO + 1,
-	.sched_class = &fake_sched_class,
-};
-
/*
 * Migrate all tasks from the rq, sleeping tasks will be migrated by
 * try_to_wake_up()->select_task_rq().
@@ -6113,12 +6267,7 @@ static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf)
		if (rq->nr_running == 1)
			break;

-		/*
-		 * pick_next_task() assumes pinned rq->lock:
-		 */
-		next = pick_next_task(rq, &fake_task, rf);
-		BUG_ON(!next);
-		put_prev_task(rq, next);
+		next = __pick_migrate_task(rq);

		/*
		 * Rules for changing task_struct::cpus_mask are holding
* Rules for changing task_struct::cpus_mask are holding * Rules for changing task_struct::cpus_mask are holding
@ -6415,19 +6564,19 @@ DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
void __init sched_init(void) void __init sched_init(void)
{ {
unsigned long alloc_size = 0, ptr; unsigned long ptr = 0;
int i; int i;
wait_bit_init(); wait_bit_init();
#ifdef CONFIG_FAIR_GROUP_SCHED #ifdef CONFIG_FAIR_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **); ptr += 2 * nr_cpu_ids * sizeof(void **);
#endif #endif
#ifdef CONFIG_RT_GROUP_SCHED #ifdef CONFIG_RT_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **); ptr += 2 * nr_cpu_ids * sizeof(void **);
#endif #endif
if (alloc_size) { if (ptr) {
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT); ptr = (unsigned long)kzalloc(ptr, GFP_NOWAIT);
#ifdef CONFIG_FAIR_GROUP_SCHED #ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.se = (struct sched_entity **)ptr; root_task_group.se = (struct sched_entity **)ptr;
@@ -6746,7 +6895,7 @@ struct task_struct *curr_task(int cpu)

#ifdef CONFIG_IA64
/**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
 * @cpu: the processor in question.
 * @p: the task pointer to set.
 *
@@ -6771,6 +6920,20 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)

/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);
static inline void alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
#ifdef CONFIG_UCLAMP_TASK_GROUP
enum uclamp_id clamp_id;
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
#endif
}
static void sched_free_group(struct task_group *tg)
{
	free_fair_sched_group(tg);
@@ -6794,6 +6957,8 @@ struct task_group *sched_create_group(struct task_group *parent)
	if (!alloc_rt_sched_group(tg, parent))
		goto err;

+	alloc_uclamp_sched_group(tg, parent);
+
	return tg;

err:
@@ -6897,7 +7062,7 @@ void sched_move_task(struct task_struct *tsk)
	if (queued)
		enqueue_task(rq, tsk, queue_flags);
	if (running)
-		set_curr_task(rq, tsk);
+		set_next_task(rq, tsk);

	task_rq_unlock(rq, tsk, &rf);
}
@@ -6980,10 +7145,6 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
#ifdef CONFIG_RT_GROUP_SCHED
		if (!sched_rt_can_attach(css_tg(css), task))
			return -EINVAL;
-#else
-		/* We don't support RT-tasks being in separate groups */
-		if (task->sched_class != &fair_sched_class)
-			return -EINVAL;
#endif
		/*
		 * Serialize against wake_up_new_task() such that if its
@@ -7014,6 +7175,178 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
		sched_move_task(task);
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css)
{
struct cgroup_subsys_state *top_css = css;
struct uclamp_se *uc_parent = NULL;
struct uclamp_se *uc_se = NULL;
unsigned int eff[UCLAMP_CNT];
enum uclamp_id clamp_id;
unsigned int clamps;
css_for_each_descendant_pre(css, top_css) {
uc_parent = css_tg(css)->parent
? css_tg(css)->parent->uclamp : NULL;
for_each_clamp_id(clamp_id) {
/* Assume effective clamps matches requested clamps */
eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
/* Cap effective clamps with parent's effective clamps */
if (uc_parent &&
eff[clamp_id] > uc_parent[clamp_id].value) {
eff[clamp_id] = uc_parent[clamp_id].value;
}
}
/* Ensure protection is always capped by limit */
eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
/* Propagate most restrictive effective clamps */
clamps = 0x0;
uc_se = css_tg(css)->uclamp;
for_each_clamp_id(clamp_id) {
if (eff[clamp_id] == uc_se[clamp_id].value)
continue;
uc_se[clamp_id].value = eff[clamp_id];
uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
if (!clamps) {
css = css_rightmost_descendant(css);
continue;
}
/* Immediately update descendants RUNNABLE tasks */
uclamp_update_active_tasks(css, clamps);
}
}
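The capping rule that cpu_util_update_eff() applies at each level of the hierarchy can be sketched in isolation. This is a simplified model under our own names (`eff_clamp`, `eff_pair` are not kernel helpers, and plain ints stand in for `struct uclamp_se`): a child's effective clamp is its request capped by the parent's effective value, and the effective min is further capped by the effective max, per the "protection is always capped by limit" comment above.

```c
#include <assert.h>

/* Toy model of the per-level clamp propagation; names are ours. */
static unsigned int eff_clamp(unsigned int req, unsigned int parent_eff)
{
	return req > parent_eff ? parent_eff : req;
}

static void eff_pair(unsigned int req_min, unsigned int req_max,
		     unsigned int par_min, unsigned int par_max,
		     unsigned int *out_min, unsigned int *out_max)
{
	*out_min = eff_clamp(req_min, par_min);
	*out_max = eff_clamp(req_max, par_max);
	if (*out_min > *out_max)	/* protection never exceeds the limit */
		*out_min = *out_max;
}
```

A child asking for more than its parent allows is silently clipped, which is why the function only walks further down the tree when an effective value actually changed.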
/*
* Integer 10^N with a given N exponent by casting to integer the literal "1eN"
* C expression. Since there is no way to convert a macro argument (N) into a
* character constant, use two levels of macros.
*/
#define _POW10(exp) ((unsigned int)1e##exp)
#define POW10(exp) _POW10(exp)
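The two-level expansion above is needed because `##` pastes its operand before argument macros are expanded: `_POW10(UCLAMP_PERCENT_SHIFT)` would paste the token `1eUCLAMP_PERCENT_SHIFT` and fail to compile, while the `POW10()` wrapper expands the argument to `2` first. A standalone sketch of the same macros:

```c
#include <assert.h>

/* Same trick as the kernel's POW10(): build the literal "1e2" by token
 * pasting, then cast the resulting double constant to an integer. */
#define _POW10(exp) ((unsigned int)1e##exp)
#define POW10(exp) _POW10(exp)

#define UCLAMP_PERCENT_SHIFT	2
#define UCLAMP_PERCENT_SCALE	(100 * POW10(UCLAMP_PERCENT_SHIFT))
```

With the wrapper, `UCLAMP_PERCENT_SCALE` evaluates to 10000, i.e. 100% expressed with two fixed-point decimal places.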
struct uclamp_request {
#define UCLAMP_PERCENT_SHIFT 2
#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
s64 percent;
u64 util;
int ret;
};
static inline struct uclamp_request
capacity_from_percent(char *buf)
{
struct uclamp_request req = {
.percent = UCLAMP_PERCENT_SCALE,
.util = SCHED_CAPACITY_SCALE,
.ret = 0,
};
buf = strim(buf);
if (strcmp(buf, "max")) {
req.ret = cgroup_parse_float(buf, UCLAMP_PERCENT_SHIFT,
&req.percent);
if (req.ret)
return req;
if (req.percent > UCLAMP_PERCENT_SCALE) {
req.ret = -ERANGE;
return req;
}
req.util = req.percent << SCHED_CAPACITY_SHIFT;
req.util = DIV_ROUND_CLOSEST_ULL(req.util, UCLAMP_PERCENT_SCALE);
}
return req;
}
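The arithmetic at the end of capacity_from_percent() can be replayed outside the kernel. This is a sketch with our own helper name (`util_from_percent` is not a kernel function); the constants mirror the kernel's, and the rounding division stands in for `DIV_ROUND_CLOSEST_ULL()`: a percentage stored in fixed point with two decimals (50.00% == 5000) is mapped onto the 0..1024 capacity range.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)
#define PERCENT_SCALE		10000	/* 100 * 10^UCLAMP_PERCENT_SHIFT */

/* Hypothetical helper mirroring capacity_from_percent()'s math. */
static uint64_t util_from_percent(uint64_t percent)
{
	uint64_t util = percent << SCHED_CAPACITY_SHIFT;

	/* round-to-closest, as DIV_ROUND_CLOSEST_ULL() does */
	return (util + PERCENT_SCALE / 2) / PERCENT_SCALE;
}
```

So 100.00% maps to the full capacity of 1024, 50.00% to 512, and tiny fractions round to the nearest capacity unit, which is why the write side also keeps the exact requested percentage in `uclamp_pct[]`.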
static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off,
enum uclamp_id clamp_id)
{
struct uclamp_request req;
struct task_group *tg;
req = capacity_from_percent(buf);
if (req.ret)
return req.ret;
mutex_lock(&uclamp_mutex);
rcu_read_lock();
tg = css_tg(of_css(of));
if (tg->uclamp_req[clamp_id].value != req.util)
uclamp_se_set(&tg->uclamp_req[clamp_id], req.util, false);
/*
* Because of not recoverable conversion rounding we keep track of the
* exact requested value
*/
tg->uclamp_pct[clamp_id] = req.percent;
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(of_css(of));
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);
return nbytes;
}
static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MIN);
}
static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MAX);
}
static inline void cpu_uclamp_print(struct seq_file *sf,
enum uclamp_id clamp_id)
{
struct task_group *tg;
u64 util_clamp;
u64 percent;
u32 rem;
rcu_read_lock();
tg = css_tg(seq_css(sf));
util_clamp = tg->uclamp_req[clamp_id].value;
rcu_read_unlock();
if (util_clamp == SCHED_CAPACITY_SCALE) {
seq_puts(sf, "max\n");
return;
}
percent = tg->uclamp_pct[clamp_id];
percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);
seq_printf(sf, "%llu.%0*u\n", percent, UCLAMP_PERCENT_SHIFT, rem);
}
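The read side shown in cpu_uclamp_print() is the inverse: split the stored fixed-point percentage into an integer part and a two-digit remainder, then print it as "II.FF". A sketch with our own helper name (`percent_split` is not a kernel function; the kernel uses `div_u64_rem()`):

```c
#include <assert.h>
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: split a two-decimal fixed-point percentage. */
static uint64_t percent_split(uint64_t pct, uint32_t *rem)
{
	*rem = (uint32_t)(pct % 100);	/* 100 == POW10(UCLAMP_PERCENT_SHIFT) */
	return pct / 100;
}

/* Formatting as the seq_printf() above does, zero-padding the
 * remainder to UCLAMP_PERCENT_SHIFT digits: */
static void percent_print(uint64_t pct, char *buf, size_t len)
{
	uint32_t rem;
	uint64_t whole = percent_split(pct, &rem);

	snprintf(buf, len, "%llu.%02u", (unsigned long long)whole, rem);
}
```

A stored value of 1234 therefore reads back as "12.34", matching what was written.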
static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
{
cpu_uclamp_print(sf, UCLAMP_MIN);
return 0;
}
static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
{
cpu_uclamp_print(sf, UCLAMP_MAX);
return 0;
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
				struct cftype *cftype, u64 shareval)
@@ -7358,6 +7691,20 @@ static struct cftype cpu_legacy_files[] = {
		.read_u64 = cpu_rt_period_read_uint,
		.write_u64 = cpu_rt_period_write_uint,
	},
#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_min_show,
.write = cpu_uclamp_min_write,
},
{
.name = "uclamp.max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_max_show,
.write = cpu_uclamp_max_write,
},
#endif
	{ }	/* Terminate */
};
@@ -7525,6 +7872,20 @@ static struct cftype cpu_files[] = {
		.seq_show = cpu_max_show,
		.write = cpu_max_write,
	},
#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_min_show,
.write = cpu_uclamp_min_write,
},
{
.name = "uclamp.max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_max_show,
.write = cpu_uclamp_max_write,
},
#endif
	{ }	/* terminate */
};


@@ -263,9 +263,9 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
	 * irq metric. Because IRQ/steal time is hidden from the task clock we
	 * need to scale the task numbers:
	 *
-	 *              1 - irq
-	 *   U' = irq + ------- * U
-	 *                max
+	 *              max - irq
+	 *   U' = irq + --------- * U
+	 *                 max
	 */
	util = scale_irq_capacity(util, irq, max);
	util += irq;
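The corrected formula in the hunk above can be checked numerically. This is a sketch with integer math modeling `scale_irq_capacity()` (the real kernel helper works on capacity values the same way; the wrapper name `schedutil_util` is ours): time consumed by IRQs is invisible to the task clock, so task utilization U is scaled by the remaining `(max - irq) / max` fraction before the irq metric is added back.

```c
#include <assert.h>

#define CAP_MAX 1024UL	/* SCHED_CAPACITY_SCALE */

/* Integer model of scale_irq_capacity(): U * (max - irq) / max */
static unsigned long scale_irq_capacity(unsigned long util,
					unsigned long irq,
					unsigned long max)
{
	return util * (max - irq) / max;
}

/* Hypothetical wrapper mirroring the two lines of schedutil_cpu_util()
 * shown in the hunk above. */
static unsigned long schedutil_util(unsigned long util, unsigned long irq)
{
	util = scale_irq_capacity(util, irq, CAP_MAX);
	return util + irq;
}
```

With half the CPU busy with tasks and a quarter stolen by IRQs, the combined figure lands between the two, and with no IRQ pressure the task number passes through unchanged.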


@@ -529,6 +529,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
{
	struct rq *later_rq = NULL;
+	struct dl_bw *dl_b;

	later_rq = find_lock_later_rq(p, rq);
	if (!later_rq) {
@@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
		double_lock_balance(rq, later_rq);
	}
if (p->dl.dl_non_contending || p->dl.dl_throttled) {
/*
* Inactive timer is armed (or callback is running, but
* waiting for us to release rq locks). In any case, when it
* will fire (or continue), it will see running_bw of this
* task migrated to later_rq (and correctly handle it).
*/
sub_running_bw(&p->dl, &rq->dl);
sub_rq_bw(&p->dl, &rq->dl);
add_rq_bw(&p->dl, &later_rq->dl);
add_running_bw(&p->dl, &later_rq->dl);
} else {
sub_rq_bw(&p->dl, &rq->dl);
add_rq_bw(&p->dl, &later_rq->dl);
}
/*
* And we finally need to fixup root_domain(s) bandwidth accounting,
* since p is still hanging out in the old (now moved to default) root
* domain.
*/
dl_b = &rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
__dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
raw_spin_unlock(&dl_b->lock);
dl_b = &later_rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
raw_spin_unlock(&dl_b->lock);
	set_task_cpu(p, later_rq->cpu);
	double_unlock_balance(later_rq, rq);
@@ -1694,12 +1727,20 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
}
#endif

-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
{
	p->se.exec_start = rq_clock_task(rq);

	/* You can't push away the running task */
	dequeue_pushable_dl_task(rq, p);
+
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	deadline_queue_push_tasks(rq);
}

static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1720,64 +1761,42 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
	struct task_struct *p;
	struct dl_rq *dl_rq;

+	WARN_ON_ONCE(prev || rf);
+
	dl_rq = &rq->dl;

-	if (need_pull_dl_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
-		 * means a stop task can slip in, in which case we need to
-		 * re-start task selection.
-		 */
-		if (rq->stop && task_on_rq_queued(rq->stop))
-			return RETRY_TASK;
-	}
-
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
-
	if (unlikely(!dl_rq->dl_nr_running))
		return NULL;

-	put_prev_task(rq, prev);
-
	dl_se = pick_next_dl_entity(rq, dl_rq);
	BUG_ON(!dl_se);

	p = dl_task_of(dl_se);

-	set_next_task(rq, p);
-
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
-	deadline_queue_push_tasks(rq);
-
-	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_dl(rq, p);

	return p;
}
static void put_prev_task_dl(struct rq *rq, struct task_struct *p) static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
{ {
update_curr_dl(rq); update_curr_dl(rq);
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1); update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1) if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p); enqueue_pushable_dl_task(rq, p);
if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
/*
* This is OK, because current is on_cpu, which avoids it being
* picked for load-balance and preemption/IRQs are still
* disabled avoiding further scheduler activity on it and we've
* not yet started the picking loop.
*/
rq_unpin_lock(rq, rf);
pull_dl_task(rq);
rq_repin_lock(rq, rf);
}
}
/*
@@ -1811,11 +1830,6 @@ static void task_fork_dl(struct task_struct *p)
*/
}
static void set_curr_task_dl(struct rq *rq)
{
set_next_task(rq, rq->curr);
}
#ifdef CONFIG_SMP
/* Only try algorithms three times */
@@ -2275,6 +2289,36 @@ void __init init_sched_dl_class(void)
GFP_KERNEL, cpu_to_node(i));
}
void dl_add_task_root_domain(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
struct dl_bw *dl_b;
rq = task_rq_lock(p, &rf);
if (!dl_task(p))
goto unlock;
dl_b = &rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
raw_spin_unlock(&dl_b->lock);
unlock:
task_rq_unlock(rq, p, &rf);
}
void dl_clear_root_domain(struct root_domain *rd)
{
unsigned long flags;
raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
rd->dl_bw.total_bw = 0;
raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
}
#endif /* CONFIG_SMP */
static void switched_from_dl(struct rq *rq, struct task_struct *p)
@@ -2395,6 +2439,7 @@ const struct sched_class dl_sched_class = {
.pick_next_task = pick_next_task_dl,
.put_prev_task = put_prev_task_dl,
.set_next_task = set_next_task_dl,
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_dl,
@@ -2405,7 +2450,6 @@ const struct sched_class dl_sched_class = {
.task_woken = task_woken_dl,
#endif
.set_curr_task = set_curr_task_dl,
.task_tick = task_tick_dl,
.task_fork = task_fork_dl,


@@ -96,12 +96,12 @@ int __weak arch_asym_cpu_priority(int cpu)
}
/*
* The margin used when comparing utilization with CPU capacity.
*
* (default: ~20%)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
#endif
#ifdef CONFIG_CFS_BANDWIDTH
@@ -1188,47 +1188,6 @@ static unsigned int task_scan_max(struct task_struct *p)
return max(smin, smax);
}
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
struct mm_struct *mm = p->mm;
if (mm) {
mm_users = atomic_read(&mm->mm_users);
if (mm_users == 1) {
mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
mm->numa_scan_seq = 0;
}
}
p->node_stamp = 0;
p->numa_scan_seq = mm ? mm->numa_scan_seq : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
RCU_INIT_POINTER(p->numa_group, NULL);
p->last_task_numa_placement = 0;
p->last_sum_exec_runtime = 0;
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
p->numa_preferred_nid = NUMA_NO_NODE;
return;
}
/*
* New thread, keep existing numa_preferred_nid which should be copied
* already by arch_dup_task_struct but stagger when scans start.
*/
if (mm) {
unsigned int delay;
delay = min_t(unsigned int, task_scan_max(current),
current->numa_scan_period * mm_users * NSEC_PER_MSEC);
delay += 2 * TICK_NSEC;
p->node_stamp = delay;
}
}
static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
{
rq->nr_numa_running += (p->numa_preferred_nid != NUMA_NO_NODE);
@@ -2523,7 +2482,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
* The expensive part of numa migration is done from task_work context.
* Triggered from task_tick_numa().
*/
static void task_numa_work(struct callback_head *work)
{
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
@@ -2536,7 +2495,7 @@ void task_numa_work(struct callback_head *work)
SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
work->next = work;
/*
* Who cares about NUMA placement when they're dying.
*
@@ -2665,6 +2624,50 @@ out:
}
}
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
struct mm_struct *mm = p->mm;
if (mm) {
mm_users = atomic_read(&mm->mm_users);
if (mm_users == 1) {
mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
mm->numa_scan_seq = 0;
}
}
p->node_stamp = 0;
p->numa_scan_seq = mm ? mm->numa_scan_seq : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
/* Protect against double add, see task_tick_numa and task_numa_work */
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
RCU_INIT_POINTER(p->numa_group, NULL);
p->last_task_numa_placement = 0;
p->last_sum_exec_runtime = 0;
init_task_work(&p->numa_work, task_numa_work);
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
p->numa_preferred_nid = NUMA_NO_NODE;
return;
}
/*
* New thread, keep existing numa_preferred_nid which should be copied
* already by arch_dup_task_struct but stagger when scans start.
*/
if (mm) {
unsigned int delay;
delay = min_t(unsigned int, task_scan_max(current),
current->numa_scan_period * mm_users * NSEC_PER_MSEC);
delay += 2 * TICK_NSEC;
p->node_stamp = delay;
}
}
/*
* Drive the periodic memory faults..
*/
@@ -2693,10 +2696,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
curr->numa_scan_period = task_scan_start(curr);
curr->node_stamp += period;
if (!time_before(jiffies, curr->mm->numa_next_scan))
task_work_add(curr, work, true);
}
}
@@ -3689,8 +3690,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
return cfs_rq->avg.load_avg;
}
static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
static inline unsigned long task_util(struct task_struct *p)
{
return READ_ONCE(p->se.avg.util_avg);
@@ -3807,7 +3806,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
static inline int task_fits_capacity(struct task_struct *p, long capacity)
{
return fits_capacity(task_util_est(p), capacity);
}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -4370,8 +4369,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
cfs_b->expires_seq++;
}
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4393,8 +4390,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
u64 amount = 0, min_amount;
/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4411,61 +4407,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
expires_seq = cfs_b->expires_seq;
expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);
cfs_rq->runtime_remaining += amount;
/*
* we may have advanced our local expiration to account for allowed
* spread between our sched_clock and the one on which runtime was
* issued.
*/
if (cfs_rq->expires_seq != expires_seq) {
cfs_rq->expires_seq = expires_seq;
cfs_rq->runtime_expires = expires;
}
return cfs_rq->runtime_remaining > 0;
}
/*
* Note: This depends on the synchronization provided by sched_clock and the
* fact that rq->clock snapshots this value.
*/
static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
/* if the deadline is ahead of our clock, nothing to do */
if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
return;
if (cfs_rq->runtime_remaining < 0)
return;
/*
* If the local deadline has passed we have to consider the
* possibility that our sched_clock is 'fast' and the global deadline
* has not truly expired.
*
* Fortunately we can check determine whether this the case by checking
* whether the global deadline(cfs_b->expires_seq) has advanced.
*/
if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
/* global deadline is ahead, expiration has passed */
cfs_rq->runtime_remaining = 0;
}
}
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4556,7 +4508,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
long task_delta, idle_task_delta, dequeue = 1;
bool empty;
se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
@@ -4567,6 +4519,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
rcu_read_unlock();
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
/* throttled entity or throttle-on-deactivate */
@@ -4576,6 +4529,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (dequeue)
dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
if (qcfs_rq->load.weight)
dequeue = 0;
@@ -4615,7 +4569,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
int enqueue = 1;
long task_delta, idle_task_delta;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -4635,6 +4589,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
return;
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
for_each_sched_entity(se) {
if (se->on_rq)
enqueue = 0;
@@ -4643,6 +4598,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (enqueue)
enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
cfs_rq->h_nr_running += task_delta;
cfs_rq->idle_h_nr_running += idle_task_delta;
if (cfs_rq_throttled(cfs_rq))
break;
@@ -4658,8 +4614,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}
static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4684,7 +4639,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;
cfs_rq->runtime_remaining += runtime;
cfs_rq->runtime_expires = expires;
/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4709,7 +4663,7 @@ next:
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
u64 runtime;
int throttled;
/* no need to continue the timer with no bandwidth constraint */
@@ -4737,8 +4691,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;
runtime_expires = cfs_b->runtime_expires;
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4751,8 +4703,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);
cfs_b->distribute_running = 0;
@@ -4834,8 +4785,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;
raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;
/* we are under rq->lock, defer unthrottling using a timer */
@@ -4868,7 +4818,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4886,7 +4835,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;
if (runtime)
cfs_b->distribute_running = 1;
@@ -4895,11 +4843,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;
runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);
lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -5056,8 +5003,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}
@@ -5235,7 +5180,7 @@ static inline unsigned long cpu_util(int cpu);
static inline bool cpu_overutilized(int cpu)
{
return !fits_capacity(cpu_util(cpu), capacity_of(cpu));
}
static inline void update_overutilized_status(struct rq *rq)
@@ -5259,6 +5204,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
/*
* The code below (indirectly) updates schedutil which looks at
@@ -5291,6 +5237,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
flags = ENQUEUE_WAKEUP;
}
@@ -5298,6 +5245,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
if (cfs_rq_throttled(cfs_rq))
break;
@@ -5359,6 +5307,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -5373,6 +5322,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;
cfs_rq->h_nr_running--;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
@@ -5392,6 +5342,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running--;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_throttled(cfs_rq))
break;
@@ -5425,6 +5376,15 @@ static struct {
#endif /* CONFIG_NO_HZ_COMMON */
/* CPU only has SCHED_IDLE tasks enqueued */
static int sched_idle_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);
return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
rq->nr_running);
}
static unsigned long cpu_runnable_load(struct rq *rq)
{
return cfs_rq_runnable_load_avg(&rq->cfs);
@@ -5747,7 +5707,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
unsigned int min_exit_latency = UINT_MAX;
u64 latest_idle_timestamp = 0;
int least_loaded_cpu = this_cpu;
int shallowest_idle_cpu = -1, si_cpu = -1;
int i;
/* Check if we have any choice: */
@@ -5778,7 +5738,12 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
latest_idle_timestamp = rq->idle_stamp;
shallowest_idle_cpu = i;
}
} else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
if (sched_idle_cpu(i)) {
si_cpu = i;
continue;
}
load = cpu_runnable_load(cpu_rq(i));
if (load < min_load) {
min_load = load;
@@ -5787,7 +5752,11 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
}
}
if (shallowest_idle_cpu != -1)
return shallowest_idle_cpu;
if (si_cpu != -1)
return si_cpu;
return least_loaded_cpu;
}
static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
@@ -5940,7 +5909,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
*/
static int select_idle_smt(struct task_struct *p, int target)
{
int cpu, si_cpu = -1;
if (!static_branch_likely(&sched_smt_present))
return -1;
@@ -5950,9 +5919,11 @@ static int select_idle_smt(struct task_struct *p, int target)
continue;
if (available_idle_cpu(cpu))
return cpu;
if (si_cpu == -1 && sched_idle_cpu(cpu))
si_cpu = cpu;
}
return si_cpu;
}
#else /* CONFIG_SCHED_SMT */
@@ -5980,8 +5951,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
int this = smp_processor_id();
int cpu, nr = INT_MAX, si_cpu = -1;
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6009,11 +5980,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
if (!--nr)
return si_cpu;
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
if (available_idle_cpu(cpu))
break;
if (si_cpu == -1 && sched_idle_cpu(cpu))
si_cpu = cpu;
} }
time = cpu_clock(this) - time;
@@ -6032,13 +6005,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
struct sched_domain *sd;
int i, recent_used_cpu;
if (available_idle_cpu(target) || sched_idle_cpu(target))
return target;
/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)))
return prev;
/* Check a recently used CPU as a potential idle candidate: */
@@ -6046,7 +6020,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr)) {
/*
* Replace recent_used_cpu with prev as it is a potential
@@ -6282,69 +6256,55 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
}
/*
* compute_energy(): Estimates the energy that @pd would consume if @p was
* migrated to @dst_cpu. compute_energy() predicts what will be the utilization
* landscape of @pd's CPUs after the task migration, and uses the Energy Model
* to compute what would be the energy if we decided to actually migrate that
* task.
*/
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
struct cpumask *pd_mask = perf_domain_span(pd);
unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
unsigned long max_util = 0, sum_util = 0;
int cpu;
/*
* The capacity state of CPUs of the current rd can be driven by CPUs
* of another rd if they belong to the same pd. So, account for the
* utilization of these CPUs too by masking pd with cpu_online_mask
* instead of the rd span.
*
* If an entire pd is outside of the current rd, it will not appear in
* its pd list and will not be accounted by compute_energy().
*/
for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
/*
* Busy time computation: utilization clamping is not
* required since the ratio (sum_util / cpu_capacity)
* is already enough to scale the EM reported power
* consumption at the (eventually clamped) cpu_capacity.
*/
sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
ENERGY_UTIL, NULL);
/*
* Performance domain frequency: utilization clamping
* must be considered since it affects the selection
* of the performance domain frequency.
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
FREQUENCY_UTIL, tsk);
max_util = max(max_util, cpu_util);
}
return em_pd_energy(pd->em_pd, max_util, sum_util);
}
/*
@@ -6386,21 +6346,19 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* other use-cases too. So, until someone finds a better way to solve this,
* let's keep things simple by re-using the existing slow path.
*/
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
unsigned long cpu_cap, util, base_energy = 0;
int cpu, best_energy_cpu = prev_cpu;
struct sched_domain *sd;
struct perf_domain *pd;
rcu_read_lock();
pd = rcu_dereference(rd->pd);
if (!pd || READ_ONCE(rd->overutilized))
goto fail;
/*
* Energy-aware wake-up happens on the lowest sched_domain starting
@@ -6417,9 +6375,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
goto unlock;
for (; pd; pd = pd->next) {
unsigned long cur_delta, spare_cap, max_spare_cap = 0;
unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;
/* Compute the 'base' energy of the pd, without @p */
base_energy_pd = compute_energy(p, -1, pd);
base_energy += base_energy_pd;
for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
@ -6427,14 +6390,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
/* Skip CPUs that will be overutilized. */ /* Skip CPUs that will be overutilized. */
util = cpu_util_next(cpu, p, cpu); util = cpu_util_next(cpu, p, cpu);
cpu_cap = capacity_of(cpu); cpu_cap = capacity_of(cpu);
if (cpu_cap * 1024 < util * capacity_margin) if (!fits_capacity(util, cpu_cap))
continue; continue;
/* Always use prev_cpu as a candidate. */ /* Always use prev_cpu as a candidate. */
if (cpu == prev_cpu) { if (cpu == prev_cpu) {
prev_energy = compute_energy(p, prev_cpu, head); prev_delta = compute_energy(p, prev_cpu, pd);
best_energy = min(best_energy, prev_energy); prev_delta -= base_energy_pd;
continue; best_delta = min(best_delta, prev_delta);
} }
/* /*
@ -6450,9 +6413,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
/* Evaluate the energy impact of using this CPU. */ /* Evaluate the energy impact of using this CPU. */
if (max_spare_cap_cpu >= 0) { if (max_spare_cap_cpu >= 0) {
cur_energy = compute_energy(p, max_spare_cap_cpu, head); cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
if (cur_energy < best_energy) { cur_delta -= base_energy_pd;
best_energy = cur_energy; if (cur_delta < best_delta) {
best_delta = cur_delta;
best_energy_cpu = max_spare_cap_cpu; best_energy_cpu = max_spare_cap_cpu;
} }
} }
@ -6464,10 +6428,10 @@ unlock:
* Pick the best CPU if prev_cpu cannot be used, or if it saves at * Pick the best CPU if prev_cpu cannot be used, or if it saves at
* least 6% of the energy used by prev_cpu. * least 6% of the energy used by prev_cpu.
*/ */
if (prev_energy == ULONG_MAX) if (prev_delta == ULONG_MAX)
return best_energy_cpu; return best_energy_cpu;
if ((prev_energy - best_energy) > (prev_energy >> 4)) if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
return best_energy_cpu; return best_energy_cpu;
return prev_cpu; return prev_cpu;
@@ -6801,7 +6765,7 @@ again:
         goto idle;

 #ifdef CONFIG_FAIR_GROUP_SCHED
-    if (prev->sched_class != &fair_sched_class)
+    if (!prev || prev->sched_class != &fair_sched_class)
         goto simple;

     /*
@@ -6878,8 +6842,8 @@ again:
     goto done;
 simple:
 #endif
-
-    put_prev_task(rq, prev);
+    if (prev)
+        put_prev_task(rq, prev);

     do {
         se = pick_next_entity(cfs_rq, NULL);
@@ -6907,11 +6871,13 @@ done: __maybe_unused;
     return p;

 idle:
-    update_misfit_status(NULL, rq);
-    new_tasks = idle_balance(rq, rf);
+    if (!rf)
+        return NULL;
+
+    new_tasks = newidle_balance(rq, rf);

     /*
-     * Because idle_balance() releases (and re-acquires) rq->lock, it is
+     * Because newidle_balance() releases (and re-acquires) rq->lock, it is
      * possible for any higher priority task to appear. In that case we
      * must re-start the pick_next_entity() loop.
      */
@@ -6933,7 +6899,7 @@ idle:
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
     struct sched_entity *se = &prev->se;
     struct cfs_rq *cfs_rq;
@@ -7435,7 +7401,7 @@ static int detach_tasks(struct lb_env *env)
         detached++;
         env->imbalance -= load;

-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
         /*
          * NEWIDLE balancing is a source of latency, so preemptible
          * kernels will stop after the first task is detached to minimize
@@ -7982,8 +7948,7 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 static inline bool
 group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-    return sg->sgc->min_capacity * capacity_margin <
-                ref->sgc->min_capacity * 1024;
+    return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
 }

 /*
@@ -7993,8 +7958,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 static inline bool
 group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-    return sg->sgc->max_capacity * capacity_margin <
-                ref->sgc->max_capacity * 1024;
+    return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
 }

 static inline enum
@@ -9052,9 +9016,10 @@ more_balance:
 out_balanced:
     /*
      * We reach balance although we may have faced some affinity
-     * constraints. Clear the imbalance flag if it was set.
+     * constraints. Clear the imbalance flag only if other tasks got
+     * a chance to move and fix the imbalance.
      */
-    if (sd_parent) {
+    if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
         int *group_imbalance = &sd_parent->groups->sgc->imbalance;

         if (*group_imbalance)
@@ -9075,10 +9040,10 @@ out_one_pinned:
     ld_moved = 0;

     /*
-     * idle_balance() disregards balance intervals, so we could repeatedly
-     * reach this code, which would lead to balance_interval skyrocketting
-     * in a short amount of time. Skip the balance_interval increase logic
-     * to avoid that.
+     * newidle_balance() disregards balance intervals, so we could
+     * repeatedly reach this code, which would lead to balance_interval
+     * skyrocketting in a short amount of time. Skip the balance_interval
+     * increase logic to avoid that.
      */
     if (env.idle == CPU_NEWLY_IDLE)
         goto out;
@@ -9788,7 +9753,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
     unsigned long next_balance = jiffies + HZ;
     int this_cpu = this_rq->cpu;
@@ -9796,6 +9761,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
     int pulled_task = 0;
     u64 curr_cost = 0;

+    update_misfit_status(NULL, this_rq);
     /*
      * We must set idle_stamp _before_ calling idle_balance(), such that we
      * measure the duration of idle_balance() as idle time.
@@ -10180,9 +10146,19 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-    struct sched_entity *se = &rq->curr->se;
+    struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+    if (task_on_rq_queued(p)) {
+        /*
+         * Move the next running task to the front of the list, so our
+         * cfs_tasks list becomes MRU one.
+         */
+        list_move(&se->group_node, &rq->cfs_tasks);
+    }
+#endif

     for_each_sched_entity(se) {
         struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10300,18 +10276,18 @@ err:
 void online_fair_sched_group(struct task_group *tg)
 {
     struct sched_entity *se;
+    struct rq_flags rf;
     struct rq *rq;
     int i;

     for_each_possible_cpu(i) {
         rq = cpu_rq(i);
         se = tg->se[i];
-
-        raw_spin_lock_irq(&rq->lock);
+        rq_lock_irq(rq, &rf);
         update_rq_clock(rq);
         attach_entity_cfs_rq(se);
         sync_throttle(tg, i);
-        raw_spin_unlock_irq(&rq->lock);
+        rq_unlock_irq(rq, &rf);
     }
 }
@@ -10453,7 +10429,9 @@ const struct sched_class fair_sched_class = {
     .check_preempt_curr = check_preempt_wakeup,

     .pick_next_task     = pick_next_task_fair,
+
     .put_prev_task      = put_prev_task_fair,
+    .set_next_task      = set_next_task_fair,

 #ifdef CONFIG_SMP
     .select_task_rq     = select_task_rq_fair,
@@ -10466,7 +10444,6 @@ const struct sched_class fair_sched_class = {
     .set_cpus_allowed   = set_cpus_allowed_common,
 #endif

-    .set_curr_task      = set_curr_task_fair,
     .task_tick          = task_tick_fair,
     .task_fork          = task_fork_fair,


@@ -375,14 +375,27 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int flags)
     resched_curr(rq);
 }

+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+    update_idle_core(rq);
+    schedstat_inc(rq->sched_goidle);
+}
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-    put_prev_task(rq, prev);
-    update_idle_core(rq);
-    schedstat_inc(rq->sched_goidle);
+    struct task_struct *next = rq->idle;

-    return rq->idle;
+    if (prev)
+        put_prev_task(rq, prev);
+
+    set_next_task_idle(rq, next);
+
+    return next;
 }

 /*
@@ -398,10 +411,6 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
     raw_spin_lock_irq(&rq->lock);
 }

-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -414,10 +423,6 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
 {
 }

-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
     BUG();
@@ -452,13 +457,13 @@ const struct sched_class idle_sched_class = {

     .pick_next_task     = pick_next_task_idle,
     .put_prev_task      = put_prev_task_idle,
+    .set_next_task      = set_next_task_idle,

 #ifdef CONFIG_SMP
     .select_task_rq     = select_task_rq_idle,
     .set_cpus_allowed   = set_cpus_allowed_common,
 #endif

-    .set_curr_task      = set_curr_task_idle,
     .task_tick          = task_tick_idle,
     .get_rr_interval    = get_rr_interval_idle,


@@ -22,9 +22,17 @@ EXPORT_SYMBOL_GPL(housekeeping_enabled);

 int housekeeping_any_cpu(enum hk_flags flags)
 {
-    if (static_branch_unlikely(&housekeeping_overridden))
-        if (housekeeping_flags & flags)
+    int cpu;
+
+    if (static_branch_unlikely(&housekeeping_overridden)) {
+        if (housekeeping_flags & flags) {
+            cpu = sched_numa_find_closest(housekeeping_mask, smp_processor_id());
+            if (cpu < nr_cpu_ids)
+                return cpu;
+
             return cpumask_any_and(housekeeping_mask, cpu_online_mask);
+        }
+    }
     return smp_processor_id();
 }
 EXPORT_SYMBOL_GPL(housekeeping_any_cpu);


@@ -1198,7 +1198,7 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
     if (static_branch_likely(&psi_disabled))
         return -EOPNOTSUPP;

-    buf_size = min(nbytes, (sizeof(buf) - 1));
+    buf_size = min(nbytes, sizeof(buf));
     if (copy_from_user(buf, user_buf, buf_size))
         return -EFAULT;


@@ -1498,12 +1498,22 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
 #endif
 }

-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
     p->se.exec_start = rq_clock_task(rq);

     /* The running task is never eligible for pushing */
     dequeue_pushable_task(rq, p);
+
+    /*
+     * If prev task was rt, put_prev_task() has already updated the
+     * utilization. We only care of the case where we start to schedule a
+     * rt task
+     */
+    if (rq->curr->sched_class != &rt_sched_class)
+        update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+    rt_queue_push_tasks(rq);
 }

 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1543,56 +1553,19 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
     struct task_struct *p;
     struct rt_rq *rt_rq = &rq->rt;

-    if (need_pull_rt_task(rq, prev)) {
-        /*
-         * This is OK, because current is on_cpu, which avoids it being
-         * picked for load-balance and preemption/IRQs are still
-         * disabled avoiding further scheduler activity on it and we're
-         * being very careful to re-start the picking loop.
-         */
-        rq_unpin_lock(rq, rf);
-        pull_rt_task(rq);
-        rq_repin_lock(rq, rf);
-        /*
-         * pull_rt_task() can drop (and re-acquire) rq->lock; this
-         * means a dl or stop task can slip in, in which case we need
-         * to re-start task selection.
-         */
-        if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
-                     rq->dl.dl_nr_running))
-            return RETRY_TASK;
-    }
-
-    /*
-     * We may dequeue prev's rt_rq in put_prev_task().
-     * So, we update time before rt_queued check.
-     */
-    if (prev->sched_class == &rt_sched_class)
-        update_curr_rt(rq);
+    WARN_ON_ONCE(prev || rf);

     if (!rt_rq->rt_queued)
         return NULL;

-    put_prev_task(rq, prev);
-
     p = _pick_next_task_rt(rq);

-    set_next_task(rq, p);
-
-    rt_queue_push_tasks(rq);
-
-    /*
-     * If prev task was rt, put_prev_task() has already updated the
-     * utilization. We only care of the case where we start to schedule a
-     * rt task
-     */
-    if (rq->curr->sched_class != &rt_sched_class)
-        update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+    set_next_task_rt(rq, p);

     return p;
 }

-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
     update_curr_rt(rq);
@@ -1604,6 +1577,18 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
      */
     if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
         enqueue_pushable_task(rq, p);
+
+    if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+        /*
+         * This is OK, because current is on_cpu, which avoids it being
+         * picked for load-balance and preemption/IRQs are still
+         * disabled avoiding further scheduler activity on it and we've
+         * not yet started the picking loop.
+         */
+        rq_unpin_lock(rq, rf);
+        pull_rt_task(rq);
+        rq_repin_lock(rq, rf);
+    }
 }

 #ifdef CONFIG_SMP
@@ -2354,11 +2339,6 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
     }
 }

-static void set_curr_task_rt(struct rq *rq)
-{
-    set_next_task(rq, rq->curr);
-}
-
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 {
     /*
@@ -2380,6 +2360,7 @@ const struct sched_class rt_sched_class = {

     .pick_next_task     = pick_next_task_rt,
     .put_prev_task      = put_prev_task_rt,
+    .set_next_task      = set_next_task_rt,

 #ifdef CONFIG_SMP
     .select_task_rq     = select_task_rq_rt,
@@ -2391,7 +2372,6 @@ const struct sched_class rt_sched_class = {
     .switched_from      = switched_from_rt,
 #endif

-    .set_curr_task      = set_curr_task_rt,
     .task_tick          = task_tick_rt,
     .get_rr_interval    = get_rr_interval_rt,


@@ -335,8 +335,6 @@ struct cfs_bandwidth {
     u64         quota;
     u64         runtime;
     s64         hierarchical_quota;
-    u64         runtime_expires;
-    int         expires_seq;

     u8          idle;
     u8          period_active;
@@ -393,6 +391,16 @@ struct task_group {
 #endif

     struct cfs_bandwidth    cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+    /* The two decimal precision [%] value requested from user-space */
+    unsigned int        uclamp_pct[UCLAMP_CNT];
+    /* Clamp values requested for a task group */
+    struct uclamp_se    uclamp_req[UCLAMP_CNT];
+    /* Effective clamp values used for a task group */
+    struct uclamp_se    uclamp[UCLAMP_CNT];
+#endif
+
 };

 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -483,7 +491,8 @@ struct cfs_rq {
     struct load_weight  load;
     unsigned long       runnable_weight;
     unsigned int        nr_running;
-    unsigned int        h_nr_running;
+    unsigned int        h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
+    unsigned int        idle_h_nr_running; /* SCHED_IDLE */

     u64         exec_clock;
     u64         min_vruntime;
@@ -556,8 +565,6 @@ struct cfs_rq {

 #ifdef CONFIG_CFS_BANDWIDTH
     int         runtime_enabled;
-    int         expires_seq;
-    u64         runtime_expires;
     s64         runtime_remaining;

     u64         throttled_clock;
@@ -777,9 +784,6 @@ struct root_domain {
     struct perf_domain __rcu *pd;
 };

-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
@@ -1261,16 +1265,18 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-#endif
-
-#ifdef CONFIG_NUMA
 extern void sched_init_numa(void);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
+extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
 #else
 static inline void sched_init_numa(void) { }
 static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
 static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
+static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
+{
+    return nr_cpu_ids;
+}
 #endif

 #ifdef CONFIG_NUMA_BALANCING
@@ -1449,10 +1455,14 @@ static inline void unregister_sched_domain_sysctl(void)
 {
 }
 #endif

+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else

 static inline void sched_ttwu_pending(void) { }

+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+
 #endif /* CONFIG_SMP */

 #include "stats.h"
@@ -1700,17 +1710,21 @@ struct sched_class {
     void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

     /*
-     * It is the responsibility of the pick_next_task() method that will
-     * return the next task to call put_prev_task() on the @prev task or
-     * something equivalent.
+     * Both @prev and @rf are optional and may be NULL, in which case the
+     * caller must already have invoked put_prev_task(rq, prev, rf).
      *
-     * May return RETRY_TASK when it finds a higher prio class has runnable
-     * tasks.
+     * Otherwise it is the responsibility of the pick_next_task() to call
+     * put_prev_task() on the @prev task or something equivalent, IFF it
+     * returns a next task.
+     *
+     * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+     * higher prio class has runnable tasks.
      */
     struct task_struct * (*pick_next_task)(struct rq *rq,
                                            struct task_struct *prev,
                                            struct rq_flags *rf);
-    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+    void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
+    void (*set_next_task)(struct rq *rq, struct task_struct *p);

 #ifdef CONFIG_SMP
     int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1725,7 +1739,6 @@ struct sched_class {
     void (*rq_offline)(struct rq *rq);
 #endif

-    void (*set_curr_task)(struct rq *rq);
     void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
     void (*task_fork)(struct task_struct *p);
     void (*task_dead)(struct task_struct *p);
@@ -1755,12 +1768,14 @@ struct sched_class {

 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-    prev->sched_class->put_prev_task(rq, prev);
+    WARN_ON_ONCE(rq->curr != prev);
+    prev->sched_class->put_prev_task(rq, prev, NULL);
 }

-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-    curr->sched_class->set_curr_task(rq);
+    WARN_ON_ONCE(rq->curr != next);
+    next->sched_class->set_next_task(rq, next);
 }

 #ifdef CONFIG_SMP
@@ -1943,7 +1958,7 @@ unsigned long arch_scale_freq_capacity(int cpu)
 #endif

 #ifdef CONFIG_SMP
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION

 static inline void double_rq_lock(struct rq *rq1, struct rq *rq2);
@@ -1995,7 +2010,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
     return ret;
 }

-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */

 /*
  * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
@@ -2266,7 +2281,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */

 #ifdef CONFIG_UCLAMP_TASK
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id);
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

 static __always_inline
 unsigned int uclamp_util_with(struct rq *rq, unsigned int util,


@@ -157,9 +157,10 @@ static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
 {
     unsigned long long now = rq_clock(rq), delta = 0;

-    if (unlikely(sched_info_on()))
+    if (sched_info_on()) {
         if (t->sched_info.last_queued)
             delta = now - t->sched_info.last_queued;
+    }
     sched_info_reset_dequeued(t);
     t->sched_info.run_delay += delta;

@@ -192,7 +193,7 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
  */
 static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
 {
-    if (unlikely(sched_info_on())) {
+    if (sched_info_on()) {
         if (!t->sched_info.last_queued)
             t->sched_info.last_queued = rq_clock(rq);
     }
@@ -239,7 +240,7 @@ __sched_info_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 static inline void
 sched_info_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
-    if (unlikely(sched_info_on()))
+    if (sched_info_on())
         __sched_info_switch(rq, prev, next);
 }


@@ -23,17 +23,22 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
     /* we're never preempted */
 }

+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+    stop->se.exec_start = rq_clock_task(rq);
+}
+
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
     struct task_struct *stop = rq->stop;

+    WARN_ON_ONCE(prev || rf);
+
     if (!stop || !task_on_rq_queued(stop))
         return NULL;

-    put_prev_task(rq, prev);
-
-    stop->se.exec_start = rq_clock_task(rq);
+    set_next_task_stop(rq, stop);

     return stop;
 }
@@ -55,7 +60,7 @@ static void yield_task_stop(struct rq *rq)
     BUG(); /* the stop task should never yield, its pointless. */
 }

-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
     struct task_struct *curr = rq->curr;
     u64 delta_exec;
@@ -86,13 +91,6 @@ static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
 {
 }

-static void set_curr_task_stop(struct rq *rq)
-{
-    struct task_struct *stop = rq->stop;
-
-    stop->se.exec_start = rq_clock_task(rq);
-}
-
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
     BUG(); /* its impossible to change to this class */
@@ -128,13 +126,13 @@ const struct sched_class stop_sched_class = {

     .pick_next_task     = pick_next_task_stop,
     .put_prev_task      = put_prev_task_stop,
+    .set_next_task      = set_next_task_stop,

 #ifdef CONFIG_SMP
     .select_task_rq     = select_task_rq_stop,
     .set_cpus_allowed   = set_cpus_allowed_common,
 #endif

-    .set_curr_task      = set_curr_task_stop,
     .task_tick          = task_tick_stop,
     .get_rr_interval    = get_rr_interval_stop,


@@ -1284,6 +1284,7 @@ static int sched_domains_curr_level;
 int             sched_max_numa_distance;
 static int      *sched_domains_numa_distance;
 static struct cpumask   ***sched_domains_numa_masks;
+int __read_mostly       node_reclaim_distance = RECLAIM_DISTANCE;
 #endif

 /*
@@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl,
         sd->flags &= ~SD_PREFER_SIBLING;
         sd->flags |= SD_SERIALIZE;
-        if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+        if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
             sd->flags &= ~(SD_BALANCE_EXEC |
                            SD_BALANCE_FORK |
                            SD_WAKE_AFFINE);
@@ -1724,6 +1725,26 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
     }
 }

+/*
+ * sched_numa_find_closest() - given the NUMA topology, find the cpu
+ * closest to @cpu from @cpumask.
+ * cpumask: cpumask to find a cpu from
+ * cpu: cpu to be close to
+ *
+ * returns: cpu, or nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
+{
+    int i, j = cpu_to_node(cpu);
+
+    for (i = 0; i < sched_domains_numa_levels; i++) {
+        cpu = cpumask_any_and(cpus, sched_domains_numa_masks[i][j]);
+        if (cpu < nr_cpu_ids)
+            return cpu;
+    }
+    return nr_cpu_ids;
+}
+
 #endif /* CONFIG_NUMA */

 static int __sdt_alloc(const struct cpumask *cpu_map)
@ -2149,16 +2170,16 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
* ndoms_new == 0 is a special case for destroying existing domains, * ndoms_new == 0 is a special case for destroying existing domains,
  * and it will not create the default domain.
  *
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
  */
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
-			     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas = false;
 	int i, j, n;
 	int new_topology;
 
-	mutex_lock(&sched_domains_mutex);
+	lockdep_assert_held(&sched_domains_mutex);
 
 	/* Always unregister in case we don't destroy any domains: */
 	unregister_sched_domain_sysctl();
@@ -2183,8 +2204,19 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
-			    dattrs_equal(dattr_cur, i, dattr_new, j))
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared. It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+				dl_clear_root_domain(rd);
+
 				goto match1;
+			}
 		}
 		/* No match - a current sched domain not in new doms_new[] */
 		detach_destroy_domains(doms_cur[i]);
@@ -2241,6 +2273,15 @@ match3:
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+}
+
+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+			     struct sched_domain_attr *dattr_new)
+{
+	mutex_lock(&sched_domains_mutex);
+	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
 	mutex_unlock(&sched_domains_mutex);
 }
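The hunk above is a common kernel refactoring: the body moves into a `_locked` variant that asserts the caller already holds `sched_domains_mutex` via `lockdep_assert_held()`, and the original name becomes a thin wrapper that takes the lock itself, so the cpuset paths that already hold the mutex can call straight in without global serialization. A minimal userspace sketch of the same pattern, using pthreads and a hand-rolled held flag in place of lockdep (all names here are illustrative, not from the kernel):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t domains_mutex = PTHREAD_MUTEX_INITIALIZER;
static int mutex_held;	/* stand-in for lockdep's "is this lock held?" state */
static int ndoms;	/* stand-in for the shared sched-domain state */

/* Core logic: the caller must already hold domains_mutex. */
static void rebuild_domains_locked(int ndoms_new)
{
	assert(mutex_held);	/* mirrors lockdep_assert_held() */
	ndoms = ndoms_new;
}

/* Convenience wrapper for callers that do not hold the lock yet. */
static void rebuild_domains(int ndoms_new)
{
	pthread_mutex_lock(&domains_mutex);
	mutex_held = 1;
	rebuild_domains_locked(ndoms_new);
	mutex_held = 0;
	pthread_mutex_unlock(&domains_mutex);
}
```

A caller that already holds the mutex calls the `_locked` variant directly; everyone else keeps using the wrapper unchanged, so no existing call site has to learn about the lock.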


@@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 	 */
 	preempt_disable();
 	stop_cpus_in_progress = true;
+	barrier();
 	for_each_cpu(cpu, cpumask) {
 		work = &per_cpu(cpu_stopper.stop_work, cpu);
 		work->fn = fn;
@@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
+	barrier();
 	stop_cpus_in_progress = false;
 	preempt_enable();
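`barrier()` here is the kernel's compiler barrier: an empty asm statement with a `"memory"` clobber that stops the compiler from moving memory accesses across it, without emitting any CPU instruction. In this hunk it pins the `stop_cpus_in_progress` flag updates on either side of the queueing loop. A standalone sketch of the construct, assuming a GCC-compatible compiler (the flag and queue names are illustrative):

```c
#include <assert.h>

/* Compiler-only barrier, as in the kernel's <linux/compiler.h>. */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int queue_in_progress;
static int queued_items;

static void queue_items(int n)
{
	queue_in_progress = 1;
	barrier();	/* the flag store cannot sink below the loop */
	for (int i = 0; i < n; i++)
		queued_items++;
	barrier();	/* the loop cannot sink below the flag clear */
	queue_in_progress = 0;
}
```

Note this only constrains the compiler; CPU-level ordering, where it is needed, comes from the locking inside the queueing path itself.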


@@ -146,7 +146,7 @@ config FUNCTION_TRACER
 	select GENERIC_TRACER
 	select CONTEXT_SWITCH_TRACER
 	select GLOB
-	select TASKS_RCU if PREEMPT
+	select TASKS_RCU if PREEMPTION
 	help
 	  Enable the kernel to trace every kernel function. This is done
 	  by using a compiler feature to insert a small, 5-byte No-Operation
@@ -179,7 +179,7 @@ config TRACE_PREEMPT_TOGGLE
 config PREEMPTIRQ_EVENTS
 	bool "Enable trace events for preempt and irq disable/enable"
 	select TRACE_IRQFLAGS
-	select TRACE_PREEMPT_TOGGLE if PREEMPT
+	select TRACE_PREEMPT_TOGGLE if PREEMPTION
 	select GENERIC_TRACER
 	default n
 	help
@@ -214,7 +214,7 @@ config PREEMPT_TRACER
 	bool "Preemption-off Latency Tracer"
 	default n
 	depends on !ARCH_USES_GETTIMEOFFSET
-	depends on PREEMPT
+	depends on PREEMPTION
 	select GENERIC_TRACER
 	select TRACER_MAX_TRACE
 	select RING_BUFFER_ALLOW_SWAP


@@ -2814,7 +2814,7 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command)
 		 * synchornize_rcu_tasks() will wait for those tasks to
 		 * execute and either schedule voluntarily or enter user space.
 		 */
-		if (IS_ENABLED(CONFIG_PREEMPT))
+		if (IS_ENABLED(CONFIG_PREEMPTION))
 			synchronize_rcu_tasks();
 
 free_ops:
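These `IS_ENABLED()` conversions compile cleanly even in configs where the option being tested does not exist, because the macro evaluates to a compile-time 0 for undefined config symbols. A simplified standalone version of the trick from the kernel's include/linux/kconfig.h (the real `IS_ENABLED()` additionally checks the `_MODULE` variant for tristate options; `CONFIG_FAKE_OPTION` below is a made-up name for demonstration):

```c
#include <assert.h>

/* Simplified from include/linux/kconfig.h (module handling omitted). */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_FAKE_OPTION 1	/* pretend this option is set to =y */
/* CONFIG_OTHER_OPTION deliberately left undefined */
```

If the option is defined as `1`, token pasting produces `__ARG_PLACEHOLDER_1`, which expands to `0,` and shifts `1` into the second-argument slot; if it is undefined, the pasted token stays as junk in the first slot and `0` is selected.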


@@ -267,7 +267,7 @@ static void ring_buffer_producer(void)
 		if (consumer && !(cnt % wakeup_interval))
 			wake_up_process(consumer);
 
-#ifndef CONFIG_PREEMPT
+#ifndef CONFIG_PREEMPTION
 		/*
 		 * If we are a non preempt kernel, the 10 second run will
 		 * stop everything while it runs. Instead, we will call


@@ -255,12 +255,12 @@ void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
 	local_save_flags(fbuffer->flags);
 	fbuffer->pc = preempt_count();
 	/*
-	 * If CONFIG_PREEMPT is enabled, then the tracepoint itself disables
+	 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
 	 * preemption (adding one to the preempt_count). Since we are
 	 * interested in the preempt_count at the time the tracepoint was
 	 * hit, we need to subtract one to offset the increment.
 	 */
-	if (IS_ENABLED(CONFIG_PREEMPT))
+	if (IS_ENABLED(CONFIG_PREEMPTION))
 		fbuffer->pc--;
 	fbuffer->trace_file = trace_file;
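The `pc--` adjustment exists because, on preemptible kernels, the tracepoint wrapper has already disabled preemption by the time the handler samples `preempt_count()`, inflating the recorded value by one. A toy model of that bookkeeping (the counter and function names are illustrative, not the kernel's preempt_count implementation):

```c
#include <assert.h>

#define PREEMPTION_ENABLED 1	/* stand-in for IS_ENABLED(CONFIG_PREEMPTION) */

static int preempt_count_val;

/* Models the tracepoint path: preemption is disabled around the handler. */
static int sample_pc_at_tracepoint(void)
{
	int pc;

	preempt_count_val++;		/* the wrapper's preempt_disable() */
	pc = preempt_count_val;
	if (PREEMPTION_ENABLED)
		pc--;			/* undo the wrapper's own increment */
	preempt_count_val--;		/* the wrapper's preempt_enable() */
	return pc;
}
```

With the offset, the sampled value reflects the count at the instant the tracepoint was hit, whether or not the surrounding code had preemption disabled.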


@@ -579,8 +579,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
 	else
 		tracing_dl = 0;
 
-	wakeup_task = p;
-	get_task_struct(wakeup_task);
+	wakeup_task = get_task_struct(p);
 
 	local_save_flags(flags);
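The two-lines-to-one change leans on `get_task_struct()` returning its argument, so taking the reference and publishing the pointer happen in a single expression rather than leaving a window where the global points at a task without a reference held. A toy refcount sketch of the idiom (the struct and helper are illustrative, not the kernel's):

```c
#include <assert.h>

struct task {
	int refcount;
};

/* Like the kernel's get_task_struct(): bump the count, return the pointer. */
static struct task *get_task(struct task *t)
{
	t->refcount++;
	return t;
}

static struct task *wakeup_task;

static void track(struct task *p)
{
	/* One statement: wakeup_task never points at an unreferenced task. */
	wakeup_task = get_task(p);
}
```

Get-style helpers that return their argument make this composition possible throughout the kernel, which is why the pattern reads `x = get_foo(p)` rather than `x = p; get_foo(x);`.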


@@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid)
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		if (!khugepaged_node_load[i])
 			continue;
-		if (node_distance(nid, i) > RECLAIM_DISTANCE)
+		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
 	}
 	return false;


@@ -3511,7 +3511,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
-				RECLAIM_DISTANCE;
+				node_reclaim_distance;
 }
 #else	/* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
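Both mm hunks swap the compile-time `RECLAIM_DISTANCE` constant for the runtime variable `node_reclaim_distance`, which architecture code can raise on machines (such as AMD EPYC, per the balancing improvements in this pull) whose inter-node distances exceed the default cutoff. A sketch of the resulting behavior with a made-up two-node distance table:

```c
#include <assert.h>

#define RECLAIM_DISTANCE 30			/* the old compile-time cutoff */
static int node_reclaim_distance = RECLAIM_DISTANCE;	/* now tunable at runtime */

/* Toy SLIT-style table: local distance 10, remote distance 32. */
static int node_distance(int a, int b)
{
	return (a == b) ? 10 : 32;
}

/* Remote reclaim is allowed only for nodes within the cutoff. */
static int zone_allows_reclaim(int local_nid, int nid)
{
	return node_distance(local_nid, nid) <= node_reclaim_distance;
}
```

With the default cutoff the distance-32 remote node is excluded; once arch code raises `node_reclaim_distance`, the same node qualifies without a kernel rebuild.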