440 lines
16 KiB
ReStructuredText
440 lines
16 KiB
ReStructuredText
|
=========================
|
||
|
Capacity Aware Scheduling
|
||
|
=========================
|
||
|
|
||
|
1. CPU Capacity
|
||
|
===============
|
||
|
|
||
|
1.1 Introduction
|
||
|
----------------
|
||
|
|
||
|
Conventional, homogeneous SMP platforms are composed of purely identical
|
||
|
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
|
||
|
different performance characteristics - on such platforms, not all CPUs can be
|
||
|
considered equal.
|
||
|
|
||
|
CPU capacity is a measure of the performance a CPU can reach, normalized against
|
||
|
the most performant CPU in the system. Heterogeneous systems are also called
|
||
|
asymmetric CPU capacity systems, as they contain CPUs of different capacities.
|
||
|
|
||
|
Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
|
||
|
from two factors:
|
||
|
|
||
|
- not all CPUs may have the same microarchitecture (µarch).
|
||
|
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
|
||
|
physically able to attain the higher Operating Performance Points (OPP).
|
||
|
|
||
|
Arm big.LITTLE systems are an example of both. The big CPUs are more
|
||
|
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
|
||
|
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
|
||
|
can.
|
||
|
|
||
|
CPU performance is usually expressed in Millions of Instructions Per Second
|
||
|
(MIPS), which can also be expressed as a given amount of instructions attainable
|
||
|
per Hz, leading to::
|
||
|
|
||
|
capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
|
||
|
|
||
|
1.2 Scheduler terms
|
||
|
-------------------
|
||
|
|
||
|
Two different capacity values are used within the scheduler. A CPU's
|
||
|
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
|
||
|
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
|
||
|
which some loss of available performance (e.g. time spent handling IRQs) is
|
||
|
subtracted.
|
||
|
|
||
|
Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
|
||
|
while ``capacity_orig`` is class-agnostic. The rest of this document will use
|
||
|
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
|
||
|
brevity.
|
||
|
|
||
|
1.3 Platform examples
|
||
|
---------------------
|
||
|
|
||
|
1.3.1 Identical OPPs
|
||
|
~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
Consider an hypothetical dual-core asymmetric CPU capacity system where
|
||
|
|
||
|
- work_per_hz(CPU0) = W
|
||
|
- work_per_hz(CPU1) = W/2
|
||
|
- all CPUs are running at the same fixed frequency
|
||
|
|
||
|
By the above definition of capacity:
|
||
|
|
||
|
- capacity(CPU0) = C
|
||
|
- capacity(CPU1) = C/2
|
||
|
|
||
|
To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
|
||
|
be a LITTLE.
|
||
|
|
||
|
With a workload that periodically does a fixed amount of work, you will get an
|
||
|
execution trace like so::
|
||
|
|
||
|
CPU0 work ^
|
||
|
| ____ ____ ____
|
||
|
| | | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
CPU1 work ^
|
||
|
| _________ _________ ____
|
||
|
| | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
CPU0 has the highest capacity in the system (C), and completes a fixed amount of
|
||
|
work W in T units of time. On the other hand, CPU1 has half the capacity of
|
||
|
CPU0, and thus only completes W/2 in T.
|
||
|
|
||
|
1.3.2 Different max OPPs
|
||
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
Usually, CPUs of different capacity values also have different maximum
|
||
|
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
|
||
|
|
||
|
- max_freq(CPU0) = F
|
||
|
- max_freq(CPU1) = 2/3 * F
|
||
|
|
||
|
This yields:
|
||
|
|
||
|
- capacity(CPU0) = C
|
||
|
- capacity(CPU1) = C/3
|
||
|
|
||
|
Executing the same workload as described in 1.3.1, which each CPU running at its
|
||
|
maximum frequency results in::
|
||
|
|
||
|
CPU0 work ^
|
||
|
| ____ ____ ____
|
||
|
| | | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
workload on CPU1
|
||
|
CPU1 work ^
|
||
|
| ______________ ______________ ____
|
||
|
| | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
1.4 Representation caveat
|
||
|
-------------------------
|
||
|
|
||
|
It should be noted that having a *single* value to represent differences in CPU
|
||
|
performance is somewhat of a contentious point. The relative performance
|
||
|
difference between two different µarchs could be X% on integer operations, Y% on
|
||
|
floating point operations, Z% on branches, and so on. Still, results using this
|
||
|
simple approach have been satisfactory for now.
|
||
|
|
||
|
2. Task utilization
|
||
|
===================
|
||
|
|
||
|
2.1 Introduction
|
||
|
----------------
|
||
|
|
||
|
Capacity aware scheduling requires an expression of a task's requirements with
|
||
|
regards to CPU capacity. Each scheduler class can express this differently, and
|
||
|
while task utilization is specific to CFS, it is convenient to describe it here
|
||
|
in order to introduce more generic concepts.
|
||
|
|
||
|
Task utilization is a percentage meant to represent the throughput requirements
|
||
|
of a task. A simple approximation of it is the task's duty cycle, i.e.::
|
||
|
|
||
|
task_util(p) = duty_cycle(p)
|
||
|
|
||
|
On an SMP system with fixed frequencies, 100% utilization suggests the task is a
|
||
|
busy loop. Conversely, 10% utilization hints it is a small periodic task that
|
||
|
spends more time sleeping than executing. Variable CPU frequencies and
|
||
|
asymmetric CPU capacities complexify this somewhat; the following sections will
|
||
|
expand on these.
|
||
|
|
||
|
2.2 Frequency invariance
|
||
|
------------------------
|
||
|
|
||
|
One issue that needs to be taken into account is that a workload's duty cycle is
|
||
|
directly impacted by the current OPP the CPU is running at. Consider running a
|
||
|
periodic workload at a given frequency F::
|
||
|
|
||
|
CPU work ^
|
||
|
| ____ ____ ____
|
||
|
| | | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
This yields duty_cycle(p) == 25%.
|
||
|
|
||
|
Now, consider running the *same* workload at frequency F/2::
|
||
|
|
||
|
CPU work ^
|
||
|
| _________ _________ ____
|
||
|
| | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
This yields duty_cycle(p) == 50%, despite the task having the exact same
|
||
|
behaviour (i.e. executing the same amount of work) in both executions.
|
||
|
|
||
|
The task utilization signal can be made frequency invariant using the following
|
||
|
formula::
|
||
|
|
||
|
task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
|
||
|
|
||
|
Applying this formula to the two examples above yields a frequency invariant
|
||
|
task utilization of 25%.
|
||
|
|
||
|
2.3 CPU invariance
|
||
|
------------------
|
||
|
|
||
|
CPU capacity has a similar effect on task utilization in that running an
|
||
|
identical workload on CPUs of different capacity values will yield different
|
||
|
duty cycles.
|
||
|
|
||
|
Consider the system described in 1.3.2., i.e.::
|
||
|
|
||
|
- capacity(CPU0) = C
|
||
|
- capacity(CPU1) = C/3
|
||
|
|
||
|
Executing a given periodic workload on each CPU at their maximum frequency would
|
||
|
result in::
|
||
|
|
||
|
CPU0 work ^
|
||
|
| ____ ____ ____
|
||
|
| | | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
CPU1 work ^
|
||
|
| ______________ ______________ ____
|
||
|
| | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
|
||
|
IOW,
|
||
|
|
||
|
- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
|
||
|
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
|
||
|
|
||
|
The task utilization signal can be made CPU invariant using the following
|
||
|
formula::
|
||
|
|
||
|
task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
|
||
|
|
||
|
with ``max_capacity`` being the highest CPU capacity value in the
|
||
|
system. Applying this formula to the above example above yields a CPU
|
||
|
invariant task utilization of 25%.
|
||
|
|
||
|
2.4 Invariant task utilization
|
||
|
------------------------------
|
||
|
|
||
|
Both frequency and CPU invariance need to be applied to task utilization in
|
||
|
order to obtain a truly invariant signal. The pseudo-formula for a task
|
||
|
utilization that is both CPU and frequency invariant is thus, for a given
|
||
|
task p::
|
||
|
|
||
|
curr_frequency(cpu) capacity(cpu)
|
||
|
task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
|
||
|
max_frequency(cpu) max_capacity
|
||
|
|
||
|
In other words, invariant task utilization describes the behaviour of a task as
|
||
|
if it were running on the highest-capacity CPU in the system, running at its
|
||
|
maximum frequency.
|
||
|
|
||
|
Any mention of task utilization in the following sections will imply its
|
||
|
invariant form.
|
||
|
|
||
|
2.5 Utilization estimation
|
||
|
--------------------------
|
||
|
|
||
|
Without a crystal ball, task behaviour (and thus task utilization) cannot
|
||
|
accurately be predicted the moment a task first becomes runnable. The CFS class
|
||
|
maintains a handful of CPU and task signals based on the Per-Entity Load
|
||
|
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
|
||
|
opposed to instantaneous).
|
||
|
|
||
|
This means that while the capacity aware scheduling criteria will be written
|
||
|
considering a "true" task utilization (using a crystal ball), the implementation
|
||
|
will only ever be able to use an estimator thereof.
|
||
|
|
||
|
3. Capacity aware scheduling requirements
|
||
|
=========================================
|
||
|
|
||
|
3.1 CPU capacity
|
||
|
----------------
|
||
|
|
||
|
Linux cannot currently figure out CPU capacity on its own, this information thus
|
||
|
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
|
||
|
for that purpose.
|
||
|
|
||
|
The arm and arm64 architectures directly map this to the arch_topology driver
|
||
|
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
|
||
|
Documentation/devicetree/bindings/arm/cpu-capacity.txt.
|
||
|
|
||
|
3.2 Frequency invariance
|
||
|
------------------------
|
||
|
|
||
|
As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
|
||
|
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
|
||
|
purpose.
|
||
|
|
||
|
Implementing this function requires figuring out at which frequency each CPU
|
||
|
have been running at. One way to implement this is to leverage hardware counters
|
||
|
whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
|
||
|
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
|
||
|
when the kernel is aware of the switched-to frequency (also employed by
|
||
|
arm/arm64).
|
||
|
|
||
|
4. Scheduler topology
|
||
|
=====================
|
||
|
|
||
|
During the construction of the sched domains, the scheduler will figure out
|
||
|
whether the system exhibits asymmetric CPU capacities. Should that be the
|
||
|
case:
|
||
|
|
||
|
- The sched_asym_cpucapacity static key will be enabled.
|
||
|
- The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that
|
||
|
spans all unique CPU capacity values.
|
||
|
|
||
|
The sched_asym_cpucapacity static key is intended to guard sections of code that
|
||
|
cater to asymmetric CPU capacity systems. Do note however that said key is
|
||
|
*system-wide*. Imagine the following setup using cpusets::
|
||
|
|
||
|
capacity C/2 C
|
||
|
________ ________
|
||
|
/ \ / \
|
||
|
CPUs 0 1 2 3 4 5 6 7
|
||
|
\__/ \______________/
|
||
|
cpusets cs0 cs1
|
||
|
|
||
|
Which could be created via:
|
||
|
|
||
|
.. code-block:: sh
|
||
|
|
||
|
mkdir /sys/fs/cgroup/cpuset/cs0
|
||
|
echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
|
||
|
echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
|
||
|
|
||
|
mkdir /sys/fs/cgroup/cpuset/cs1
|
||
|
echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
|
||
|
echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
|
||
|
|
||
|
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
|
||
|
|
||
|
Since there *is* CPU capacity asymmetry in the system, the
|
||
|
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
|
||
|
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
|
||
|
set in that hierarchy, it describes an SMP island and should be treated as such.
|
||
|
|
||
|
Therefore, the 'canonical' pattern for protecting codepaths that cater to
|
||
|
asymmetric CPU capacities is to:
|
||
|
|
||
|
- Check the sched_asym_cpucapacity static key
|
||
|
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
|
||
|
the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
|
||
|
CPU or group thereof)
|
||
|
|
||
|
5. Capacity aware scheduling implementation
|
||
|
===========================================
|
||
|
|
||
|
5.1 CFS
|
||
|
-------
|
||
|
|
||
|
5.1.1 Capacity fitness
|
||
|
~~~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
The main capacity scheduling criterion of CFS is::
|
||
|
|
||
|
task_util(p) < capacity(task_cpu(p))
|
||
|
|
||
|
This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
|
||
|
task "fits" on its CPU. If it is violated, the task will need to achieve more
|
||
|
work than what its CPU can provide: it will be CPU-bound.
|
||
|
|
||
|
Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
|
||
|
value for a task, either via sched_setattr() or via the cgroup interface (see
|
||
|
Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
|
||
|
clamp task_util() in the previous criterion.
|
||
|
|
||
|
5.1.2 Wakeup CPU selection
|
||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
CFS task wakeup CPU selection follows the capacity fitness criterion described
|
||
|
above. On top of that, uclamp is used to clamp the task utilization values,
|
||
|
which lets userspace have more leverage over the CPU selection of CFS
|
||
|
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
|
||
|
|
||
|
clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
|
||
|
|
||
|
By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
|
||
|
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
|
||
|
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
|
||
|
giving it a high uclamp.min value.
|
||
|
|
||
|
.. note::
|
||
|
|
||
|
Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
|
||
|
(EAS), which is described in Documentation/scheduling/sched-energy.rst.
|
||
|
|
||
|
5.1.3 Load balancing
|
||
|
~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
A pathological case in the wakeup CPU selection occurs when a task rarely
|
||
|
sleeps, if at all - it thus rarely wakes up, if at all. Consider::
|
||
|
|
||
|
w == wakeup event
|
||
|
|
||
|
capacity(CPU0) = C
|
||
|
capacity(CPU1) = C / 3
|
||
|
|
||
|
workload on CPU0
|
||
|
CPU work ^
|
||
|
| _________ _________ ____
|
||
|
| | | | | |
|
||
|
+----+----+----+----+----+----+----+----+----+----+-> time
|
||
|
w w w
|
||
|
|
||
|
workload on CPU1
|
||
|
CPU work ^
|
||
|
| ____________________________________________
|
||
|
| |
|
||
|
+----+----+----+----+----+----+----+----+----+----+->
|
||
|
w
|
||
|
|
||
|
This workload should run on CPU0, but if the task either:
|
||
|
|
||
|
- was improperly scheduled from the start (inaccurate initial
|
||
|
utilization estimation)
|
||
|
- was properly scheduled from the start, but suddenly needs more
|
||
|
processing power
|
||
|
|
||
|
then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
|
||
|
the CPU capacity scheduling criterion is violated, and there may not be any more
|
||
|
wakeup event to fix this up via wakeup CPU selection.
|
||
|
|
||
|
Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
|
||
|
put in place to handle this shares the same name. Misfit task migration
|
||
|
leverages the CFS load balancer, more specifically the active load balance part
|
||
|
(which caters to migrating currently running tasks). When load balance happens,
|
||
|
a misfit active load balance will be triggered if a misfit task can be migrated
|
||
|
to a CPU with more capacity than its current one.
|
||
|
|
||
|
5.2 RT
|
||
|
------
|
||
|
|
||
|
5.2.1 Wakeup CPU selection
|
||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
RT task wakeup CPU selection searches for a CPU that satisfies::
|
||
|
|
||
|
task_uclamp_min(p) <= capacity(task_cpu(cpu))
|
||
|
|
||
|
while still following the usual priority constraints. If none of the candidate
|
||
|
CPUs can satisfy this capacity criterion, then strict priority based scheduling
|
||
|
is followed and CPU capacities are ignored.
|
||
|
|
||
|
5.3 DL
|
||
|
------
|
||
|
|
||
|
5.3.1 Wakeup CPU selection
|
||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
|
||
|
DL task wakeup CPU selection searches for a CPU that satisfies::
|
||
|
|
||
|
task_bandwidth(p) < capacity(task_cpu(p))
|
||
|
|
||
|
while still respecting the usual bandwidth and deadline constraints. If
|
||
|
none of the candidate CPUs can satisfy this capacity criterion, then the
|
||
|
task will remain on its current CPU.
|