sched: better rt-group documentation
Viktor was nice enough to enhance the document based on my replies to his questions on the subject.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
parent
c24b7c5244
commit
b9b158fe1c
|
@ -1,59 +1,177 @@
|
|||
Real-Time group scheduling
|
||||
--------------------------
|
||||
|
||||
CONTENTS
|
||||
========
|
||||
|
||||
1. Overview
|
||||
1.1 The problem
|
||||
1.2 The solution
|
||||
2. The interface
|
||||
2.1 System-wide settings
|
||||
2.2 Default behaviour
|
||||
2.3 Basis for grouping tasks
|
||||
3. Future plans
|
||||
|
||||
|
||||
Real-Time group scheduling.
|
||||
|
||||
The problem space:
|
||||
|
||||
In order to schedule multiple groups of realtime tasks each group must
|
||||
be assigned a fixed portion of the CPU time available. Without a minimum
|
||||
guarantee a realtime group can obviously fall short. A fuzzy upper limit
|
||||
is of no use since it cannot be relied upon. Which leaves us with just
|
||||
the single fixed portion.
|
||||
|
||||
CPU time is divided by means of specifying how much time can be spent
|
||||
running in a given period. Say a frame fixed realtime renderer must
|
||||
deliver 25 frames a second, which yields a period of 0.04s. Now say
|
||||
it will also have to play some music and respond to input, leaving it
|
||||
with around 80% for the graphics. We can then give this group a runtime
|
||||
of 0.8 * 0.04s = 0.032s.
|
||||
|
||||
This way the graphics group will have a 0.04s period with a 0.032s runtime
|
||||
limit.
|
||||
|
||||
Now if the audio thread needs to refill the DMA buffer every 0.005s, but
|
||||
needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s
|
||||
= 0.00015s.
|
||||
1. Overview
|
||||
===========
|
||||
|
||||
|
||||
The Interface:
|
||||
1.1 The problem
|
||||
---------------
|
||||
|
||||
system wide:
|
||||
Realtime scheduling is all about determinism: a group has to be able to rely on
|
||||
the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
|
||||
multiple groups of realtime tasks, each group must be assigned a fixed portion
|
||||
of the CPU time available. Without a minimum guarantee a realtime group can
|
||||
obviously fall short. A fuzzy upper limit is of no use since it cannot be
|
||||
relied upon. Which leaves us with just the single fixed portion.
|
||||
|
||||
/proc/sys/kernel/sched_rt_period_ms
|
||||
/proc/sys/kernel/sched_rt_runtime_us
|
||||
1.2 The solution
|
||||
----------------
|
||||
|
||||
CONFIG_FAIR_USER_SCHED
|
||||
CPU time is divided by means of specifying how much time can be spent running
|
||||
in a given period. We allocate this "run time" for each realtime group which
|
||||
the other realtime groups will not be permitted to use.
|
||||
|
||||
/sys/kernel/uids/<uid>/cpu_rt_runtime_us
|
||||
Any time not allocated to a realtime group will be used to run normal priority
|
||||
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
|
||||
SCHED_OTHER.
|
||||
|
||||
or
|
||||
Let's consider an example: a frame fixed realtime renderer must deliver 25
|
||||
frames a second, which yields a period of 0.04s per frame. Now say it will also
|
||||
have to play some music and respond to input, leaving it with around 80% CPU
|
||||
time dedicated for the graphics. We can then give this group a run time of 0.8
|
||||
* 0.04s = 0.032s.
|
||||
|
||||
CONFIG_FAIR_CGROUP_SCHED
|
||||
This way the graphics group will have a 0.04s period with a 0.032s run time
|
||||
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
|
||||
needs only about 3% CPU time to do so, it can make do with 0.03 * 0.005s =
|
||||
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
|
||||
of 0.00015s.
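The arithmetic in this example can be reproduced with a short sketch. The
helper name runtime_us is hypothetical and is not part of any kernel interface;
it only converts a period and a CPU share into the microsecond values used by
the settings described in this document.

```python
def runtime_us(period_s: float, cpu_share: float) -> int:
    """Return the run time in microseconds for a group that needs
    cpu_share (0.0 to 1.0) of the CPU within each period of period_s seconds.

    Hypothetical helper for illustration only.
    """
    return round(period_s * cpu_share * 1_000_000)


# Graphics group: 25 frames per second gives a 0.04s period, 80% of it for rendering.
graphics = runtime_us(1 / 25, 0.80)

# Audio group: refill the DMA buffer every 0.005s using about 3% CPU time.
audio = runtime_us(0.005, 0.03)

print(graphics, audio)  # 32000 150
```

The results, 32000us and 150us, match the 0.032s and 0.00015s run times above.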
|
||||
|
||||
/cgroup/<cgroup>/cpu.rt_runtime_us
|
||||
The remaining CPU time will be used for user input and other tasks. Because
|
||||
realtime tasks have explicitly allocated the CPU time they need to perform
|
||||
their tasks, buffer underruns in the graphics or audio can be eliminated.
|
||||
|
||||
[ time is specified in us because the interface is s32; this gives an
|
||||
operating range of ~35m to 1us ]
|
||||
NOTE: the above example is not fully implemented as of yet (2.6.25). We still
|
||||
lack an EDF scheduler to make non-uniform periods usable.
|
||||
|
||||
The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ].
|
||||
|
||||
A runtime of -1 specifies runtime == period, ie. no limit.
|
||||
2. The Interface
|
||||
================
|
||||
|
||||
New groups get the period from /proc/sys/kernel/sched_rt_period_us and
|
||||
a runtime of 0.
|
||||
|
||||
Settings are constrained to:
|
||||
2.1 System-wide settings
|
||||
------------------------
|
||||
|
||||
The system-wide settings are configured under the /proc virtual file system:
|
||||
|
||||
/proc/sys/kernel/sched_rt_period_us:
|
||||
The scheduling period that is equivalent to 100% CPU bandwidth
|
||||
|
||||
/proc/sys/kernel/sched_rt_runtime_us:
|
||||
A global limit on how much time realtime scheduling may use. Even without
|
||||
CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
|
||||
processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
|
||||
available to all realtime groups.
|
||||
|
||||
* Time is specified in us because the interface is s32. This gives an
|
||||
operating range from 1us to about 35 minutes.
|
||||
* sched_rt_period_us takes values from 1 to INT_MAX.
|
||||
* sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
|
||||
* A run time of -1 specifies runtime == period, i.e. no limit.
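The limits listed above can be sketched as a simple range check. This is only
an illustration of the stated ranges: the function name validate is
hypothetical, and the kernel may apply further checks that are not modelled
here.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer (s32)


def validate(period_us: int, runtime_us: int) -> bool:
    """Check the value ranges stated in the text (hypothetical helper).

    period_us must lie in [1, INT_MAX]; runtime_us must lie in
    [-1, INT_MAX - 1], where -1 means "no limit".
    """
    period_ok = 1 <= period_us <= INT_MAX
    runtime_ok = -1 <= runtime_us <= INT_MAX - 1
    return period_ok and runtime_ok


# INT_MAX microseconds expressed in minutes: the "about 35 minutes" upper bound.
max_minutes = INT_MAX / 1_000_000 / 60

print(validate(1_000_000, 950_000))  # default settings
print(validate(1_000_000, -1))       # run time of -1: no limit
print(validate(0, 0))                # a period of 0 is out of range
```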
|
||||
|
||||
|
||||
2.2 Default behaviour
|
||||
---------------------
|
||||
|
||||
The default values are 1000000 (1s) for sched_rt_period_us and 950000 (0.95s)
for sched_rt_runtime_us. This leaves 0.05s of every second to be used by
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
realtime task will not lock up the machine but will leave a little time to
recover it. Setting the run time to -1 restores the old behaviour.
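As a rough illustration of the default split, the sketch below computes the
fraction of each period available to realtime tasks and the time left over for
SCHED_OTHER. The function names are hypothetical; only the two default values
come from the text.

```python
DEFAULT_PERIOD_US = 1_000_000   # sched_rt_period_us default (1s)
DEFAULT_RUNTIME_US = 950_000    # sched_rt_runtime_us default (0.95s)


def rt_share(period_us: int, runtime_us: int) -> float:
    """Fraction of each period that realtime tasks may use.

    A run time of -1 means runtime == period, i.e. no limit.
    """
    if runtime_us == -1:
        return 1.0
    return runtime_us / period_us


def other_time_us(period_us: int, runtime_us: int) -> int:
    """Microseconds per period left for SCHED_OTHER (non-RT) tasks."""
    if runtime_us == -1:
        return 0
    return period_us - runtime_us


print(rt_share(DEFAULT_PERIOD_US, DEFAULT_RUNTIME_US))
print(other_time_us(DEFAULT_PERIOD_US, DEFAULT_RUNTIME_US))
```

With the defaults this gives a 95% realtime share and 50000us (0.05s) per
second for everything else.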
|
||||
|
||||
By default all bandwidth is assigned to the root group and new groups get the
|
||||
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
|
||||
want to assign bandwidth to another group, reduce the root group's bandwidth
|
||||
and assign some or all of the difference to another group.
|
||||
|
||||
Realtime group scheduling means you have to assign a portion of total CPU
|
||||
bandwidth to the group before it will accept realtime tasks. Therefore you will
|
||||
not be able to run realtime tasks as any user other than root until you have
|
||||
done that, even if the user has the rights to run processes with realtime
|
||||
priority!
|
||||
|
||||
|
||||
2.3 Basis for grouping tasks
|
||||
----------------------------
|
||||
|
||||
There are two compile-time settings for allocating CPU bandwidth. These are
|
||||
configured using the "Basis for grouping tasks" multiple choice menu under
|
||||
General setup > Group CPU Scheduler:
|
||||
|
||||
a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id")
|
||||
|
||||
This lets you use the virtual files under
"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control the CPU time reserved for
each user.
|
||||
|
||||
The other option is:
|
||||
|
||||
b. CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
|
||||
|
||||
This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us"
|
||||
to control the CPU time reserved for each control group instead.
|
||||
|
||||
For more information on working with control groups, you should read
|
||||
Documentation/cgroups.txt as well.
|
||||
|
||||
Group settings are checked against the following limits in order to keep the configuration
|
||||
schedulable:
|
||||
|
||||
\Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
|
||||
|
||||
in order to keep the configuration schedulable.
|
||||
For now, this can be simplified to just the following (but see Future plans):
|
||||
|
||||
\Sum_{i} runtime_{i} <= global_runtime
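The simplified constraint can be sketched as a one-line check. The function
name and the sample group values below are assumptions chosen for
illustration; only the inequality itself comes from the text.

```python
def is_schedulable(group_runtimes_us, global_runtime_us):
    """Return True if the sum of all group run times fits in the global
    realtime run time (hypothetical helper mirroring the inequality above)."""
    return sum(group_runtimes_us) <= global_runtime_us


GLOBAL_RUNTIME_US = 950_000  # default sched_rt_runtime_us

# Assumed example: the root group is reduced to 850000us so that another
# group can be given 100000us.
ok = is_schedulable([850_000, 100_000], GLOBAL_RUNTIME_US)

# Giving a third group 50000us on top of that exceeds the global run time.
too_much = is_schedulable([850_000, 100_000, 50_000], GLOBAL_RUNTIME_US)

print(ok, too_much)  # True False
```

This also shows why bandwidth has to be taken away from the root group before
another group can receive any: by default the root group holds all of it.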
|
||||
|
||||
|
||||
3. Future plans
|
||||
===============
|
||||
|
||||
There is work in progress to make the scheduling period for each group
|
||||
("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
|
||||
"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
|
||||
|
||||
The constraint on the period is that a subgroup must have a smaller or
|
||||
equal period to its parent. But realistically it's not very useful _yet_
|
||||
as it's prone to starvation without deadline scheduling.
|
||||
|
||||
Consider two sibling groups A and B; both have 50% bandwidth, but A's
|
||||
period is twice the length of B's.
|
||||
|
||||
* group A: period=100000us, runtime=50000us
- this runs for 0.05s once every 0.1s

* group B: period= 50000us, runtime=25000us
- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
|
||||
|
||||
This means that currently a while (1) loop in A will run for the full period of
|
||||
B and can starve B's tasks (assuming they are of lower priority) for a whole
|
||||
period.
|
||||
|
||||
The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
|
||||
full deadline scheduling to the Linux kernel. Deadline scheduling the above
|
||||
groups and treating the end of the period as a deadline will ensure that they both
|
||||
get their allocated time.
|
||||
|
||||
Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
|
||||
the biggest challenge as the current Linux PI infrastructure is geared towards
|
||||
the limited static priority levels 0-139. With deadline scheduling you need to
|
||||
do deadline inheritance (since priority is inversely proportional to the
|
||||
deadline delta (deadline - now)).
|
||||
|
||||
This means the whole PI machinery will have to be reworked - and that is one of
|
||||
the most complex pieces of code we have.
|
||||
|
|
|
@ -328,6 +328,13 @@ config RT_GROUP_SCHED
|
|||
depends on EXPERIMENTAL
|
||||
depends on GROUP_SCHED
|
||||
default n
|
||||
help
|
||||
This feature lets you explicitly allocate real CPU bandwidth
|
||||
to users or control groups (depending on the "Basis for grouping tasks"
|
||||
setting below). If enabled, it will also make it impossible to
|
||||
schedule realtime tasks for non-root users until you allocate
|
||||
realtime bandwidth for them.
|
||||
See Documentation/sched-rt-group.txt for more information.
|
||||
|
||||
choice
|
||||
depends on GROUP_SCHED
|
||||
|
|