sched: better rt-group documentation

Viktor was nice enough to enhance the document based on my replies to his questions on the subject. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:01 +02:00 · 2008-04-19 19:45:01 +02:00 · b9b158fe1c
parent c24b7c5244
commit b9b158fe1c
2 changed files with 165 additions and 40 deletions
--- a/Documentation/scheduler/sched-rt-group.txt
+++ b/Documentation/scheduler/sched-rt-group.txt
@ -1,59 +1,177 @@
+				Real-Time group scheduling
+				--------------------------
+
+CONTENTS
+========
+
+1. Overview
+  1.1 The problem
+  1.2 The solution
+2. The interface
+  2.1 System-wide settings
+  2.2 Default behaviour
+  2.3 Basis for grouping tasks
+3. Future plans


-Real-Time group scheduling.
+1. Overview
+===========

-The problem space:

-In order to schedule multiple groups of realtime tasks each group must
-be assigned a fixed portion of the CPU time available. Without a minimum
-guarantee a realtime group can obviously fall short. A fuzzy upper limit
-is of no use since it cannot be relied upon. Which leaves us with just
-the single fixed portion.
+1.1 The problem
+---------------

-CPU time is divided by means of specifying how much time can be spent
-running in a given period. Say a frame fixed realtime renderer must
-deliver 25 frames a second, which yields a period of 0.04s. Now say
-it will also have to play some music and respond to input, leaving it
-with around 80% for the graphics. We can then give this group a runtime
-of 0.8 * 0.04s = 0.032s.
+Realtime scheduling is all about determinism, a group has to be able to rely on
+the amount of bandwidth (eg. CPU time) being constant. In order to schedule
+multiple groups of realtime tasks, each group must be assigned a fixed portion
+of the CPU time available.  Without a minimum guarantee a realtime group can
+obviously fall short. A fuzzy upper limit is of no use since it cannot be
+relied upon. Which leaves us with just the single fixed portion.
+
+1.2 The solution
+----------------
+
+CPU time is divided by means of specifying how much time can be spent running
+in a given period. We allocate this "run time" for each realtime group which
+the other realtime groups will not be permitted to use.
+
+Any time not allocated to a realtime group will be used to run normal priority
+tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
+SCHED_OTHER.
+
+Let's consider an example: a frame fixed realtime renderer must deliver 25
+frames a second, which yields a period of 0.04s per frame. Now say it will also
+have to play some music and respond to input, leaving it with around 80% CPU
+time dedicated for the graphics. We can then give this group a run time of 0.8
+* 0.04s = 0.032s.

 This way the graphics group will have a 0.04s period with a 0.032s run time
-limit.
+limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
+needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
+0.00015s. So this group can be scheduled with a period of 0.005s and a run time
+of 0.00015s.

-Now if the audio thread needs to refill the DMA buffer every 0.005s, but
-needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s
-= 0.00015s.
+The remaining CPU time will be used for user input and other tass. Because
+realtime tasks have explicitly allocated the CPU time they need to perform
+their tasks, buffer underruns in the graphocs or audio can be eliminated.
+
+NOTE: the above example is not fully implemented as of yet (2.6.25). We still
+lack an EDF scheduler to make non-uniform periods usable.


-The Interface:
+2. The Interface
+================

-system wide:

-/proc/sys/kernel/sched_rt_period_ms
-/proc/sys/kernel/sched_rt_runtime_us
+2.1 System wide settings
+------------------------

-CONFIG_FAIR_USER_SCHED
+The system wide settings are configured under the /proc virtual file system:

-/sys/kernel/uids/<uid>/cpu_rt_runtime_us
+/proc/sys/kernel/sched_rt_period_us:
+  The scheduling period that is equivalent to 100% CPU bandwidth

-or
+/proc/sys/kernel/sched_rt_runtime_us:
+  A global limit on how much time realtime scheduling may use.  Even without
+  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
+  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
+  available to all realtime groups.

-CONFIG_FAIR_CGROUP_SCHED
+  * Time is specified in us because the interface is s32. This gives an
+    operating range from 1us to about 35 minutes.
+  * sched_rt_period_us takes values from 1 to INT_MAX.
+  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+  * A run time of -1 specifies runtime == period, ie. no limit.

-/cgroup/<cgroup>/cpu.rt_runtime_us

-[ time is specified in us because the interface is s32; this gives an
-  operating range of ~35m to 1us ]
+2.2 Default behaviour
+---------------------

-The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ].
+The default values for sched_rt_period_us (1000000 or 1s) and
+sched_rt_runtime_us (950000 or 0.95s).  This gives 0.05s to be used by
+SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
+realtime tasks will not lock up the machine but leave a little time to recover
+it.  By setting runtime to -1 you'd get the old behaviour back.

-A runtime of -1 specifies runtime == period, ie. no limit.
+By default all bandwidth is assigned to the root group and new groups get the
+period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
+want to assign bandwidth to another group, reduce the root group's bandwidth
+and assign some or all of the difference to another group.

-New groups get the period from /proc/sys/kernel/sched_rt_period_us and
-a runtime of 0.
+Realtime group scheduling means you have to assign a portion of total CPU
+bandwidth to the group before it will accept realtime tasks. Therefore you will
+not be able to run realtime tasks as any user other than root until you have
+done that, even if the user has the rights to run processes with realtime
+priority!

-Settings are constrained to:
+
+2.3 Basis for grouping tasks
+----------------------------
+
+There are two compile-time settings for allocating CPU bandwidth. These are
+configured using the "Basis for grouping tasks" multiple choice menu under
+General setup > Group CPU Scheduler:
+
+a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" =  "user id")
+
+This lets you use the virtual files under
+"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for
+each user .
+
+The other option is:
+
+.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
+
+This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us"
+to control the CPU time reserved for each control group instead.
+
+For more information on working with control groups, you should read
+Documentation/cgroups.txt as well.
+
+Group settings are checked against the following limits in order to keep the configuration
+schedulable:

   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

-in order to keep the configuration schedulable.
+For now, this can be simplified to just the following (but see Future plans):
+
+   \Sum_{i} runtime_{i} <= global_runtime
+
+
+3. Future plans
+===============
+
+There is work in progress to make the scheduling period for each group
+("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
+"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
+
+The constraint on the period is that a subgroup must have a smaller or
+equal period to its parent. But realistically its not very useful _yet_
+as its prone to starvation without deadline scheduling.
+
+Consider two sibling groups A and B; both have 50% bandwidth, but A's
+period is twice the length of B's.
+
+* group A: period=100000us, runtime=10000us
+	- this runs for 0.01s once every 0.1s
+
+* group B: period= 50000us, runtime=10000us
+	- this runs for 0.01s twice every 0.1s (or once every 0.05 sec).
+
+This means that currently a while (1) loop in A will run for the full period of
+B and can starve B's tasks (assuming they are of lower priority) for a whole
+period.
+
+The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
+full deadline scheduling to the linux kernel. Deadline scheduling the above
+groups and treating end of the period as a deadline will ensure that they both
+get their allocated time.
+
+Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
+the biggest challenge as the current linux PI infrastructure is geared towards
+the limited static priority levels 0-139. With deadline scheduling you need to
+do deadline inheritance (since priority is inversely proportional to the
+deadline delta (deadline - now).
+
+This means the whole PI machinery will have to be reworked - and that is one of
+the most complex pieces of code we have.
--- a/init/Kconfig
+++ b/init/Kconfig
@ -328,6 +328,13 @@ config RT_GROUP_SCHED
 	depends on EXPERIMENTAL
 	depends on GROUP_SCHED
 	default n
+	help
+	  This feature lets you explicitly allocate real CPU bandwidth
+	  to users or control groups (depending on the "Basis for grouping tasks"
+	  setting below. If enabled, it will also make it impossible to
+	  schedule realtime tasks for non-root users until you allocate
+	  realtime bandwidth for them.
+	  See Documentation/sched-rt-group.txt for more information.

 choice
 	depends on GROUP_SCHED