=================
Scheduler Domains
=================

Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. The
->parent chain MUST be NULL-terminated, and domain structures should be per-CPU
as they are locklessly updated.

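As a minimal sketch of the ->parent chain, the following walks from a base
domain up to the NULL terminator and counts the levels. The names
``toy_domain`` and ``toy_domain_depth`` are hypothetical and exist only for
illustration; the real struct sched_domain carries many more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct sched_domain; only the ->parent link is modelled. */
struct toy_domain {
    struct toy_domain *parent;  /* NULL at the top of the hierarchy */
    const char *name;           /* label used for illustration only */
};

/* Count the levels reachable from a CPU's base domain via ->parent. */
static int toy_domain_depth(const struct toy_domain *base)
{
    const struct toy_domain *sd;
    int depth = 0;

    assert(base != NULL);
    for (sd = base; sd != NULL; sd = sd->parent)
        depth++;
    return depth;
}
```

With an SMT domain whose parent is an SMP domain whose parent is a NUMA domain,
the depth seen from the SMT base domain is 3.
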
Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system.
Strictly speaking it doesn't have to, but a smaller top-level span could lead
to a case where some CPUs will never be given tasks to run unless the CPUs
allowed mask is explicitly set. A sched domain's span means "balance process
load among these CPUs".

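The two span rules can be checked mechanically. The sketch below assumes a
system with at most 64 CPUs so that a span fits in one ``uint64_t`` bitmask;
``toy_span_domain`` and ``toy_spans_valid`` are hypothetical names, and the
kernel itself uses cpumask helpers rather than raw integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy domain: bit i of span is set when CPU i belongs to the domain. */
struct toy_span_domain {
    struct toy_span_domain *parent;  /* NULL at the top level */
    uint64_t span;
};

/* Check the span rules for the hierarchy rooted at CPU cpu's base domain. */
static bool toy_spans_valid(const struct toy_span_domain *base, int cpu)
{
    const struct toy_span_domain *sd;

    assert(base != NULL && cpu >= 0 && cpu < 64);

    /* The base domain for CPU i must span at least CPU i. */
    if (!(base->span & (UINT64_C(1) << cpu)))
        return false;

    /* Each parent's span must be a superset of its child's span. */
    for (sd = base; sd->parent != NULL; sd = sd->parent) {
        if ((sd->span & ~sd->parent->span) != 0)
            return false;
    }
    return true;
}
```

For spans 0x03 -> 0x0F -> 0xFF the check passes for CPU 0 and CPU 1, and fails
for CPU 2 because the base domain does not contain it.
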
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular, singly linked list starting from the
->groups pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read-only data after they have been set up.

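The three group invariants above can be sketched as a single pass over the
circular list. This assumes at most 64 CPUs and a correctly closed list;
``toy_group`` and ``toy_groups_valid`` are hypothetical names, not kernel
symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct sched_group. */
struct toy_group {
    struct toy_group *next;  /* circular: the last group points to the first */
    uint64_t cpumask;        /* bit i set => CPU i is in this group */
};

/*
 * groups: the domain's ->groups pointer, span: the domain's span,
 * cpu: the CPU the domain belongs to.
 */
static bool toy_groups_valid(const struct toy_group *groups, uint64_t span,
                             int cpu)
{
    const struct toy_group *g = groups;
    uint64_t seen = 0;

    assert(groups != NULL && cpu >= 0 && cpu < 64);

    /* The first group must contain the CPU that owns the domain. */
    if (!(groups->cpumask & (UINT64_C(1) << cpu)))
        return false;

    do {
        if (g->cpumask & seen)  /* two groups overlap */
            return false;
        seen |= g->cpumask;
        g = g->next;
    } while (g != groups);

    /* The union of all group masks must equal the domain's span. */
    return seen == span;
}
```

Because the list is circular, the same groups can be validated from the point
of view of another CPU by passing the group that contains that CPU as the
starting point.
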
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

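The definition of group load translates directly into a sum over the group's
cpumask. In this sketch the per-CPU load values are plain numbers supplied by
the caller; how the kernel derives a CPU's load is outside the scope of this
text. ``toy_group_load`` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the load of every CPU whose bit is set in cpumask. */
static unsigned long toy_group_load(uint64_t cpumask,
                                    const unsigned long *cpu_load,
                                    int nr_cpus)
{
    unsigned long sum = 0;
    int cpu;

    assert(cpu_load != NULL && nr_cpus >= 0 && nr_cpus <= 64);
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        if (cpumask & (UINT64_C(1) << cpu))
            sum += cpu_load[cpu];
    }
    return sum;
}
```

With per-CPU loads of 10, 20, 30 and 40, the group {0, 1} has load 30 and the
group {2, 3} has load 70, so the second group is the busier entity.
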
In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

The latter function takes two arguments: the current CPU and whether it was idle
at the time the scheduler_tick() happened. It iterates over all sched domains
our CPU is on, starting from its base domain and going up the ->parent chain.
While doing that, it checks whether the current domain has exhausted its
rebalance interval. If so, it runs load_balance() on that domain. It then checks
the parent sched_domain (if it exists), and the parent of the parent and so
forth.

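The walk described above can be sketched as a loop over the ->parent chain with
a per-domain interval check. This is a simplified model, not the kernel's
rebalance_domains(): time is a bare tick counter, the idle argument is left
out, and ``toy_tick_domain``, ``toy_balance_one`` and ``toy_rebalance_domains``
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_tick_domain {
    struct toy_tick_domain *parent;  /* NULL at the top level */
    unsigned long last_balance;      /* tick of the last balance attempt */
    unsigned long interval;          /* ticks between balance attempts */
    int balance_count;               /* how often this domain was balanced */
};

/* Stand-in for load_balance(): only records that it was called. */
static void toy_balance_one(struct toy_tick_domain *sd)
{
    sd->balance_count++;
}

/* Walk from the base domain upwards; return how many domains were balanced. */
static int toy_rebalance_domains(struct toy_tick_domain *base,
                                 unsigned long now)
{
    struct toy_tick_domain *sd;
    int balanced = 0;

    assert(base != NULL);
    for (sd = base; sd != NULL; sd = sd->parent) {
        /* Unsigned subtraction stays correct if the counter wraps. */
        if (now - sd->last_balance >= sd->interval) {
            toy_balance_one(sd);
            sd->last_balance = now;
            balanced++;
        }
    }
    return balanced;
}
```

Giving the base domain a short interval and its parent a longer one shows the
intended effect: the lower level is balanced often, the higher level rarely.
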
Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks to move is determined by the
imbalance computed earlier while iterating over this sched domain's groups.

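The three steps (busiest group, busiest runqueue, move tasks) can be sketched
on a toy model where a CPU's load is simply its number of runnable tasks. Two
things here are assumptions made for the example and are not taken from the
kernel: the imbalance rule (half the difference between the two runqueues) and
the absence of locking. All ``toy_*`` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_lb_group {
    struct toy_lb_group *next;  /* circular list; first group is the local one */
    uint64_t cpumask;
};

static unsigned int toy_lb_sum(uint64_t mask, const unsigned int *nr_running,
                               int nr_cpus)
{
    unsigned int sum = 0;
    int cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            sum += nr_running[cpu];
    return sum;
}

/* Pull tasks towards this_cpu; returns the number of tasks moved. */
static int toy_load_balance(int this_cpu, const struct toy_lb_group *groups,
                            unsigned int *nr_running, int nr_cpus)
{
    const struct toy_lb_group *g, *busiest = NULL;
    unsigned int local_load, load, busiest_load = 0;
    int cpu, src = -1, nr_move;

    assert(groups != NULL && nr_running != NULL);
    assert(nr_cpus > 0 && nr_cpus <= 64);
    assert(this_cpu >= 0 && this_cpu < nr_cpus);

    /* Step 1: find the busiest group other than the local one. */
    local_load = toy_lb_sum(groups->cpumask, nr_running, nr_cpus);
    for (g = groups->next; g != groups; g = g->next) {
        load = toy_lb_sum(g->cpumask, nr_running, nr_cpus);
        if (load > busiest_load) {
            busiest_load = load;
            busiest = g;
        }
    }
    if (busiest == NULL || busiest_load <= local_load)
        return 0;

    /* Step 2: find the busiest runqueue inside that group. */
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        if (!(busiest->cpumask & (UINT64_C(1) << cpu)))
            continue;
        if (src < 0 || nr_running[cpu] > nr_running[src])
            src = cpu;
    }
    if (src < 0 || nr_running[src] <= nr_running[this_cpu])
        return 0;

    /* Step 3: move tasks (assumed rule: half the difference; no locking). */
    nr_move = (int)((nr_running[src] - nr_running[this_cpu]) / 2);
    nr_running[src] -= (unsigned int)nr_move;
    nr_running[this_cpu] += (unsigned int)nr_move;
    return nr_move;
}
```

With groups {0, 1} and {2, 3} and runnable-task counts 1, 1, 6 and 2, a call on
behalf of CPU 0 picks the second group, then CPU 2, and moves two tasks.
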
Implementing sched domains
==========================

The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Alternatively, you could do multi-level NUMA; or an Opteron,
for example, might have just one domain covering its one NUMA level.

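To make the SMT -> SMP -> NUMA layering concrete, the sketch below computes the
span of each level for a given CPU. It assumes a regular machine in which CPUs
are numbered contiguously (threads of a core first, then cores of a node) and
at most 64 CPUs in total; real topologies need not be numbered this way. The
``toy_topology`` structure and helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_topology {
    int threads_per_core;  /* SMT siblings per physical CPU */
    int cores_per_node;    /* physical CPUs per NUMA node */
    int nr_nodes;          /* NUMA nodes in the machine */
};

enum toy_level { TOY_SMT, TOY_SMP, TOY_NUMA };

/* Build a mask with count consecutive bits starting at bit first. */
static uint64_t toy_mask_range(int first, int count)
{
    uint64_t mask = 0;
    int i;

    for (i = 0; i < count; i++)
        mask |= UINT64_C(1) << (first + i);
    return mask;
}

/* Span of the domain at the given level, as seen from CPU cpu. */
static uint64_t toy_level_span(const struct toy_topology *t, int cpu,
                               enum toy_level level)
{
    int per_core, per_node, total;

    assert(t != NULL);
    per_core = t->threads_per_core;
    per_node = per_core * t->cores_per_node;
    total = per_node * t->nr_nodes;
    assert(per_core > 0 && per_node > 0 && total > 0 && total <= 64);
    assert(cpu >= 0 && cpu < total);

    switch (level) {
    case TOY_SMT:   /* all siblings of the physical CPU */
        return toy_mask_range(cpu / per_core * per_core, per_core);
    case TOY_SMP:   /* all CPUs in the node */
        return toy_mask_range(cpu / per_node * per_node, per_node);
    default:        /* TOY_NUMA: the entire machine */
        return toy_mask_range(0, total);
    }
}
```

On a machine with 2 threads per core, 2 cores per node and 2 nodes, CPU 5 gets
the spans 0x30 (SMT), 0xF0 (SMP) and 0xFF (NUMA). The groups of each level are
then the spans of the level below: single virtual CPUs, single physical CPUs,
and whole nodes respectively.
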
The implementor should read the comments in include/linux/sched.h (struct
sched_domain fields, SD_FLAG_*, SD_*_INIT) to get an idea of the specifics and
what to tune.

Architectures may override the default SD_*_INIT flags while still using the
generic domain builder in kernel/sched/core.c if they wish to retain the
traditional SMT->SMP->NUMA topology (or some subset of that). This can be done
by #define'ing ARCH_HASH_SCHED_TUNE.

Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting its
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.

The sched-domains debugging infrastructure can be enabled by setting
CONFIG_SCHED_DEBUG. This enables an error-checking parse of the sched domains
which should catch most of the possible errors described above. It also prints
out the domain structure in a visual format.