2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* kernel/cpuset.c
|
|
|
|
*
|
|
|
|
* Processor and Memory placement constraints for sets of tasks.
|
|
|
|
*
|
|
|
|
* Copyright (C) 2003 BULL SA.
|
2007-10-19 14:40:20 +08:00
|
|
|
* Copyright (C) 2004-2007 Silicon Graphics, Inc.
|
2007-10-19 14:39:39 +08:00
|
|
|
* Copyright (C) 2006 Google, Inc
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* Portions derived from Patrick Mochel's sysfs code.
|
|
|
|
* sysfs is Copyright (c) 2001-3 Patrick Mochel
|
|
|
|
*
|
[PATCH] cpuset memory spread basic implementation
This patch provides the implementation and cpuset interface for an alternative
memory allocation policy that can be applied to certain kinds of memory
allocations, such as the page cache (file system buffers) and some slab caches
(such as inode caches).
The policy is called "memory spreading." If enabled, it spreads out these
kinds of memory allocations over all the nodes allowed to a task, instead of
preferring to place them on the node where the task is executing.
All other kinds of allocations, including anonymous pages for a task's stack
and data regions, are not affected by this policy choice, and continue to be
allocated preferring the node local to execution, as modified by the NUMA
mempolicy.
There are two boolean flag files per cpuset that control where the kernel
allocates pages for the file system buffers and related in-kernel data
structures. They are called 'memory_spread_page' and 'memory_spread_slab'.
If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
kernel will spread the file system buffers (page cache) evenly over all the
nodes that the faulting task is allowed to use, instead of preferring to put
those pages on the node where the task is running.
If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
kernel will spread some file system related slab caches, such as those for
inodes and dentries, evenly over all the nodes that the faulting task is
allowed to use, instead of preferring to put those pages on the node where
the task is running.
The implementation is simple. Setting the cpuset flags 'memory_spread_page'
or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
subsequently joins that cpuset. In subsequent patches, the page allocation
calls for the affected page cache and slab caches are modified to perform an
inline check for these flags, and if set, a call to a new routine
cpuset_mem_spread_node() returns the node to prefer for the allocation.
The cpuset_mem_spread_node() routine is also simple. It uses the value of a
per-task rotor cpuset_mem_spread_rotor to select the next node in the current
task's mems_allowed to prefer for the allocation.
This policy can provide substantial improvements for jobs that need to place
thread local data on the corresponding node, but that need to access large
file system data sets that need to be spread across the several nodes in the
job's cpuset in order to fit. Without this patch, especially for jobs that
might have one thread reading in the data set, the memory allocation across
the nodes in the job's cpuset can become very uneven.
A couple of Copyright year ranges are updated as well. And a couple of email
addresses that can be found in the MAINTAINERS file are removed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 19:16:03 +08:00
|
|
|
* 2003-10-10 Written by Simon Derr.
|
2005-04-17 06:20:36 +08:00
|
|
|
* 2003-10-22 Updates by Stephen Hemminger.
|
2006-03-24 19:16:03 +08:00
|
|
|
* 2004 May-July Rework by Paul Jackson.
|
2007-10-19 14:39:39 +08:00
|
|
|
* 2006 Rework by Paul Menage to use generic cgroups
|
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all misformatted comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested it with the suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12 05:33:53 +08:00
|
|
|
* 2008 Rework of the scheduler domains and CPU hotplug handling
|
|
|
|
* by Max Krasnyansky
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* This file is subject to the terms and conditions of the GNU General Public
|
|
|
|
* License. See the file COPYING in the main directory of the Linux
|
|
|
|
* distribution for more details.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/cpumask.h>
|
|
|
|
#include <linux/cpuset.h>
|
|
|
|
#include <linux/err.h>
|
|
|
|
#include <linux/errno.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/fs.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/interrupt.h>
|
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/kmod.h>
|
|
|
|
#include <linux/list.h>
|
[PATCH] cpusets: automatic numa mempolicy rebinding
This patch automatically updates a task's NUMA mempolicy when its cpuset
memory placement changes. It does so within the context of the task,
without any need to support low level external mempolicy manipulation.
If a system is not using cpusets, or if running on a system with just the
root (all-encompassing) cpuset, then this remap is a no-op. Only when a
task is moved between cpusets, or a cpuset's memory placement is changed,
does the following apply. Otherwise, the main routine below,
rebind_policy(), is not even called.
When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a job's processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a job's
memory placement within its allowed cpuset.
However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system wide numbering, not cpuset relative numbering.
This is ok until a job is migrated to a different cpuset, or what's the
same, a job's cpuset is moved to different CPUs and Memory Nodes.
Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated, to preserve their cpuset-relative position. This can
be done for CPU affinity using sched_setaffinity() from user code, as one
task can modify another's CPU affinity. This cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.
However, it is easy enough to remap a task's NUMA mempolicy automatically
when a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a task's memory placement after its cpuset has changed. All that
is needed is the old and new nodemask, and notice to the task that it needs
to rebind its mempolicy. The task's mems_allowed has the old mask, the
task's cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice. The
bitmap/cpumask/nodemask remap operators provide the cpuset relative
calculations.
This patch leaves open a couple of issues:
1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
These mempolicies may reference nodes outside of those allowed to
the current task by its cpuset. Tasks are migrated as part of jobs,
which reside on what might be several cpusets in a subtree. When such
a job is migrated, all NUMA memory policy references to nodes within
that cpuset subtree should be translated, and references to any nodes
outside that subtree should be left untouched. A future patch will
provide the cpuset mechanism needed to mark such subtrees. With that
patch, we will be able to correctly migrate these other memory policies
across a job migration.
2) Updating cpuset, affinity and memory policies in user space:
This is harder. Any placement state stored in user space using
system-wide numbering will be invalidated across a migration. More
work will be required to provide user code with a migration-safe means
to manage its cpuset relative placement, while preserving the current
APIs that pass system wide numbers, not cpuset relative numbers across
the kernel-user boundary.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31 07:02:36 +08:00
|
|
|
#include <linux/mempolicy.h>
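The cpuset-relative remapping described in the commit message above can be illustrated in user space. This is a minimal sketch, not the kernel's code: it assumes at most 64 nodes held in a `uint64_t`, and every name in it (`remap_nodes`, `pos_to_ord`, `ord_to_pos`, `popcount64`) is hypothetical. The rule it assumes is that the n-th node of the old mask maps to the n-th node of the new mask, wrapping when the new mask is smaller, and that nodes outside the old mask are left untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits (hypothetical helper, not a kernel API). */
static int popcount64(uint64_t x)
{
	int n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Ordinal of bit `pos` among the set bits of `mask`, or -1 if unset. */
static int pos_to_ord(uint64_t mask, int pos)
{
	assert(pos >= 0 && pos < 64);
	if (!(mask & (UINT64_C(1) << pos)))
		return -1;
	return popcount64(mask & ((UINT64_C(1) << pos) - 1));
}

/* Position of the ord-th (0-based) set bit; caller ensures it exists. */
static int ord_to_pos(uint64_t mask, int ord)
{
	for (int pos = 0; pos < 64; pos++)
		if ((mask & (UINT64_C(1) << pos)) && ord-- == 0)
			return pos;
	return -1;
}

/*
 * Translate the nodes in `policy` from positions relative to `old_mems`
 * to the same relative positions in `new_mems`.
 */
static uint64_t remap_nodes(uint64_t policy, uint64_t old_mems,
			    uint64_t new_mems)
{
	int w = popcount64(new_mems);
	uint64_t out = 0;

	for (int pos = 0; pos < 64; pos++) {
		if (!(policy & (UINT64_C(1) << pos)))
			continue;
		int ord = pos_to_ord(old_mems, pos);
		if (ord < 0 || w == 0)
			out |= UINT64_C(1) << pos;	/* outside old set: keep */
		else
			out |= UINT64_C(1) << ord_to_pos(new_mems, ord % w);
	}
	return out;
}
```

With these assumptions, a policy on nodes {1,2} of a cpuset that moves from nodes {0..3} to nodes {4..7} ends up on nodes {5,6}, which preserves its position relative to the cpuset.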
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/mm.h>
|
2008-11-20 07:36:30 +08:00
|
|
|
#include <linux/memory.h>
|
2011-05-24 02:51:41 +08:00
|
|
|
#include <linux/export.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/mount.h>
|
|
|
|
#include <linux/namei.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/proc_fs.h>
|
[PATCH] cpuset: use rcu directly optimization
Optimize the cpuset impact on page allocation, the most performance critical
cpuset hook in the kernel.
On each page allocation, the cpuset hook needs to check for a possible change
in the current task's cpuset. It can now handle the common case, of no change,
without taking any spinlock or semaphore, thanks to RCU.
Convert a spinlock on the current task to an rcu_read_lock(), saving
approximately a memory barrier and an atomic op, depending on architecture.
This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
freeing up a detached cpuset until after any critical sections referencing
that pointer.
Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:02 +08:00
|
|
|
#include <linux/rcupdate.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/sched.h>
|
|
|
|
#include <linux/seq_file.h>
|
2006-06-23 17:04:00 +08:00
|
|
|
#include <linux/security.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/spinlock.h>
|
|
|
|
#include <linux/stat.h>
|
|
|
|
#include <linux/string.h>
|
|
|
|
#include <linux/time.h>
|
|
|
|
#include <linux/backing-dev.h>
|
|
|
|
#include <linux/sort.h>
|
|
|
|
|
|
|
|
#include <asm/uaccess.h>
|
2011-07-27 07:09:06 +08:00
|
|
|
#include <linux/atomic.h>
|
2006-03-23 19:00:18 +08:00
|
|
|
#include <linux/mutex.h>
|
2008-02-07 16:14:43 +08:00
|
|
|
#include <linux/workqueue.h>
|
|
|
|
#include <linux/cgroup.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-01-08 17:01:57 +08:00
|
|
|
/*
|
|
|
|
* Tracks how many cpusets are currently defined in system.
|
|
|
|
* When there is only one cpuset (the root cpuset) we can
|
|
|
|
* short circuit some hooks.
|
|
|
|
*/
|
2006-01-08 17:02:03 +08:00
|
|
|
int number_of_cpusets __read_mostly;
|
2006-01-08 17:01:57 +08:00
|
|
|
|
2008-02-07 16:14:45 +08:00
|
|
|
/* Forward declare cgroup structures */
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cgroup_subsys cpuset_subsys;
|
|
|
|
struct cpuset;
|
|
|
|
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
|
|
|
/* See "Frequency meter" comments, below. */
|
|
|
|
|
|
|
|
struct fmeter {
|
|
|
|
int cnt; /* unprocessed events count */
|
|
|
|
int val; /* most recent output value */
|
|
|
|
time_t time; /* clock (secs) when val computed */
|
|
|
|
spinlock_t lock; /* guards read or write of above */
|
|
|
|
};
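The memory pressure commit message describes this struct as a simple digital filter with a half-life of about 10 seconds, reporting reclaims per second times 1000. The following user-space sketch shows one way such a filter can work. It is an illustration under stated assumptions, not the kernel's implementation: the names ending in `_sketch` and the `fm_*` helpers are hypothetical, the decay factor 933/1000 is chosen because 0.933^10 is roughly 0.5, the cap on `cnt` is an arbitrary guard, and the spinlock is omitted.

```c
#include <assert.h>

#define FM_COEF_SKETCH   933	/* assumed per-second decay, scaled by 1000 */
#define FM_SCALE_SKETCH  1000	/* fixed-point scale: 1000 represents 1.0 */
#define FM_MAXCNT_SKETCH 1000000	/* assumed cap on pending event weight */

struct fmeter_sketch {
	int cnt;	/* unprocessed events, each weighted by the scale */
	int val;	/* filtered rate: events per second times 1000 */
	long time;	/* second at which val was last computed */
};

static void fm_init(struct fmeter_sketch *fm, long now)
{
	fm->cnt = 0;
	fm->val = 0;
	fm->time = now;
}

/* Decay val once per elapsed second, then fold in the pending events. */
static void fm_update(struct fmeter_sketch *fm, long now)
{
	long ticks;

	assert(now >= fm->time);
	ticks = now - fm->time;
	while (ticks-- > 0 && fm->val > 0)
		fm->val = (int)(((long long)FM_COEF_SKETCH * fm->val) /
				FM_SCALE_SKETCH);
	fm->time = now;
	fm->val += (int)(((long long)(FM_SCALE_SKETCH - FM_COEF_SKETCH) *
			  fm->cnt) / FM_SCALE_SKETCH);
	fm->cnt = 0;
}

/* Record one event (for example, one direct reclaim attempt). */
static void fm_mark(struct fmeter_sketch *fm, long now)
{
	fm_update(fm, now);
	fm->cnt += FM_SCALE_SKETCH;
	if (fm->cnt > FM_MAXCNT_SKETCH)
		fm->cnt = FM_MAXCNT_SKETCH;
}

/* Read the current filtered rate. */
static int fm_rate(struct fmeter_sketch *fm, long now)
{
	fm_update(fm, now);
	return fm->val;
}
```

Because the value is a running average rather than a counter, a monitor needs only a single read to see recent pressure, which is the property the commit message argues for.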
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
struct cpuset {
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cgroup_subsys_state css;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned long flags; /* "unsigned long" so bitops work */
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
|
2005-04-17 06:20:36 +08:00
|
|
|
nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
|
|
|
|
|
2006-01-08 17:01:49 +08:00
|
|
|
struct fmeter fmeter; /* memory_pressure filter */
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* Tasks are being attached to this cpuset. Used to prevent
|
|
|
|
* zeroing cpus/mems_allowed between ->can_attach() and ->attach().
|
|
|
|
*/
|
|
|
|
int attach_in_progress;
|
|
|
|
|
2007-10-19 14:40:20 +08:00
|
|
|
/* partition number for rebuild_sched_domains() */
|
|
|
|
int pn;
|
2008-02-07 16:14:43 +08:00
|
|
|
|
2008-04-15 13:04:23 +08:00
|
|
|
/* for custom sched domain */
|
|
|
|
int relax_domain_level;
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
struct work_struct hotplug_work;
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
/* Retrieve the cpuset for a cgroup */
|
|
|
|
static inline struct cpuset *cgroup_cs(struct cgroup *cont)
|
|
|
|
{
|
|
|
|
return container_of(cgroup_subsys_state(cont, cpuset_subsys_id),
|
|
|
|
struct cpuset, css);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Retrieve the cpuset for a task */
|
|
|
|
static inline struct cpuset *task_cs(struct task_struct *task)
|
|
|
|
{
|
|
|
|
return container_of(task_subsys_state(task, cpuset_subsys_id),
|
|
|
|
struct cpuset, css);
|
|
|
|
}
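Both accessors above rely on `container_of` to step from an embedded `css` member back to the enclosing `struct cpuset`. The user-space sketch below shows the pointer arithmetic involved. The macro and struct names ending in `_sketch` are hypothetical stand-ins, and the macro is a simplified form without the type checking a production version would typically add.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: subtract the member's offset from its address. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct state_sketch {
	int refcnt;
};

struct cpuset_sketch {
	unsigned long flags;
	struct state_sketch css;	/* deliberately not the first member */
};

/* Recover the enclosing object from a pointer to its embedded state. */
static struct cpuset_sketch *state_to_cpuset(struct state_sketch *css)
{
	assert(css != NULL);
	return container_of_sketch(css, struct cpuset_sketch, css);
}
```

Placing `css` after `flags` in the sketch makes the offset non-zero, so a plain cast would give the wrong address while the macro still returns the enclosing object.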
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
static inline struct cpuset *parent_cs(const struct cpuset *cs)
|
|
|
|
{
|
|
|
|
struct cgroup *pcgrp = cs->css.cgroup->parent;
|
|
|
|
|
|
|
|
if (pcgrp)
|
|
|
|
return cgroup_cs(pcgrp);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2011-12-20 09:11:52 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
static inline bool task_has_mempolicy(struct task_struct *task)
|
|
|
|
{
|
|
|
|
return task->mempolicy;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline bool task_has_mempolicy(struct task_struct *task)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* bits in struct cpuset flags field */
|
|
|
|
typedef enum {
|
2013-01-08 00:51:07 +08:00
|
|
|
CS_ONLINE,
|
2005-04-17 06:20:36 +08:00
|
|
|
CS_CPU_EXCLUSIVE,
|
|
|
|
CS_MEM_EXCLUSIVE,
|
2008-04-29 16:00:26 +08:00
|
|
|
CS_MEM_HARDWALL,
|
[PATCH] cpusets: swap migration interface
Add a boolean "memory_migrate" to each cpuset, represented by a file
containing "0" or "1" in each directory below /dev/cpuset.
It defaults to false (file contains "0"). It can be set true by writing
"1" to the file.
If true, then anytime that a task is attached to the cpuset so marked, the
pages of that task will be moved to that cpuset, preserving, to the extent
practical, the cpuset-relative placement of the pages.
Also anytime that a cpuset so marked has its memory placement changed (by
writing to its "mems" file), the tasks in that cpuset will have their pages
moved to the cpuset's new nodes, preserving, to the extent practical, the
cpuset-relative placement of the moved pages.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:00:56 +08:00
|
|
|
CS_MEMORY_MIGRATE,
|
2007-10-19 14:40:20 +08:00
|
|
|
CS_SCHED_LOAD_BALANCE,
|
2006-03-24 19:16:03 +08:00
|
|
|
CS_SPREAD_PAGE,
|
|
|
|
CS_SPREAD_SLAB,
|
2005-04-17 06:20:36 +08:00
|
|
|
} cpuset_flagbits_t;
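The CS_SPREAD_PAGE and CS_SPREAD_SLAB flags enable the memory spreading policy, whose commit message says a per-task rotor selects the next node in the task's mems_allowed. A minimal user-space sketch of that round-robin selection follows. It assumes at most 64 nodes in a `uint64_t`, and the names `task_sketch`, `next_allowed_node` and `mem_spread_node` are hypothetical, not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES_SKETCH 64

/* Hypothetical stand-in for the per-task state the policy needs. */
struct task_sketch {
	uint64_t mems_allowed;	/* bit n set: node n is allowed */
	int spread_rotor;	/* last node handed out, or -1 initially */
};

/*
 * Return the next allowed node after `prev`, wrapping around to the
 * lowest allowed node when the end of the mask is reached.
 */
static int next_allowed_node(uint64_t mems_allowed, int prev)
{
	assert(mems_allowed != 0);
	assert(prev >= -1 && prev < MAX_NODES_SKETCH);
	for (int step = 1; step <= MAX_NODES_SKETCH; step++) {
		int node = (prev + step) % MAX_NODES_SKETCH;

		if (mems_allowed & (UINT64_C(1) << node))
			return node;
	}
	return -1;	/* not reached: the mask is non-empty */
}

/* Advance the rotor and return the node to prefer for this allocation. */
static int mem_spread_node(struct task_sketch *t)
{
	t->spread_rotor = next_allowed_node(t->mems_allowed, t->spread_rotor);
	return t->spread_rotor;
}
```

For a task allowed nodes 1, 3 and 4, successive calls return 1, 3, 4 and then 1 again, so allocations made by a single reading thread are spread over every allowed node.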
|
|
|
|
|
|
|
|
/* convenient tests for these bits */
|
2013-01-08 00:51:07 +08:00
|
|
|
static inline bool is_cpuset_online(const struct cpuset *cs)
|
|
|
|
{
|
|
|
|
return test_bit(CS_ONLINE, &cs->flags);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline int is_cpu_exclusive(const struct cpuset *cs)
|
|
|
|
{
|
2006-03-24 19:16:00 +08:00
|
|
|
return test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int is_mem_exclusive(const struct cpuset *cs)
|
|
|
|
{
|
2006-03-24 19:16:00 +08:00
|
|
|
return test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-04-29 16:00:26 +08:00
|
|
|
static inline int is_mem_hardwall(const struct cpuset *cs)
|
|
|
|
{
|
|
|
|
return test_bit(CS_MEM_HARDWALL, &cs->flags);
|
|
|
|
}
|
|
|
|
|
2007-10-19 14:40:20 +08:00
|
|
|
static inline int is_sched_load_balance(const struct cpuset *cs)
|
|
|
|
{
|
|
|
|
return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
|
|
|
|
}
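The helpers above all test one bit of the `flags` word, indexed by an enum value, which is why the field is declared `unsigned long`. The sketch below mimics that pattern in user space. It is illustrative only: the `*_sketch` names are hypothetical, and unlike kernel bit operations these helpers are plain, non-atomic C.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Bit indices, in the style of the flag enum above (names hypothetical). */
enum flag_sketch {
	FL_ONLINE_SKETCH,
	FL_CPU_EXCLUSIVE_SKETCH,
	FL_MEM_EXCLUSIVE_SKETCH,
};

#define BITS_PER_ULONG_SKETCH ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Set bit `nr` in the flag word (non-atomic). */
static void set_flag_sketch(int nr, unsigned long *flags)
{
	assert(nr >= 0 && nr < BITS_PER_ULONG_SKETCH);
	*flags |= 1UL << nr;
}

/* Clear bit `nr` in the flag word (non-atomic). */
static void clear_flag_sketch(int nr, unsigned long *flags)
{
	assert(nr >= 0 && nr < BITS_PER_ULONG_SKETCH);
	*flags &= ~(1UL << nr);
}

/* Report whether bit `nr` is set. */
static bool test_flag_sketch(int nr, const unsigned long *flags)
{
	assert(nr >= 0 && nr < BITS_PER_ULONG_SKETCH);
	return (*flags >> nr) & 1UL;
}
```

Using enum values as bit indices keeps each new flag a one-line addition, at the cost of limiting the set to the width of one `unsigned long`.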
|
|
|
|
|
2006-01-08 17:00:56 +08:00
|
|
|
static inline int is_memory_migrate(const struct cpuset *cs)
|
|
|
|
{
|
2006-03-24 19:16:00 +08:00
|
|
|
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
|
2006-01-08 17:00:56 +08:00
|
|
|
}
|
|
|
|
|
2006-03-24 19:16:03 +08:00
|
|
|
static inline int is_spread_page(const struct cpuset *cs)
|
|
|
|
{
|
|
|
|
return test_bit(CS_SPREAD_PAGE, &cs->flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int is_spread_slab(const struct cpuset *cs)
|
|
|
|
{
|
|
|
|
return test_bit(CS_SPREAD_SLAB, &cs->flags);
|
|
|
|
}
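The accessors above all follow one pattern: each boolean cpuset property is a single bit in `cs->flags`, read through `test_bit()`. A minimal userspace sketch of that pattern follows. Every `demo_` / `DEMO_` name is hypothetical, the bit numbering is illustrative rather than the kernel's, and `demo_test_bit()` is a plain non-atomic stand-in for the kernel helper.

```c
/* Hypothetical flag bit numbers; the ordering is illustrative only. */
enum demo_flagbits {
	DEMO_CS_ONLINE,
	DEMO_CS_CPU_EXCLUSIVE,
	DEMO_CS_MEM_EXCLUSIVE,
	DEMO_CS_MEMORY_MIGRATE,
	DEMO_CS_SPREAD_PAGE,
	DEMO_CS_SPREAD_SLAB,
};

struct demo_cpuset {
	unsigned long flags;	/* one bit per demo_flagbits value */
};

/* Non-atomic stand-in for test_bit(): returns 1 if bit nr is set, else 0. */
static int demo_test_bit(int nr, const unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}

static int demo_is_memory_migrate(const struct demo_cpuset *cs)
{
	return demo_test_bit(DEMO_CS_MEMORY_MIGRATE, &cs->flags);
}

static int demo_is_spread_page(const struct demo_cpuset *cs)
{
	return demo_test_bit(DEMO_CS_SPREAD_PAGE, &cs->flags);
}

static int demo_is_spread_slab(const struct demo_cpuset *cs)
{
	return demo_test_bit(DEMO_CS_SPREAD_SLAB, &cs->flags);
}
```

Because each accessor returns exactly 0 or 1, callers can compare the results of two accessors directly, which the subset check later in this file relies on.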
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static struct cpuset top_cpuset = {
|
2013-01-08 00:51:07 +08:00
|
|
|
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
|
|
|
|
(1 << CS_MEM_EXCLUSIVE)),
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/**
|
|
|
|
* cpuset_for_each_child - traverse online children of a cpuset
|
|
|
|
* @child_cs: loop cursor pointing to the current child
|
|
|
|
* @pos_cgrp: used for iteration
|
|
|
|
* @parent_cs: target cpuset to walk children of
|
|
|
|
*
|
|
|
|
* Walk @child_cs through the online children of @parent_cs. Must be used
|
|
|
|
* with RCU read locked.
|
|
|
|
*/
|
|
|
|
#define cpuset_for_each_child(child_cs, pos_cgrp, parent_cs) \
|
|
|
|
cgroup_for_each_child((pos_cgrp), (parent_cs)->css.cgroup) \
|
|
|
|
if (is_cpuset_online(((child_cs) = cgroup_cs((pos_cgrp)))))
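The macro above combines an iteration macro with a trailing `if`, so the loop body only runs for online children. The same idiom can be sketched in plain C, assuming a hypothetical `demo_node` type with a sibling list in place of the cgroup iterator; none of these names come from the kernel.

```c
#include <stddef.h>

/* Hypothetical tree node: children are linked through next_sibling. */
struct demo_node {
	int online;
	struct demo_node *next_sibling;
	struct demo_node *first_child;
};

/*
 * Visit only the online children of parent. The macro expands to a for
 * header followed by a filtering if, so the statement written after the
 * macro becomes the body of that if.
 */
#define demo_for_each_online_child(child, parent)			\
	for ((child) = (parent)->first_child; (child);			\
	     (child) = (child)->next_sibling)				\
		if ((child)->online)

/* Count the online children of parent using the filtered iterator. */
static int demo_count_online_children(const struct demo_node *parent)
{
	const struct demo_node *child;
	int n = 0;

	demo_for_each_online_child(child, parent)
		n++;
	return n;
}
```

One consequence of this design is that an `else` written directly after the loop body would attach to the macro's hidden `if`, so bodies that need an `else` are best wrapped in braces.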
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
/**
|
|
|
|
* cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants
|
|
|
|
* @des_cs: loop cursor pointing to the current descendant
|
|
|
|
* @pos_cgrp: used for iteration
|
|
|
|
 * @root_cs: target cpuset to walk descendants of
|
|
|
|
*
|
|
|
|
* Walk @des_cs through the online descendants of @root_cs. Must be used
|
|
|
|
* with RCU read locked. The caller may modify @pos_cgrp by calling
|
|
|
|
 * cgroup_rightmost_descendant() to skip a subtree.
|
|
|
|
*/
|
|
|
|
#define cpuset_for_each_descendant_pre(des_cs, pos_cgrp, root_cs) \
|
|
|
|
cgroup_for_each_descendant_pre((pos_cgrp), (root_cs)->css.cgroup) \
|
|
|
|
if (is_cpuset_online(((des_cs) = cgroup_cs((pos_cgrp)))))
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2013-01-08 00:51:08 +08:00
|
|
|
* There are two global mutexes guarding cpuset structures - cpuset_mutex
|
|
|
|
* and callback_mutex. The latter may nest inside the former. We also
|
|
|
|
* require taking task_lock() when dereferencing a task's cpuset pointer.
|
|
|
|
* See "The task_lock() exception", at the end of this comment.
|
|
|
|
*
|
|
|
|
* A task must hold both mutexes to modify cpusets. If a task holds
|
|
|
|
* cpuset_mutex, then it blocks others wanting that mutex, ensuring that it
|
|
|
|
* is the only task able to also acquire callback_mutex and be able to
|
|
|
|
* modify cpusets. It can perform various checks on the cpuset structure
|
|
|
|
* first, knowing nothing will change. It can also allocate memory while
|
|
|
|
* just holding cpuset_mutex. While it is performing these checks, various
|
|
|
|
* callback routines can briefly acquire callback_mutex to query cpusets.
|
|
|
|
* Once it is ready to make the changes, it takes callback_mutex, blocking
|
|
|
|
* everyone else.
|
[PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking. Replace single semaphore with two semaphores.
The suggestion to use two locks was made by Roman Zippel.
Both locks are global. Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers. Before making actual
changes, the second semaphore, callback_sem must be acquired as well. Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.
The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.
This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention. Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would be modifiers.)
This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman). I've never
seen these races fail in any use or test.
See further the comments in the code.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31 07:02:30 +08:00
|
|
|
*
|
|
|
|
* Calls to the kernel memory allocator can not be made while holding
|
2006-03-23 19:00:18 +08:00
|
|
|
* callback_mutex, as that would risk double tripping on callback_mutex
|
2005-10-31 07:02:30 +08:00
|
|
|
* from one of the callbacks into the cpuset code from within
|
|
|
|
* __alloc_pages().
|
|
|
|
*
|
2006-03-23 19:00:18 +08:00
|
|
|
* If a task is only holding callback_mutex, then it has read-only
|
2005-10-31 07:02:30 +08:00
|
|
|
* access to cpusets.
|
|
|
|
*
|
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the memory
policy code, because the memory policy was originally applied in the
process's own context. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect the task's mems_allowed and memory policy.
But in the fast path we do not take a lock to protect them, because adding a
lock may lead to a performance regression. Without a lock, however, the
task might see no nodes when the cpuset's mems_allowed is changed to some
non-overlapping set. To avoid that, we first set all newly allowed nodes,
then clear the newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
|
|
|
* Now, the task_struct fields mems_allowed and mempolicy may be changed
|
|
|
|
 * by another task, so we use alloc_lock in the task_struct to protect
|
|
|
|
* them.
|
2005-10-31 07:02:30 +08:00
|
|
|
*
|
2006-03-23 19:00:18 +08:00
|
|
|
* The cpuset_common_file_read() handlers only hold callback_mutex across
|
2005-10-31 07:02:30 +08:00
|
|
|
* small pieces of code, such as when reading out possibly multi-word
|
|
|
|
* cpumasks and nodemasks.
|
|
|
|
*
|
2008-02-07 16:14:45 +08:00
|
|
|
* Accessing a task's cpuset should be done in accordance with the
|
|
|
|
* guidelines for accessing subsystem state in kernel/cgroup.c
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
static DEFINE_MUTEX(cpuset_mutex);
|
2006-03-23 19:00:18 +08:00
|
|
|
static DEFINE_MUTEX(callback_mutex);
|
[PATCH] cpuset semaphore depth check deadlock fix
The cpusets-formalize-intermediate-gfp_kernel-containment patch
has a deadlock problem.
This patch was part of a set of four patches to make more
extensive use of the cpuset 'mem_exclusive' attribute to
manage kernel GFP_KERNEL memory allocations and to constrain
the out-of-memory (oom) killer.
A task that is changing cpusets in particular ways on a system
when it is very short of free memory could double trip over
the global cpuset_sem semaphore (get the lock and then deadlock
trying to get it again).
The second attempt to get cpuset_sem would be in the routine
cpuset_zone_allowed(). This was discovered by code inspection.
I can not reproduce the problem except with an artificially
hacked kernel and a specialized stress test.
In real life you cannot hit this unless you are manipulating
cpusets, and are very unlikely to hit it unless you are rapidly
modifying cpusets on a memory tight system. Even then it would
be a rare occurrence.
If you did hit it, the task double tripping over cpuset_sem
would deadlock in the kernel, and any other task also trying
to manipulate cpusets would deadlock there too, on cpuset_sem.
Your batch manager would be wedged solid (if it was cpuset
savvy), but classic Unix shells and utilities would work well
enough to reboot the system.
The unusual condition that led to this bug is that unlike most
semaphores, cpuset_sem _can_ be acquired while in the page
allocation code, when __alloc_pages() calls cpuset_zone_allowed.
So it is easy to mistakenly perform the following sequence:
1) task makes system call to alter a cpuset
2) take cpuset_sem
3) try to allocate memory
4) memory allocator, via cpuset_zone_allowed, tries to take cpuset_sem
5) deadlock
The reason that this is not a serious bug for most users
is that almost all calls to allocate memory don't require
taking cpuset_sem. Only some code paths off the beaten
track require taking cpuset_sem -- which is good. Taking
a global semaphore on the main code path for allocating
memory would not scale well.
This patch fixes this deadlock by wrapping the up() and down()
calls on cpuset_sem in kernel/cpuset.c with code that tracks
the nesting depth of the current task on that semaphore, and
only does the real down() if the task doesn't hold the lock
already, and only does the real up() if the nesting depth
(number of unmatched downs) is exactly one.
The previous required use of refresh_mems(), anytime that
the cpuset_sem semaphore was acquired and the code executed
while holding that semaphore might try to allocate memory, is
no longer required. Two refresh_mems() calls were removed
thanks to this. This is a good change, as failing to get
all the necessary refresh_mems() calls placed was a primary
source of bugs in this cpuset code. The only remaining call
to refresh_mems() is made while doing a memory allocation,
if certain task memory placement data needs to be updated
from its cpuset, due to the cpuset having been changed behind
the task's back.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 15:26:06 +08:00
|
|
|
|
2009-01-07 06:39:01 +08:00
|
|
|
/*
|
|
|
|
* cpuset_buffer_lock protects both the cpuset_name and cpuset_nodelist
|
|
|
|
* buffers. They are statically allocated to prevent using excess stack
|
|
|
|
* when calling cpuset_print_task_mems_allowed().
|
|
|
|
*/
|
|
|
|
#define CPUSET_NAME_LEN (128)
|
|
|
|
#define CPUSET_NODELIST_LEN (256)
|
|
|
|
static char cpuset_name[CPUSET_NAME_LEN];
|
|
|
|
static char cpuset_nodelist[CPUSET_NODELIST_LEN];
|
|
|
|
static DEFINE_SPINLOCK(cpuset_buffer_lock);
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* CPU / memory hotplug is handled asynchronously.
|
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
static struct workqueue_struct *cpuset_propagate_hotplug_wq;
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
static void cpuset_hotplug_workfn(struct work_struct *work);
|
2013-01-08 00:51:07 +08:00
|
|
|
static void cpuset_propagate_hotplug_workfn(struct work_struct *work);
|
2013-01-08 00:51:08 +08:00
|
|
|
static void schedule_cpuset_propagate_hotplug(struct cpuset *cs);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
|
|
|
static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);
|
|
|
|
|
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all misformatted comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested it with the suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12 05:33:53 +08:00
|
|
|
/*
|
|
|
|
* This is ugly, but preserves the userspace API for existing cpuset
|
2007-10-19 14:39:39 +08:00
|
|
|
* users. If someone tries to mount the "cpuset" filesystem, we
|
2008-08-12 05:33:53 +08:00
|
|
|
* silently switch it to mount "cgroup" instead
|
|
|
|
*/
|
2010-07-26 17:23:11 +08:00
|
|
|
static struct dentry *cpuset_mount(struct file_system_type *fs_type,
|
|
|
|
int flags, const char *unused_dev_name, void *data)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-10-19 14:39:39 +08:00
|
|
|
struct file_system_type *cgroup_fs = get_fs_type("cgroup");
|
2010-07-26 17:23:11 +08:00
|
|
|
struct dentry *ret = ERR_PTR(-ENODEV);
|
2007-10-19 14:39:39 +08:00
|
|
|
if (cgroup_fs) {
|
|
|
|
char mountopts[] =
|
|
|
|
"cpuset,noprefix,"
|
|
|
|
"release_agent=/sbin/cpuset_release_agent";
|
2010-07-26 17:23:11 +08:00
|
|
|
ret = cgroup_fs->mount(cgroup_fs, flags,
|
|
|
|
unused_dev_name, mountopts);
|
2007-10-19 14:39:39 +08:00
|
|
|
put_filesystem(cgroup_fs);
|
|
|
|
}
|
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct file_system_type cpuset_fs_type = {
|
|
|
|
.name = "cpuset",
|
2010-07-26 17:23:11 +08:00
|
|
|
.mount = cpuset_mount,
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
2009-01-08 10:08:44 +08:00
|
|
|
 * Return in pmask the portion of a cpuset's cpus_allowed that
|
2005-04-17 06:20:36 +08:00
|
|
|
* are online. If none are online, walk up the cpuset hierarchy
|
|
|
|
* until we find one that does have some online cpus. If we get
|
|
|
|
* all the way to the top and still haven't found any online cpus,
|
2012-03-29 13:08:31 +08:00
|
|
|
 * return cpu_online_mask.  Or if passed a NULL cs from an exiting
|
|
|
|
* task, return cpu_online_mask.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* One way or another, we guarantee to return some non-empty subset
|
2012-03-29 13:08:31 +08:00
|
|
|
* of cpu_online_mask.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2006-03-23 19:00:18 +08:00
|
|
|
* Call with callback_mutex held.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
2009-01-08 10:08:45 +08:00
|
|
|
static void guarantee_online_cpus(const struct cpuset *cs,
|
|
|
|
struct cpumask *pmask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
while (cs && !cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
|
2013-01-08 00:51:08 +08:00
|
|
|
cs = parent_cs(cs);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (cs)
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
|
2005-04-17 06:20:36 +08:00
|
|
|
else
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_copy(pmask, cpu_online_mask);
|
|
|
|
BUG_ON(!cpumask_intersects(pmask, cpu_online_mask));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
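The fallback logic in guarantee_online_cpus() can be illustrated in userspace: walk up the parent chain until a set intersects the online mask, and fall back to the whole online mask if none does. The sketch below assumes a 64-bit integer in place of `struct cpumask` and a hypothetical `demo_cs` type; it also assumes the caller passes a non-empty `online` mask, which is what lets the function promise a non-empty result.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for a cpuset: a CPU bitmask plus a parent link. */
struct demo_cs {
	uint64_t cpus_allowed;
	const struct demo_cs *parent;
};

/*
 * Return the online part of the nearest ancestor-or-self of cs whose
 * cpus_allowed intersects online. If no set in the chain intersects, or
 * cs is NULL, return online itself.
 */
static uint64_t demo_guarantee_online(const struct demo_cs *cs, uint64_t online)
{
	while (cs && !(cs->cpus_allowed & online))
		cs = cs->parent;

	return cs ? (cs->cpus_allowed & online) : online;
}
```

With a root allowing CPUs 0-3 and a child allowing CPUs 4-5, an online mask of CPUs 0-2 makes the child's own mask useless, so the walk moves to the root and yields CPUs 0-2.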
|
|
|
|
|
|
|
|
/*
|
|
|
|
 * Return in *pmask the portion of a cpuset's mems_allowed that
|
2007-10-16 16:25:38 +08:00
|
|
|
* are online, with memory. If none are online with memory, walk
|
|
|
|
* up the cpuset hierarchy until we find one that does have some
|
|
|
|
* online mems. If we get all the way to the top and still haven't
|
2012-12-13 05:51:24 +08:00
|
|
|
* found any online mems, return node_states[N_MEMORY].
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* One way or another, we guarantee to return some non-empty subset
|
2012-12-13 05:51:24 +08:00
|
|
|
* of node_states[N_MEMORY].
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2006-03-23 19:00:18 +08:00
|
|
|
* Call with callback_mutex held.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
|
|
|
|
{
|
2007-10-16 16:25:38 +08:00
|
|
|
while (cs && !nodes_intersects(cs->mems_allowed,
|
2012-12-13 05:51:24 +08:00
|
|
|
node_states[N_MEMORY]))
|
2013-01-08 00:51:08 +08:00
|
|
|
cs = parent_cs(cs);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (cs)
|
2007-10-16 16:25:38 +08:00
|
|
|
nodes_and(*pmask, cs->mems_allowed,
|
2012-12-13 05:51:24 +08:00
|
|
|
node_states[N_MEMORY]);
|
2005-04-17 06:20:36 +08:00
|
|
|
else
|
2012-12-13 05:51:24 +08:00
|
|
|
*pmask = node_states[N_MEMORY];
|
|
|
|
BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2009-06-17 06:31:46 +08:00
|
|
|
/*
|
|
|
|
 * Update the task's spread flags to match the cpuset's page/slab spread flags
|
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Called with callback_mutex/cpuset_mutex held
|
2009-06-17 06:31:46 +08:00
|
|
|
*/
|
|
|
|
static void cpuset_update_task_spread_flag(struct cpuset *cs,
|
|
|
|
struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
if (is_spread_page(cs))
|
|
|
|
tsk->flags |= PF_SPREAD_PAGE;
|
|
|
|
else
|
|
|
|
tsk->flags &= ~PF_SPREAD_PAGE;
|
|
|
|
if (is_spread_slab(cs))
|
|
|
|
tsk->flags |= PF_SPREAD_SLAB;
|
|
|
|
else
|
|
|
|
tsk->flags &= ~PF_SPREAD_SLAB;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q?
|
|
|
|
*
|
|
|
|
* One cpuset is a subset of another if all its allowed CPUs and
|
|
|
|
* Memory Nodes are a subset of the other, and its exclusive flags
|
2013-01-08 00:51:08 +08:00
|
|
|
* are only set if the other's are set. Call holding cpuset_mutex.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
|
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
return cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
|
2005-04-17 06:20:36 +08:00
|
|
|
nodes_subset(p->mems_allowed, q->mems_allowed) &&
|
|
|
|
is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
|
|
|
|
is_mem_exclusive(p) <= is_mem_exclusive(q);
|
|
|
|
}
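/*
 * Illustrative sketch only, not part of kernel/cpuset.c: the subset rule
 * of is_cpuset_subset() above, restated with plain 64-bit masks. The
 * struct sketch_cs and both helpers are hypothetical names. The "<="
 * comparison on the exclusive flags encodes "p may be exclusive only if
 * q is", since false <= true and true <= true hold but true <= false
 * does not.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpuset fields that the subset test reads. */
struct sketch_cs {
	uint64_t cpus;		/* bit n set => CPU n allowed */
	uint64_t mems;		/* bit n set => memory node n allowed */
	bool cpu_exclusive;
	bool mem_exclusive;
};

/* True if every bit set in a is also set in b. */
static bool sketch_mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/* True if p is a subset of q under the rule described above. */
static bool sketch_is_subset(const struct sketch_cs *p,
			     const struct sketch_cs *q)
{
	assert(p && q);

	return sketch_mask_subset(p->cpus, q->cpus) &&
	       sketch_mask_subset(p->mems, q->mems) &&
	       p->cpu_exclusive <= q->cpu_exclusive &&
	       p->mem_exclusive <= q->mem_exclusive;
}
```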
|
|
|
|
|
2009-01-08 10:08:43 +08:00
|
|
|
/**
|
|
|
|
* alloc_trial_cpuset - allocate a trial cpuset
|
|
|
|
* @cs: the cpuset that the trial cpuset duplicates
|
|
|
|
*/
|
|
|
|
static struct cpuset *alloc_trial_cpuset(const struct cpuset *cs)
|
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
struct cpuset *trial;
|
|
|
|
|
|
|
|
trial = kmemdup(cs, sizeof(*cs), GFP_KERNEL);
|
|
|
|
if (!trial)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL)) {
|
|
|
|
kfree(trial);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
|
|
|
|
|
|
|
|
return trial;
|
2009-01-08 10:08:43 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* free_trial_cpuset - free the trial cpuset
|
|
|
|
* @trial: the trial cpuset to be freed
|
|
|
|
*/
|
|
|
|
static void free_trial_cpuset(struct cpuset *trial)
|
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
free_cpumask_var(trial->cpus_allowed);
|
2009-01-08 10:08:43 +08:00
|
|
|
kfree(trial);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* validate_change() - Used to validate that any proposed cpuset change
|
|
|
|
* follows the structural rules for cpusets.
|
|
|
|
*
|
|
|
|
* If we replaced the flag and mask values of the current cpuset
|
|
|
|
* (cur) with those values in the trial cpuset (trial), would
|
|
|
|
* our various subset and exclusive rules still be valid? Presumes
|
2013-01-08 00:51:08 +08:00
|
|
|
* cpuset_mutex held.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* 'cur' is the address of an actual, in-use cpuset. Operations
|
|
|
|
* such as list traversal that depend on the actual address of the
|
|
|
|
* cpuset in the list must use cur below, not trial.
|
|
|
|
*
|
|
|
|
* 'trial' is the address of a bulk structure copy of cur, with
|
|
|
|
* perhaps one or more of the fields cpus_allowed, mems_allowed,
|
|
|
|
* or flags changed to new, trial values.
|
|
|
|
*
|
|
|
|
* Return 0 if valid, -errno if not.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
|
|
|
|
{
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cgroup *cont;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct cpuset *c, *par;
|
2013-01-08 00:51:07 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* Each of our child cpusets must be a subset of us */
|
2013-01-08 00:51:07 +08:00
|
|
|
ret = -EBUSY;
|
|
|
|
cpuset_for_each_child(c, cont, cur)
|
|
|
|
if (!is_cpuset_subset(c, trial))
|
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* Remaining checks don't apply to root cpuset */
|
2013-01-08 00:51:07 +08:00
|
|
|
ret = 0;
|
2006-12-07 12:36:15 +08:00
|
|
|
if (cur == &top_cpuset)
|
2013-01-08 00:51:07 +08:00
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
par = parent_cs(cur);
|
2006-12-07 12:36:15 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* We must be a subset of our parent cpuset */
|
2013-01-08 00:51:07 +08:00
|
|
|
ret = -EACCES;
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!is_cpuset_subset(trial, par))
|
2013-01-08 00:51:07 +08:00
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-02-07 16:14:45 +08:00
|
|
|
/*
|
|
|
|
* If either I or some sibling (!= me) is exclusive, we can't
|
|
|
|
* overlap
|
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
cpuset_for_each_child(c, cont, par) {
|
2005-04-17 06:20:36 +08:00
|
|
|
if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
|
|
|
|
c != cur &&
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
|
2013-01-08 00:51:07 +08:00
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
|
|
|
|
c != cur &&
|
|
|
|
nodes_intersects(trial->mems_allowed, c->mems_allowed))
|
2013-01-08 00:51:07 +08:00
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* Cpusets with tasks - existing or newly being attached - can't
|
|
|
|
* have empty cpus_allowed or mems_allowed.
|
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
ret = -ENOSPC;
|
2013-01-08 00:51:07 +08:00
|
|
|
if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
|
2013-01-08 00:51:07 +08:00
|
|
|
(cpumask_empty(trial->cpus_allowed) ||
|
|
|
|
nodes_empty(trial->mems_allowed)))
|
|
|
|
goto out;
|
2007-10-19 14:40:21 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
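/*
 * Illustrative sketch only, not part of kernel/cpuset.c: the sibling
 * exclusivity check from validate_change() above, reduced to the CPU
 * mask case. Siblings are held in a plain array rather than walked with
 * cpuset_for_each_child(), and cur is identified by its array index.
 * All names here (sketch_sib, sketch_check_siblings) are hypothetical.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpuset fields used by the sibling check. */
struct sketch_sib {
	uint64_t cpus;		/* bit n set => CPU n allowed */
	bool cpu_exclusive;
};

/*
 * Return 0 if 'trial' could replace siblings[cur_idx] without breaking
 * the exclusivity rule, or -EINVAL if it overlaps a sibling while either
 * of the two is marked exclusive.
 */
static int sketch_check_siblings(const struct sketch_sib *siblings, size_t n,
				 size_t cur_idx,
				 const struct sketch_sib *trial)
{
	size_t i;

	assert(trial && (siblings || n == 0));

	for (i = 0; i < n; i++) {
		const struct sketch_sib *c = &siblings[i];

		if (i == cur_idx)	/* the "c != cur" test above */
			continue;
		if ((trial->cpu_exclusive || c->cpu_exclusive) &&
		    (trial->cpus & c->cpus))
			return -EINVAL;
	}
	return 0;
}
```

/*
 * The memory-node check in validate_change() has the same shape, with
 * mems_allowed and is_mem_exclusive() in place of the CPU fields.
 */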
|
|
|
|
|
2009-04-03 07:57:55 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2007-10-19 14:40:20 +08:00
|
|
|
/*
|
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all miss-formated comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested in with suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12 05:33:53 +08:00
|
|
|
* Helper routine for generate_sched_domains().
|
2007-10-19 14:40:20 +08:00
|
|
|
* Do cpusets a, b have overlapping cpus_allowed masks?
|
|
|
|
*/
|
|
|
|
static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
|
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
return cpumask_intersects(a->cpus_allowed, b->cpus_allowed);
|
2007-10-19 14:40:20 +08:00
|
|
|
}
|
|
|
|
|
2008-04-15 13:04:23 +08:00
|
|
|
static void
|
|
|
|
update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c)
|
|
|
|
{
|
|
|
|
if (dattr->relax_domain_level < c->relax_domain_level)
|
|
|
|
dattr->relax_domain_level = c->relax_domain_level;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
static void update_domain_attr_tree(struct sched_domain_attr *dattr,
|
|
|
|
struct cpuset *root_cs)
|
2008-07-30 13:33:22 +08:00
|
|
|
{
|
2013-01-08 00:51:08 +08:00
|
|
|
struct cpuset *cp;
|
|
|
|
struct cgroup *pos_cgrp;
|
2008-07-30 13:33:22 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
|
|
|
|
/* skip the whole subtree if @cp doesn't have any CPU */
|
|
|
|
if (cpumask_empty(cp->cpus_allowed)) {
|
|
|
|
pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
|
2008-07-30 13:33:22 +08:00
|
|
|
continue;
|
2013-01-08 00:51:08 +08:00
|
|
|
}
|
2008-07-30 13:33:22 +08:00
|
|
|
|
|
|
|
if (is_sched_load_balance(cp))
|
|
|
|
update_domain_attr(dattr, cp);
|
|
|
|
}
|
2013-01-08 00:51:08 +08:00
|
|
|
rcu_read_unlock();
|
2008-07-30 13:33:22 +08:00
|
|
|
}
|
|
|
|
|
2007-10-19 14:40:20 +08:00
|
|
|
/*
|
2008-08-12 05:33:53 +08:00
|
|
|
* generate_sched_domains()
|
|
|
|
*
|
|
|
|
* This function builds a partial partition of the system's CPUs.
|
|
|
|
* A 'partial partition' is a set of non-overlapping subsets whose
|
|
|
|
* union is a subset of that set.
|
|
|
|
* The output of this function needs to be passed to the kernel/sched.c
|
|
|
|
* partition_sched_domains() routine, which will rebuild the scheduler's
|
|
|
|
* load balancing domains (sched domains) as specified by that partial
|
|
|
|
* partition.
|
2007-10-19 14:40:20 +08:00
|
|
|
*
|
2009-01-16 05:50:59 +08:00
|
|
|
* See "What is sched_load_balance" in Documentation/cgroups/cpusets.txt
|
2007-10-19 14:40:20 +08:00
|
|
|
* for a background explanation of this.
|
|
|
|
*
|
|
|
|
* Does not return errors, on the theory that the callers of this
|
|
|
|
* routine would rather not worry about failures to rebuild sched
|
|
|
|
* domains when operating in the severe memory shortage situations
|
|
|
|
* that could cause allocation failures below.
|
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Must be called with cpuset_mutex held.
|
2007-10-19 14:40:20 +08:00
|
|
|
*
|
|
|
|
* The three key local variables below are:
|
2008-07-30 13:33:24 +08:00
|
|
|
* q - a linked-list queue of cpuset pointers, used to implement a
|
2007-10-19 14:40:20 +08:00
|
|
|
* top-down scan of all cpusets. This scan loads a pointer
|
|
|
|
* to each cpuset marked is_sched_load_balance into the
|
|
|
|
* array 'csa'. For our purposes, rebuilding the scheduler's
|
|
|
|
* sched domains, we can ignore !is_sched_load_balance cpusets.
|
|
|
|
* csa - (for CpuSet Array) Array of pointers to all the cpusets
|
|
|
|
* that need to be load balanced, for convenient iterative
|
|
|
|
* access by the subsequent code that finds the best partition,
|
|
|
|
* i.e. the set of domains (subsets) of CPUs such that the
|
|
|
|
* cpus_allowed of every cpuset marked is_sched_load_balance
|
|
|
|
* is a subset of one of these domains, while there are as
|
|
|
|
* many such domains as possible, each as small as possible.
|
|
|
|
* doms - Conversion of 'csa' to an array of cpumasks, for passing to
|
|
|
|
* the kernel/sched.c routine partition_sched_domains() in a
|
|
|
|
* convenient format, that can be easily compared to the prior
|
|
|
|
* value to determine what partition elements (sched domains)
|
|
|
|
* were changed (added or removed).
|
|
|
|
*
|
|
|
|
* Finding the best partition (set of domains):
|
|
|
|
* The triple nested loops below over i, j, k scan over the
|
|
|
|
* load balanced cpusets (using the array of cpuset pointers in
|
|
|
|
* csa[]) looking for pairs of cpusets that have overlapping
|
|
|
|
* cpus_allowed, but which don't have the same 'pn' partition
|
|
|
|
* number, and gives them the same partition number. It keeps
|
|
|
|
* looping on the 'restart' label until it can no longer find
|
|
|
|
* any such pairs.
|
|
|
|
*
|
|
|
|
* The union of the cpus_allowed masks from the set of
|
|
|
|
* all cpusets having the same 'pn' value then forms one
|
|
|
|
* element of the partition (one sched domain) to be passed to
|
|
|
|
* partition_sched_domains().
|
|
|
|
*/
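/*
 * Illustrative sketch only, not part of kernel/cpuset.c: the "finding the
 * best partition" step described in the comment above, restated as a
 * userspace function. It assumes at most 64 CPUs so that a uint64_t can
 * stand in for a cpumask. The names sketch_set, sketch_partition and
 * sketch_domain_mask are hypothetical. Overlap is transitive here: if A
 * overlaps B and B overlaps C, all three end up with one partition number
 * even when A and C share no CPU.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpuset: a CPU bitmask plus 'pn'. */
struct sketch_set {
	uint64_t cpus;	/* bit n set => CPU n allowed */
	int pn;		/* partition number */
};

/*
 * Give each set its own partition number, then repeatedly merge any two
 * overlapping sets that still carry different numbers, restarting the
 * scan after each merge. Returns the number of partitions left.
 */
static int sketch_partition(struct sketch_set *sets, int n)
{
	int i, j, k;
	int ndoms = n;

	assert(n >= 0);

	for (i = 0; i < n; i++)
		sets[i].pn = i;

restart:
	for (i = 0; i < n; i++) {
		for (j = 0; j < n; j++) {
			int apn = sets[i].pn;
			int bpn = sets[j].pn;

			if (apn != bpn && (sets[i].cpus & sets[j].cpus)) {
				/* Relabel everything in b's partition as a's. */
				for (k = 0; k < n; k++) {
					if (sets[k].pn == bpn)
						sets[k].pn = apn;
				}
				ndoms--;	/* one less element */
				goto restart;
			}
		}
	}
	return ndoms;
}

/* Union of the masks of every set that carries partition number pn. */
static uint64_t sketch_domain_mask(const struct sketch_set *sets, int n, int pn)
{
	uint64_t mask = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (sets[i].pn == pn)
			mask |= sets[i].cpus;
	}
	return mask;
}
```

/*
 * Each restart follows a merge that lowers the partition count by one, so
 * the scan restarts at most n - 1 times.
 */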
|
2009-11-03 12:23:40 +08:00
|
|
|
static int generate_sched_domains(cpumask_var_t **domains,
|
2008-08-12 05:33:53 +08:00
|
|
|
struct sched_domain_attr **attributes)
|
2007-10-19 14:40:20 +08:00
|
|
|
{
|
|
|
|
struct cpuset *cp; /* scans q */
|
|
|
|
struct cpuset **csa; /* array of all cpuset ptrs */
|
|
|
|
int csn; /* how many cpuset ptrs in csa so far */
|
|
|
|
int i, j, k; /* indices for partition finding loops */
|
2009-11-03 12:23:40 +08:00
|
|
|
cpumask_var_t *doms; /* resulting partition; i.e. sched domains */
|
2008-04-15 13:04:23 +08:00
|
|
|
struct sched_domain_attr *dattr; /* attributes for custom domains */
|
2008-11-25 17:27:49 +08:00
|
|
|
int ndoms = 0; /* number of sched domains in result */
|
2009-01-08 10:08:45 +08:00
|
|
|
int nslot; /* next empty doms[] struct cpumask slot */
|
2013-01-08 00:51:08 +08:00
|
|
|
struct cgroup *pos_cgrp;
|
2007-10-19 14:40:20 +08:00
|
|
|
|
|
|
|
doms = NULL;
|
2008-04-15 13:04:23 +08:00
|
|
|
dattr = NULL;
|
2008-08-12 05:33:53 +08:00
|
|
|
csa = NULL;
|
2007-10-19 14:40:20 +08:00
|
|
|
|
|
|
|
/* Special case for the 99% of systems with one, full, sched domain */
|
|
|
|
if (is_sched_load_balance(&top_cpuset)) {
|
2009-11-03 12:23:40 +08:00
|
|
|
ndoms = 1;
|
|
|
|
doms = alloc_sched_domains(ndoms);
|
2007-10-19 14:40:20 +08:00
|
|
|
if (!doms)
|
2008-08-12 05:33:53 +08:00
|
|
|
goto done;
|
|
|
|
|
2008-04-15 13:04:23 +08:00
|
|
|
dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
|
|
|
|
if (dattr) {
|
|
|
|
*dattr = SD_ATTR_INIT;
|
2008-07-30 13:33:23 +08:00
|
|
|
update_domain_attr_tree(dattr, &top_cpuset);
|
2008-04-15 13:04:23 +08:00
|
|
|
}
|
2009-11-03 12:23:40 +08:00
|
|
|
cpumask_copy(doms[0], top_cpuset.cpus_allowed);
|
2008-08-12 05:33:53 +08:00
|
|
|
|
|
|
|
goto done;
|
2007-10-19 14:40:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
|
|
|
|
if (!csa)
|
|
|
|
goto done;
|
|
|
|
csn = 0;
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
cpuset_for_each_descendant_pre(cp, pos_cgrp, &top_cpuset) {
|
2008-07-30 13:33:22 +08:00
|
|
|
/*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Continue traversing beyond @cp iff @cp has some CPUs and
|
|
|
|
* isn't load balancing. The former is obvious. The
|
|
|
|
* latter: All child cpusets contain a subset of the
|
|
|
|
* parent's cpus, so just skip them, and then we call
|
|
|
|
* update_domain_attr_tree() to calc relax_domain_level of
|
|
|
|
* the corresponding sched domain.
|
2008-07-30 13:33:22 +08:00
|
|
|
*/
|
2013-01-08 00:51:08 +08:00
|
|
|
if (!cpumask_empty(cp->cpus_allowed) &&
|
|
|
|
!is_sched_load_balance(cp))
|
2008-07-30 13:33:22 +08:00
|
|
|
continue;
|
2008-07-25 16:47:23 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
if (is_sched_load_balance(cp))
|
|
|
|
csa[csn++] = cp;
|
|
|
|
|
|
|
|
/* skip @cp's subtree */
|
|
|
|
pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2007-10-19 14:40:20 +08:00
|
|
|
|
|
|
|
for (i = 0; i < csn; i++)
|
|
|
|
csa[i]->pn = i;
|
|
|
|
ndoms = csn;
|
|
|
|
|
|
|
|
restart:
|
|
|
|
/* Find the best partition (set of sched domains) */
|
|
|
|
for (i = 0; i < csn; i++) {
|
|
|
|
struct cpuset *a = csa[i];
|
|
|
|
int apn = a->pn;
|
|
|
|
|
|
|
|
for (j = 0; j < csn; j++) {
|
|
|
|
struct cpuset *b = csa[j];
|
|
|
|
int bpn = b->pn;
|
|
|
|
|
|
|
|
if (apn != bpn && cpusets_overlap(a, b)) {
|
|
|
|
for (k = 0; k < csn; k++) {
|
|
|
|
struct cpuset *c = csa[k];
|
|
|
|
|
|
|
|
if (c->pn == bpn)
|
|
|
|
c->pn = apn;
|
|
|
|
}
|
|
|
|
ndoms--; /* one less element */
|
|
|
|
goto restart;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-08-12 05:33:53 +08:00
|
|
|
/*
|
|
|
|
* Now we know how many domains to create.
|
|
|
|
* Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
|
|
|
|
*/
|
2009-11-03 12:23:40 +08:00
|
|
|
doms = alloc_sched_domains(ndoms);
|
2008-11-18 14:02:03 +08:00
|
|
|
if (!doms)
|
2008-08-12 05:33:53 +08:00
|
|
|
goto done;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The rest of the code, including the scheduler, can deal with
|
|
|
|
* the dattr == NULL case. No need to abort if the allocation fails.
|
|
|
|
*/
|
2008-04-15 13:04:23 +08:00
|
|
|
dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);
|
2007-10-19 14:40:20 +08:00
|
|
|
|
|
|
|
for (nslot = 0, i = 0; i < csn; i++) {
|
|
|
|
struct cpuset *a = csa[i];
|
2009-01-08 10:08:45 +08:00
|
|
|
struct cpumask *dp;
|
2007-10-19 14:40:20 +08:00
|
|
|
int apn = a->pn;
|
|
|
|
|
2008-08-12 05:33:53 +08:00
|
|
|
if (apn < 0) {
|
|
|
|
/* Skip completed partitions */
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2009-11-03 12:23:40 +08:00
|
|
|
dp = doms[nslot];
|
2008-08-12 05:33:53 +08:00
|
|
|
|
|
|
|
if (nslot == ndoms) {
|
|
|
|
static int warnings = 10;
|
|
|
|
if (warnings) {
|
|
|
|
printk(KERN_WARNING
|
|
|
|
"rebuild_sched_domains confused:"
|
|
|
|
" nslot %d, ndoms %d, csn %d, i %d,"
|
|
|
|
" apn %d\n",
|
|
|
|
nslot, ndoms, csn, i, apn);
|
|
|
|
warnings--;
|
2007-10-19 14:40:20 +08:00
|
|
|
}
|
2008-08-12 05:33:53 +08:00
|
|
|
continue;
|
|
|
|
}
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2009-01-08 10:08:45 +08:00
|
|
|
cpumask_clear(dp);
|
2008-08-12 05:33:53 +08:00
|
|
|
if (dattr)
|
|
|
|
*(dattr + nslot) = SD_ATTR_INIT;
|
|
|
|
for (j = i; j < csn; j++) {
|
|
|
|
struct cpuset *b = csa[j];
|
|
|
|
|
|
|
|
if (apn == b->pn) {
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_or(dp, dp, b->cpus_allowed);
|
2008-08-12 05:33:53 +08:00
|
|
|
if (dattr)
|
|
|
|
update_domain_attr_tree(dattr + nslot, b);
|
|
|
|
|
|
|
|
/* Done with this partition */
|
|
|
|
b->pn = -1;
|
2007-10-19 14:40:20 +08:00
|
|
|
}
|
|
|
|
}
|
2008-08-12 05:33:53 +08:00
|
|
|
nslot++;
|
2007-10-19 14:40:20 +08:00
|
|
|
}
|
|
|
|
BUG_ON(nslot != ndoms);
|
|
|
|
|
2008-08-12 05:33:53 +08:00
|
|
|
done:
|
|
|
|
kfree(csa);
|
|
|
|
|
2008-11-18 14:02:03 +08:00
|
|
|
/*
|
|
|
|
* Fall back to the default domain if alloc_sched_domains() failed.
|
|
|
|
* See comments in partition_sched_domains().
|
|
|
|
*/
|
|
|
|
if (doms == NULL)
|
|
|
|
ndoms = 1;
|
|
|
|
|
2008-08-12 05:33:53 +08:00
|
|
|
*domains = doms;
|
|
|
|
*attributes = dattr;
|
|
|
|
return ndoms;
|
|
|
|
}
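The conversion loop above folds every cpuset that shares a partition number into a single domain mask. A minimal userspace sketch of that merging step follows, assuming a 64-bit integer in place of struct cpumask. The names toy_cpuset and toy_build_domains are hypothetical and exist only for illustration; the sketch stops when it runs out of slots, whereas the kernel code prints a rate-limited warning and skips the entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpuset: a CPU bitmask plus a partition number. */
struct toy_cpuset {
	uint64_t cpus;	/* bit n set means CPU n is allowed (stand-in for cpus_allowed) */
	int pn;		/* partition number; set to -1 once folded into a domain */
};

/*
 * Merge all cpusets that share a partition number into one domain mask.
 * Fills at most ndoms entries of doms[] and returns the number filled.
 */
static int toy_build_domains(struct toy_cpuset *csa, int csn,
			     uint64_t *doms, int ndoms)
{
	int nslot = 0;

	assert(csa != NULL && doms != NULL);

	for (int i = 0; i < csn; i++) {
		int apn = csa[i].pn;

		if (apn < 0)
			continue;	/* skip completed partitions */
		if (nslot == ndoms)
			break;		/* no slot left; simplification of the kernel's warn-and-skip */

		doms[nslot] = 0;
		for (int j = i; j < csn; j++) {
			if (csa[j].pn == apn) {
				doms[nslot] |= csa[j].cpus;
				csa[j].pn = -1;	/* done with this cpuset */
			}
		}
		nslot++;
	}
	return nslot;
}
```

With four cpusets whose partition numbers are 0, 1, 0, 1, the sketch fills two slots, each holding the union of the masks in its partition.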
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Rebuild scheduler domains.
|
|
|
|
*
|
2013-01-08 00:51:07 +08:00
|
|
|
* If the flag 'sched_load_balance' of any cpuset with non-empty
|
|
|
|
* 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
|
|
|
|
* which has that flag enabled, or if any cpuset with a non-empty
|
|
|
|
* 'cpus' is removed, then call this routine to rebuild the
|
|
|
|
* scheduler's dynamic sched domains.
|
2008-08-12 05:33:53 +08:00
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Call with cpuset_mutex held. Takes get_online_cpus().
|
2008-08-12 05:33:53 +08:00
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
static void rebuild_sched_domains_locked(void)
|
2008-08-12 05:33:53 +08:00
|
|
|
{
|
|
|
|
struct sched_domain_attr *attr;
|
2009-11-03 12:23:40 +08:00
|
|
|
cpumask_var_t *doms;
|
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all miss-formated comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested it with the suspend/resume path and everything is
working as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12 05:33:53 +08:00
|
|
|
int ndoms;
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
lockdep_assert_held(&cpuset_mutex);
|
2008-01-26 04:08:02 +08:00
|
|
|
get_online_cpus();
|
2008-08-12 05:33:53 +08:00
|
|
|
|
|
|
|
/* Generate domain masks and attrs */
|
|
|
|
ndoms = generate_sched_domains(&doms, &attr);
|
|
|
|
|
|
|
|
/* Have scheduler rebuild the domains */
|
|
|
|
partition_sched_domains(ndoms, doms, attr);
|
|
|
|
|
2008-01-26 04:08:02 +08:00
|
|
|
put_online_cpus();
|
2008-08-12 05:33:53 +08:00
|
|
|
}
|
2009-04-03 07:57:55 +08:00
|
|
|
#else /* !CONFIG_SMP */
|
2013-01-08 00:51:07 +08:00
|
|
|
static void rebuild_sched_domains_locked(void)
|
2009-04-03 07:57:55 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2009-12-07 03:41:16 +08:00
|
|
|
static int generate_sched_domains(cpumask_var_t **domains,
|
2009-04-03 07:57:55 +08:00
|
|
|
struct sched_domain_attr **attributes)
|
|
|
|
{
|
|
|
|
*domains = NULL;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_SMP */
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2008-08-12 05:33:53 +08:00
|
|
|
void rebuild_sched_domains(void)
|
|
|
|
{
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
2013-01-08 00:51:07 +08:00
|
|
|
rebuild_sched_domains_locked();
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_unlock(&cpuset_mutex);
|
2007-10-19 14:40:20 +08:00
|
|
|
}
|
|
|
|
|
2008-02-07 16:14:44 +08:00
|
|
|
/**
|
|
|
|
* cpuset_test_cpumask - test a task's cpus_allowed versus its cpuset's
|
|
|
|
* @tsk: task to test
|
|
|
|
* @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
|
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Call with cpuset_mutex held. May take callback_mutex during call.
|
2008-02-07 16:14:44 +08:00
|
|
|
* Called for each task in a cgroup by cgroup_scan_tasks().
|
|
|
|
* Return nonzero if this task's cpus_allowed mask should be changed (in other
|
|
|
|
* words, if its mask is not equal to its cpuset's mask).
|
[PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking. Replace single semaphore with two semaphores.
The suggestion to use two locks was made by Roman Zippel.
Both locks are global. Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers. Before making actual
changes, the second semaphore, callback_sem must be acquired as well. Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.
The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.
This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention. Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would be modifiers.)
This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman). I've never
seen these races fail in any use or test.
See further the comments in the code.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
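The two-lock nesting described in this commit message can be sketched in user space. This is only an illustration under stated assumptions: `manage_lock` and `callback_lock` are hypothetical pthread mutexes standing in for manage_sem and callback_sem, and `allowed_mask`, `query_mask()` and `modify_mask()` are invented names, not kernel symbols.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for manage_sem (outer) and callback_sem (inner). */
static pthread_mutex_t manage_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t callback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Toy shared state standing in for a cpuset's allowed mask. */
static unsigned long allowed_mask = 0x1;

/* Query path: only the inner lock is taken. */
static unsigned long query_mask(void)
{
	unsigned long snapshot;

	pthread_mutex_lock(&callback_lock);
	snapshot = allowed_mask;
	pthread_mutex_unlock(&callback_lock);
	return snapshot;
}

/* Modify path: outer lock first, inner lock only around the actual write. */
static int modify_mask(unsigned long new_mask)
{
	int ret = 0;

	pthread_mutex_lock(&manage_lock);

	/*
	 * Validation runs with only the outer lock held, so queries are
	 * still possible while other would-be modifiers are held off.
	 */
	if (new_mask == 0) {
		ret = -1;
		goto out;
	}

	pthread_mutex_lock(&callback_lock);
	allowed_mask = new_mask;
	pthread_mutex_unlock(&callback_lock);
out:
	pthread_mutex_unlock(&manage_lock);
	return ret;
}
```

Because validation holds only the outer lock, a query issued from within that window (the allocation path in the commit message) can take the inner lock without deadlocking.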
2005-10-31 07:02:30 +08:00
|
|
|
*/
|
2008-04-29 16:00:25 +08:00
|
|
|
static int cpuset_test_cpumask(struct task_struct *tsk,
|
|
|
|
struct cgroup_scanner *scan)
|
2008-02-07 16:14:44 +08:00
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
return !cpumask_equal(&tsk->cpus_allowed,
|
2008-02-07 16:14:44 +08:00
|
|
|
(cgroup_cs(scan->cg))->cpus_allowed);
|
|
|
|
}
|
2005-10-31 07:02:30 +08:00
|
|
|
|
2008-02-07 16:14:44 +08:00
|
|
|
/**
|
|
|
|
* cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
|
|
|
|
* @tsk: task to change
|
|
|
|
* @scan: struct cgroup_scanner containing the cgroup of the task
|
|
|
|
*
|
|
|
|
* Called by cgroup_scan_tasks() for each task in a cgroup whose
|
|
|
|
* cpus_allowed mask needs to be changed.
|
|
|
|
*
|
|
|
|
* We don't need to re-check for the cgroup/cpuset membership, since we're
|
2013-01-08 00:51:08 +08:00
|
|
|
* holding cpuset_mutex at this point.
|
2008-02-07 16:14:44 +08:00
|
|
|
*/
|
2008-04-29 16:00:25 +08:00
|
|
|
static void cpuset_change_cpumask(struct task_struct *tsk,
|
|
|
|
struct cgroup_scanner *scan)
|
2008-02-07 16:14:44 +08:00
|
|
|
{
|
2009-01-08 10:08:44 +08:00
|
|
|
set_cpus_allowed_ptr(tsk, ((cgroup_cs(scan->cg))->cpus_allowed));
|
2008-02-07 16:14:44 +08:00
|
|
|
}
|
|
|
|
|
2008-07-25 16:47:21 +08:00
|
|
|
/**
|
|
|
|
* update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
|
|
|
|
* @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
|
2008-09-13 17:33:08 +08:00
|
|
|
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
|
2008-07-25 16:47:21 +08:00
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Called with cpuset_mutex held
|
2008-07-25 16:47:21 +08:00
|
|
|
*
|
|
|
|
* The cgroup_scan_tasks() function will scan all the tasks in a cgroup,
|
|
|
|
* calling callback functions for each.
|
|
|
|
*
|
2008-09-13 17:33:08 +08:00
|
|
|
* No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
|
|
|
|
* if @heap != NULL.
|
2008-07-25 16:47:21 +08:00
|
|
|
*/
|
2008-09-13 17:33:08 +08:00
|
|
|
static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
|
2008-07-25 16:47:21 +08:00
|
|
|
{
|
|
|
|
struct cgroup_scanner scan;
|
|
|
|
|
|
|
|
scan.cg = cs->css.cgroup;
|
|
|
|
scan.test_task = cpuset_test_cpumask;
|
|
|
|
scan.process_task = cpuset_change_cpumask;
|
2008-09-13 17:33:08 +08:00
|
|
|
scan.heap = heap;
|
|
|
|
cgroup_scan_tasks(&scan);
|
2008-07-25 16:47:21 +08:00
|
|
|
}
|
|
|
|
|
2008-02-07 16:14:44 +08:00
|
|
|
/**
|
|
|
|
* update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
|
|
|
|
* @cs: the cpuset to consider
|
|
|
|
* @buf: buffer of cpu numbers written to this cpuset
|
|
|
|
*/
|
2009-01-08 10:08:43 +08:00
|
|
|
static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
|
|
|
|
const char *buf)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2008-09-13 17:33:08 +08:00
|
|
|
struct ptr_heap heap;
|
2008-02-07 16:14:44 +08:00
|
|
|
int retval;
|
|
|
|
int is_load_balanced;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-03-29 13:08:31 +08:00
|
|
|
/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
|
[PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map
Change the list of cpus allowed to tasks in the top (root) cpuset to
dynamically track what cpus are online, using a CPU hotplug notifier. Make
this top cpus file read-only.
On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.
If that system does support CPU hotplug, then these tasks cannot make use
of CPUs that are added after system boot, because the CPUs are not allowed
in the top cpuset. This is a surprising regression over earlier kernels
that didn't have cpusets enabled.
In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'cpus' file in the top (root) cpuset, making it read
only, and making it automatically track the value of cpu_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
by their cpuset.
Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
driving the fix, and earlier versions of this patch.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 16:23:51 +08:00
|
|
|
if (cs == &top_cpuset)
|
|
|
|
return -EACCES;
|
|
|
|
|
2007-05-08 15:31:43 +08:00
|
|
|
/*
|
hotplug cpu: move tasks in empty cpusets to parent various other fixes
Various minor formatting and comment tweaks to Cliff Wickman's
[PATCH_3_of_3]_cpusets__update_cpumask_revision.patch
I had had "iff", meaning "if and only if" in a comment. However, except for
ancient mathematicians, the abbreviation "iff" was a tad too cryptic. Cliff
changed it to "if", presumably figuring that the "iff" was a typo. However,
it was the "only if" half of the conjunction that was most interesting.
Reword to emphasize the "only if" aspect.
The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
callback_mutex had to be held on entry. The opposite is true.
Several mentions of attach_task() in comments needed to be
changed to cgroup_attach_task().
A comment about notify_on_release was no longer relevant,
as the line of code it had commented, namely:
set_bit(CS_RELEASED_RESOURCE, &parent->flags);
is no longer present in that place in the cpuset.c code.
Similarly a comment about notify_on_release before the
scan_for_empty_cpusets() routine was no longer relevant.
Removed extra parentheses and unnecessary return statement.
Renamed attach_task() to cpuset_attach() in various comments.
Removed comment about not needing memory migration, as it seems the migration
is done anyway, via the cpuset_attach() callback from cgroup_attach_task().
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Cliff Wickman <cpw@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 16:14:46 +08:00
|
|
|
* An empty cpus_allowed is ok only if the cpuset has no tasks.
|
2007-10-19 14:40:21 +08:00
|
|
|
* Since cpulist_parse() fails on an empty mask, we special case
|
|
|
|
* that parsing. The validate_change() call ensures that cpusets
|
|
|
|
* with tasks have cpus.
|
2007-05-08 15:31:43 +08:00
|
|
|
*/
|
2007-10-19 14:40:21 +08:00
|
|
|
if (!*buf) {
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_clear(trialcs->cpus_allowed);
|
2007-05-08 15:31:43 +08:00
|
|
|
} else {
|
2009-01-08 10:08:44 +08:00
|
|
|
retval = cpulist_parse(buf, trialcs->cpus_allowed);
|
2007-05-08 15:31:43 +08:00
|
|
|
if (retval < 0)
|
|
|
|
return retval;
|
2008-06-06 13:46:32 +08:00
|
|
|
|
2009-11-25 20:31:39 +08:00
|
|
|
if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
|
2008-06-06 13:46:32 +08:00
|
|
|
return -EINVAL;
|
2007-05-08 15:31:43 +08:00
|
|
|
}
|
2009-01-08 10:08:43 +08:00
|
|
|
retval = validate_change(cs, trialcs);
|
2005-06-26 05:57:34 +08:00
|
|
|
if (retval < 0)
|
|
|
|
return retval;
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2007-10-19 14:40:22 +08:00
|
|
|
/* Nothing to do if the cpus didn't change */
|
2009-01-08 10:08:44 +08:00
|
|
|
if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
|
2007-10-19 14:40:22 +08:00
|
|
|
return 0;
|
2008-02-07 16:14:44 +08:00
|
|
|
|
2008-09-13 17:33:08 +08:00
|
|
|
retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
|
|
|
|
if (retval)
|
|
|
|
return retval;
|
|
|
|
|
2009-01-08 10:08:43 +08:00
|
|
|
is_load_balanced = is_sched_load_balance(trialcs);
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2007-10-19 14:40:22 +08:00
|
|
|
/*
|
|
|
|
* Scan tasks in the cpuset, and update the cpumasks of any
|
2008-02-07 16:14:44 +08:00
|
|
|
* that need an update.
|
2007-10-19 14:40:22 +08:00
|
|
|
*/
|
2008-09-13 17:33:08 +08:00
|
|
|
update_tasks_cpumask(cs, &heap);
|
|
|
|
|
|
|
|
heap_free(&heap);
|
2008-02-07 16:14:44 +08:00
|
|
|
|
2007-10-19 14:40:22 +08:00
|
|
|
if (is_load_balanced)
|
2013-01-08 00:51:07 +08:00
|
|
|
rebuild_sched_domains_locked();
|
2005-06-26 05:57:34 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2006-03-31 18:30:52 +08:00
|
|
|
/*
|
|
|
|
* cpuset_migrate_mm
|
|
|
|
*
|
|
|
|
* Migrate memory region from one set of nodes to another.
|
|
|
|
*
|
|
|
|
* Temporarily set the task's mems_allowed to target nodes of migration,
|
|
|
|
* so that the migration code can allocate pages on these nodes.
|
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Call holding cpuset_mutex, so current's cpuset won't change
|
2008-02-07 16:14:46 +08:00
|
|
|
* during this call, as cpuset_mutex holds off any cpuset_attach()
|
2006-03-31 18:30:52 +08:00
|
|
|
* calls. Therefore we don't need to take task_lock around the
|
|
|
|
* call to guarantee_online_mems(), as we know no one is changing
|
2008-02-07 16:14:45 +08:00
|
|
|
* our task's cpuset.
|
2006-03-31 18:30:52 +08:00
|
|
|
*
|
|
|
|
* While the mm_struct we are migrating is typically from some
|
|
|
|
* other task, the task_struct mems_allowed that we are hacking
|
|
|
|
* is for our current task, which must allocate new pages for that
|
|
|
|
* migrating memory region.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
|
|
|
|
const nodemask_t *to)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk = current;
|
|
|
|
|
|
|
|
tsk->mems_allowed = *to;
|
|
|
|
|
|
|
|
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
|
2006-03-31 18:30:52 +08:00
|
|
|
}
|
|
|
|
|
2009-04-03 07:57:51 +08:00
|
|
|
/*
|
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the memory
policy code, because the memory policy was originally applied in the
process's own context. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect the task's mems_allowed and memory policy.
In the fast path, however, we do not take the lock, because adding a
lock may lead to a performance regression. Without a lock, the
task might see no nodes when its cpuset's mems_allowed changes to some
non-overlapping set. To avoid that, we set all newly allowed nodes first,
then clear the newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
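The "set all new allowed nodes, then clear newly disallowed ones" ordering can be illustrated with a single-word bitmask. This is a minimal sketch, assuming a hypothetical `toy_nodemask` type and a helper `toy_change_nodemask()` that are not part of the kernel. The real code operates on nodemask_t with the kernel's nodemask helpers and rebinds the mempolicy in between.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-word nodemask; bit n set means node n is allowed. */
typedef uint64_t toy_nodemask;

/*
 * Move *mask to newmask in two steps. The intermediate value is stored
 * in *mid so a caller can inspect what a lockless reader might observe
 * between the two steps.
 */
static void toy_change_nodemask(toy_nodemask *mask, toy_nodemask newmask,
				toy_nodemask *mid)
{
	/* Step 1: add the newly allowed nodes; the old nodes stay set. */
	*mask |= newmask;
	*mid = *mask;

	/* Step 2: drop the nodes that are no longer allowed. */
	*mask &= newmask;
}
```

With disjoint old and new sets (for example 0x3 and 0xc), the intermediate value is their union, 0xf, so a reader that looks at the whole word between the two steps sees a non-empty mask. Assigning the new value directly would be just as safe for one word; the ordering matters when the mask spans several words that cannot be updated together.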
2009-06-17 06:31:49 +08:00
|
|
|
* cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
|
|
|
|
* @tsk: the task to change
|
|
|
|
* @newmems: the new set of nodes the task is allowed to use
|
|
|
|
*
|
|
|
|
* In order to avoid seeing no nodes if the old and new nodes are disjoint,
|
|
|
|
* we structure updates as setting all new allowed nodes, then clearing newly
|
|
|
|
* disallowed ones.
|
|
|
|
*/
|
|
|
|
static void cpuset_change_task_nodemask(struct task_struct *tsk,
|
|
|
|
nodemask_t *newmems)
|
|
|
|
{
|
2011-12-20 09:11:52 +08:00
|
|
|
bool need_loop;
|
2011-11-03 04:38:39 +08:00
|
|
|
|
2010-05-25 05:32:08 +08:00
|
|
|
/*
|
|
|
|
* Allow tasks that have access to memory reserves because they have
|
|
|
|
* been OOM killed to get memory anywhere.
|
|
|
|
*/
|
|
|
|
if (unlikely(test_thread_flag(TIF_MEMDIE)))
|
|
|
|
return;
|
|
|
|
if (current->flags & PF_EXITING) /* Let dying task have memory */
|
|
|
|
return;
|
|
|
|
|
|
|
|
task_lock(tsk);
|
2011-12-20 09:11:52 +08:00
|
|
|
/*
|
|
|
|
* Determine if a loop is necessary if another thread is doing
|
|
|
|
* get_mems_allowed(). If at least one node remains unchanged and
|
|
|
|
* tsk does not have a mempolicy, then an empty nodemask will not be
|
|
|
|
* possible when mems_allowed is larger than a word.
|
|
|
|
*/
|
|
|
|
need_loop = task_has_mempolicy(tsk) ||
|
|
|
|
!nodes_intersects(*newmems, tsk->mems_allowed);
|
2010-05-25 05:32:08 +08:00
|
|
|
|
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
|
|
|
if (need_loop)
|
|
|
|
write_seqcount_begin(&tsk->mems_allowed_seq);
|
2010-05-25 05:32:08 +08:00
|
|
|
|
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
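The read/retry scheme described above can be sketched in userspace C11 as follows. This is only an illustration under stated assumptions, not the kernel's seqcount implementation: `struct seq_mask`, `seq_read_begin()`, `seq_read_retry()`, `seq_write_mask()` and `seq_read_mask()` are hypothetical names, the protected data is a single word, and only one writer is assumed to run at a time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task's mems_allowed plus its sequence counter. */
struct seq_mask {
	atomic_uint seq;	/* odd while an update is in flight */
	atomic_ulong mask;	/* one bit per allowed node */
};

/* Reader: wait out any in-flight update, then remember the sequence seen. */
static unsigned int seq_read_begin(struct seq_mask *m)
{
	unsigned int s;

	while ((s = atomic_load_explicit(&m->seq, memory_order_acquire)) & 1u)
		;	/* writer active; spin */
	return s;
}

/* Reader: true if an update started since seq_read_begin(). */
static bool seq_read_retry(struct seq_mask *m, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&m->seq, memory_order_relaxed) != start;
}

/* Writer (assumed single): bump to odd, update, bump back to even. */
static void seq_write_mask(struct seq_mask *m, unsigned long newmask)
{
	unsigned int s = atomic_load_explicit(&m->seq, memory_order_relaxed);

	assert((s & 1u) == 0);	/* no update already in flight */
	atomic_store_explicit(&m->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&m->mask, newmask, memory_order_relaxed);
	atomic_store_explicit(&m->seq, s + 2, memory_order_release);
}

/* Reader fast path: no lock and no write to shared state, only a retry loop. */
static unsigned long seq_read_mask(struct seq_mask *m)
{
	unsigned long v;
	unsigned int s;

	do {
		s = seq_read_begin(m);
		v = atomic_load_explicit(&m->mask, memory_order_relaxed);
	} while (seq_read_retry(m, s));
	return v;
}
```

The sketch bumps the counter on every update. The patch itself only does so when the update could cause a false failure (the `need_loop` test in the code), so readers are not forced to retry unnecessarily.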
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                                  3.3.0-rc3               3.3.0-rc3
                                rc3-vanilla          nobarrier-v2r1
Clients 1 UserTime           0.07 (  0.00%)          0.08 (-14.19%)
Clients 2 UserTime           0.07 (  0.00%)          0.07 (  2.72%)
Clients 4 UserTime           0.08 (  0.00%)          0.07 (  3.29%)
Clients 1 SysTime            0.70 (  0.00%)          0.65 (  6.65%)
Clients 2 SysTime            0.85 (  0.00%)          0.82 (  3.65%)
Clients 4 SysTime            1.41 (  0.00%)          1.41 (  0.32%)
Clients 1 WallTime           0.77 (  0.00%)          0.74 (  4.19%)
Clients 2 WallTime           0.47 (  0.00%)          0.45 (  3.73%)
Clients 4 WallTime           0.38 (  0.00%)          0.37 (  1.58%)
Clients 1 Flt/sec/cpu   497620.28 (  0.00%)     520294.53 (  4.56%)
Clients 2 Flt/sec/cpu   414639.05 (  0.00%)     429882.01 (  3.68%)
Clients 4 Flt/sec/cpu   257959.16 (  0.00%)     258761.48 (  0.31%)
Clients 1 Flt/sec       495161.39 (  0.00%)     517292.87 (  4.47%)
Clients 2 Flt/sec       820325.95 (  0.00%)     850289.77 (  3.65%)
Clients 4 Flt/sec      1020068.93 (  0.00%)    1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)       135.68    132.17
User+Sys Time Running Test (seconds)  164.2     160.13
Total Elapsed Time (seconds)          123.46    120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
|
|
|
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
|
|
|
|
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
|
2010-05-25 05:32:08 +08:00
|
|
|
|
|
|
|
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
|
cpuset,mm: update tasks' mems_allowed in time
Fix allocation of page cache/slab objects on a disallowed node when memory
spread is set, by updating each task's mems_allowed after its cpuset's mems
is changed.
In order to update tasks' mems_allowed in time, we must modify the memory
policy code, because the memory policy was originally applied in the
process's own context. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect the task's mems_allowed and memory policy.
In the fast path, however, we don't take the lock to protect them, because
adding a lock may lead to a performance regression. Without a lock, the
task might see no nodes when the cpuset's mems_allowed is changed to some
non-overlapping set. To avoid this, we first set all newly allowed nodes,
then clear the newly disallowed ones.
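The set-then-clear ordering can be sketched as below. This is a minimal illustration, assuming a node mask that fits in one word (the kernel's nodemask_t can span several words); `sketch_nodemask_t`, `sketch_allow_new()` and `sketch_drop_old()` are hypothetical names. It mirrors the `nodes_or()` call followed by the plain assignment to `tsk->mems_allowed` in the code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical single-word node mask, one bit per memory node. */
typedef atomic_ulong sketch_nodemask_t;

/* Step 1: widen to old | new so a concurrent reader never sees an empty mask. */
static void sketch_allow_new(sketch_nodemask_t *cur, unsigned long newmask)
{
	assert(newmask != 0);	/* an empty target would defeat the purpose */
	atomic_fetch_or(cur, newmask);
}

/* Step 2: narrow to exactly the new set, dropping newly disallowed nodes. */
static void sketch_drop_old(sketch_nodemask_t *cur, unsigned long newmask)
{
	atomic_store(cur, newmask);
}
```

For example, moving from nodes {0,1} to nodes {2,3} passes through {0,1,2,3} rather than through an empty set, so a lockless reader sampling the mask between the two steps still finds a node to allocate from.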
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mempolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
|
|
|
tsk->mems_allowed = *newmems;
|
2012-03-22 07:34:11 +08:00
|
|
|
|
|
|
|
if (need_loop)
|
|
|
|
write_seqcount_end(&tsk->mems_allowed_seq);
|
|
|
|
|
2010-05-25 05:32:08 +08:00
|
|
|
task_unlock(tsk);
|
2009-06-17 06:31:49 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update task's mems_allowed, rebind its mempolicy and its vmas' mempolicies
|
|
|
|
* to the cpuset's new mems_allowed, and migrate pages to the new nodes if
|
2013-01-08 00:51:08 +08:00
|
|
|
* memory_migrate flag is set. Called with cpuset_mutex held.
|
2009-04-03 07:57:51 +08:00
|
|
|
*/
|
|
|
|
static void cpuset_change_nodemask(struct task_struct *p,
|
|
|
|
struct cgroup_scanner *scan)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm;
|
|
|
|
struct cpuset *cs;
|
|
|
|
int migrate;
|
|
|
|
const nodemask_t *oldmem = scan->data;
|
2013-01-08 00:51:08 +08:00
|
|
|
static nodemask_t newmems; /* protected by cpuset_mutex */
|
2009-06-17 06:31:49 +08:00
|
|
|
|
|
|
|
cs = cgroup_cs(scan->cg);
|
2011-03-24 07:42:47 +08:00
|
|
|
guarantee_online_mems(cs, &newmems);
|
2009-06-17 06:31:49 +08:00
|
|
|
|
2011-03-24 07:42:47 +08:00
|
|
|
cpuset_change_task_nodemask(p, &newmems);
|
2010-03-24 04:35:35 +08:00
|
|
|
|
2009-04-03 07:57:51 +08:00
|
|
|
mm = get_task_mm(p);
|
|
|
|
if (!mm)
|
|
|
|
return;
|
|
|
|
|
|
|
|
migrate = is_memory_migrate(cs);
|
|
|
|
|
|
|
|
mpol_rebind_mm(mm, &cs->mems_allowed);
|
|
|
|
if (migrate)
|
|
|
|
cpuset_migrate_mm(mm, oldmem, &cs->mems_allowed);
|
|
|
|
mmput(mm);
|
|
|
|
}
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
static void *cpuset_being_rebound;
|
|
|
|
|
2008-07-25 16:47:21 +08:00
|
|
|
/**
|
|
|
|
* update_tasks_nodemask - Update the nodemasks of tasks in the cpuset.
|
|
|
|
* @cs: the cpuset in which each task's mems_allowed mask needs to be changed
|
|
|
|
* @oldmem: old mems_allowed of cpuset cs
|
2009-04-03 07:57:52 +08:00
|
|
|
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
|
2008-07-25 16:47:21 +08:00
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Called with cpuset_mutex held
|
2009-04-03 07:57:52 +08:00
|
|
|
* No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
|
|
|
|
* if @heap != NULL.
|
2008-07-25 16:47:21 +08:00
|
|
|
*/
|
2009-04-03 07:57:52 +08:00
|
|
|
static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
|
|
|
|
struct ptr_heap *heap)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2009-04-03 07:57:51 +08:00
|
|
|
struct cgroup_scanner scan;
|
2006-01-08 17:01:52 +08:00
|
|
|
|
2008-04-28 17:13:09 +08:00
|
|
|
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
|
[PATCH] cpuset: rebind vma mempolicies fix
Fix more of a longstanding bug in cpuset/mempolicy interaction.
NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
to just the Memory Nodes allowed by that cpuset. The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
When a task's cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
task's mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.
An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.
This fix rebinds mempolicies attached to vma's (address ranges in a task's
address space). Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired. The task's mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpuset's mems_generation.
Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan. In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock). So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.
Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a task's mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.
When a task is moved to a different cpuset, it is easier, as there is only one
task involved. Its mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.
It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice. This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.
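The "safely no-op a second request" property can be sketched as follows. This is a simplified illustration, not the kernel's mpol_rebind_policy(): `struct sketch_policy` and `sketch_rebind()` are hypothetical names, the masks are single words, and the remapping is deliberately crude (it does not preserve cpuset-relative node numbering).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified policy: a node set plus the mask it is relative to. */
struct sketch_policy {
	unsigned long nodes;		/* nodes the policy currently uses */
	unsigned long relative_to;	/* mems_allowed the policy was last bound to */
};

/*
 * Rebind the policy to newmask. Returns false (and changes nothing) when the
 * policy is already relative to newmask, so a repeated request is a no-op.
 * Otherwise keep the nodes that are still allowed, or fall back to the whole
 * new mask if none survive.
 */
static bool sketch_rebind(struct sketch_policy *pol, unsigned long newmask)
{
	unsigned long kept;

	assert(newmask != 0);
	if (pol->relative_to == newmask)
		return false;

	kept = pol->nodes & newmask;
	pol->nodes = kept ? kept : newmask;
	pol->relative_to = newmask;
	return true;
}
```

Because the policy remembers the mask it was last bound to, it does not matter whether the fork-time hook or the tasklist scan reaches a given mm first; whichever runs second finds nothing left to do.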
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:59 +08:00
|
|
|
|
2009-04-03 07:57:51 +08:00
|
|
|
scan.cg = cs->css.cgroup;
|
|
|
|
scan.test_task = NULL;
|
|
|
|
scan.process_task = cpuset_change_nodemask;
|
2009-04-03 07:57:52 +08:00
|
|
|
scan.heap = heap;
|
2009-04-03 07:57:51 +08:00
|
|
|
scan.data = (nodemask_t *)oldmem;
|
2006-01-08 17:01:59 +08:00
|
|
|
|
|
|
|
/*
|
2009-04-03 07:57:51 +08:00
|
|
|
* The mpol_rebind_mm() call takes mmap_sem, which we couldn't
|
|
|
|
* take while holding tasklist_lock. Forks can happen - the
|
|
|
|
* mpol_dup() cpuset_being_rebound check will catch such forks,
|
|
|
|
* and rebind their vma mempolicies too. Because we still hold
|
2013-01-08 00:51:08 +08:00
|
|
|
* the global cpuset_mutex, we know that no other rebind effort
|
2009-04-03 07:57:51 +08:00
|
|
|
* will be contending for the global variable cpuset_being_rebound.
|
2006-01-08 17:01:59 +08:00
|
|
|
* It's ok if we rebind the same mm twice; mpol_rebind_mm()
|
2006-01-08 17:02:00 +08:00
|
|
|
* is idempotent. Also migrate pages in each mm to new nodes.
|
2006-01-08 17:01:59 +08:00
|
|
|
*/
|
2009-04-03 07:57:52 +08:00
|
|
|
cgroup_scan_tasks(&scan);
|
2006-01-08 17:01:59 +08:00
|
|
|
|
2008-02-07 16:14:45 +08:00
|
|
|
/* We're done rebinding vmas to this cpuset's new mems_allowed. */
|
2007-10-19 14:39:39 +08:00
|
|
|
cpuset_being_rebound = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-07-25 16:47:21 +08:00
|
|
|
/*
|
|
|
|
* Handle user request to change the 'mems' memory placement
|
|
|
|
* of a cpuset. Needs to validate the request, update the
|
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the memory
policy code, because a memory policy was originally applied only in the
context of the owning process. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect the mems_allowed and memory policy of the task.
In the fast path, however, we do not take a lock to protect them, because
adding a lock may lead to a performance regression. Without a lock, the
task might see no nodes when the cpuset's mems_allowed is changed to some
non-overlapping set. To avoid that, we set all newly allowed nodes first,
then clear the newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mempolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
|
|
|
* cpuset's mems_allowed, and for each task in the cpuset,
|
|
|
|
* update mems_allowed and rebind task's mempolicy and any vma
|
|
|
|
* mempolicies and if the cpuset is marked 'memory_migrate',
|
|
|
|
* migrate the task's pages to the new memory.
|
2008-07-25 16:47:21 +08:00
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Call with cpuset_mutex held. May take callback_mutex during call.
|
2008-07-25 16:47:21 +08:00
|
|
|
* Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
|
|
|
|
* lock each such task's mm->mmap_sem, scan its vmas and rebind
|
|
|
|
* their mempolicies to the cpuset's new mems_allowed.
|
|
|
|
*/
|
2009-01-08 10:08:43 +08:00
|
|
|
static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
|
|
|
|
const char *buf)
|
2008-07-25 16:47:21 +08:00
|
|
|
{
|
2010-03-24 04:35:35 +08:00
|
|
|
NODEMASK_ALLOC(nodemask_t, oldmem, GFP_KERNEL);
|
2008-07-25 16:47:21 +08:00
|
|
|
int retval;
|
2009-04-03 07:57:52 +08:00
|
|
|
struct ptr_heap heap;
|
2008-07-25 16:47:21 +08:00
|
|
|
|
2010-03-24 04:35:35 +08:00
|
|
|
if (!oldmem)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2008-07-25 16:47:21 +08:00
|
|
|
/*
|
2012-12-13 05:51:24 +08:00
|
|
|
* top_cpuset.mems_allowed tracks node_states[N_MEMORY];
|
2008-07-25 16:47:21 +08:00
|
|
|
* it's read-only
|
|
|
|
*/
|
2010-03-24 04:35:35 +08:00
|
|
|
if (cs == &top_cpuset) {
|
|
|
|
retval = -EACCES;
|
|
|
|
goto done;
|
|
|
|
}
|
2008-07-25 16:47:21 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* An empty mems_allowed is ok iff there are no tasks in the cpuset.
|
|
|
|
* Since nodelist_parse() fails on an empty mask, we special case
|
|
|
|
* that parsing. The validate_change() call ensures that cpusets
|
|
|
|
* with tasks have memory.
|
|
|
|
*/
|
|
|
|
if (!*buf) {
|
2009-01-08 10:08:43 +08:00
|
|
|
nodes_clear(trialcs->mems_allowed);
|
2008-07-25 16:47:21 +08:00
|
|
|
} else {
|
2009-01-08 10:08:43 +08:00
|
|
|
retval = nodelist_parse(buf, trialcs->mems_allowed);
|
2008-07-25 16:47:21 +08:00
|
|
|
if (retval < 0)
|
|
|
|
goto done;
|
|
|
|
|
2009-01-08 10:08:43 +08:00
|
|
|
if (!nodes_subset(trialcs->mems_allowed,
|
2012-12-13 05:51:24 +08:00
|
|
|
node_states[N_MEMORY])) {
|
2010-03-24 04:35:35 +08:00
|
|
|
retval = -EINVAL;
|
|
|
|
goto done;
|
|
|
|
}
|
2008-07-25 16:47:21 +08:00
|
|
|
}
|
2010-03-24 04:35:35 +08:00
|
|
|
*oldmem = cs->mems_allowed;
|
|
|
|
if (nodes_equal(*oldmem, trialcs->mems_allowed)) {
|
2008-07-25 16:47:21 +08:00
|
|
|
retval = 0; /* Too easy - nothing to do */
|
|
|
|
goto done;
|
|
|
|
}
|
2009-01-08 10:08:43 +08:00
|
|
|
retval = validate_change(cs, trialcs);
|
2008-07-25 16:47:21 +08:00
|
|
|
if (retval < 0)
|
|
|
|
goto done;
|
|
|
|
|
2009-04-03 07:57:52 +08:00
|
|
|
retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
|
|
|
|
if (retval < 0)
|
|
|
|
goto done;
|
|
|
|
|
2008-07-25 16:47:21 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2009-01-08 10:08:43 +08:00
|
|
|
cs->mems_allowed = trialcs->mems_allowed;
|
2008-07-25 16:47:21 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
|
|
|
|
2010-03-24 04:35:35 +08:00
|
|
|
update_tasks_nodemask(cs, oldmem, &heap);
|
2009-04-03 07:57:52 +08:00
|
|
|
|
|
|
|
heap_free(&heap);
|
2008-07-25 16:47:21 +08:00
|
|
|
done:
|
2010-03-24 04:35:35 +08:00
|
|
|
NODEMASK_FREE(oldmem);
|
2008-07-25 16:47:21 +08:00
|
|
|
return retval;
|
|
|
|
}
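The commit message for "cpuset,mm: update tasks' mems_allowed in time" quoted above describes changing a task's allowed nodes without a lock on the reader side: all newly allowed nodes are set first, and only then are the newly disallowed ones cleared. The following is a minimal single-threaded userspace sketch of that store order only. `sim_nodemask` and `sim_change_nodemask()` are hypothetical names invented for this illustration, a plain `unsigned long` stands in for `nodemask_t`, and `volatile` here does not provide the ordering guarantees the kernel relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for nodemask_t: one bit per memory node. */
typedef unsigned long sim_nodemask;

/*
 * Move *mask to newmask in two stores. A reader that samples *mask
 * between the two stores sees old|new, never an empty set, even when
 * the old and new sets do not overlap.
 * The intermediate value is copied to *between so callers can inspect it.
 */
static void sim_change_nodemask(volatile sim_nodemask *mask,
				sim_nodemask newmask,
				sim_nodemask *between)
{
	assert(mask != NULL && between != NULL && newmask != 0);

	*mask |= newmask;	/* step 1: allow every new node */
	*between = *mask;	/* what a lockless reader could observe */
	*mask = newmask;	/* step 2: drop nodes no longer allowed */
}
```

Assigning `newmask` in a single store would also never expose an empty set for a word-sized mask; the two-step order matters when the mask spans several words and is copied piecemeal, which is the case the commit message is concerned with.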
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
int current_cpuset_is_being_rebound(void)
|
|
|
|
{
|
|
|
|
return task_cs(current) == cpuset_being_rebound;
|
|
|
|
}
|
|
|
|
|
2008-05-07 11:42:41 +08:00
|
|
|
static int update_relax_domain_level(struct cpuset *cs, s64 val)
|
2008-04-15 13:04:23 +08:00
|
|
|
{
|
2009-04-03 07:57:55 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2011-04-07 20:10:04 +08:00
|
|
|
if (val < -1 || val >= sched_domain_level_max)
|
2008-05-13 10:27:17 +08:00
|
|
|
return -EINVAL;
|
2009-04-03 07:57:55 +08:00
|
|
|
#endif
|
2008-04-15 13:04:23 +08:00
|
|
|
|
|
|
|
if (val != cs->relax_domain_level) {
|
|
|
|
cs->relax_domain_level = val;
|
2009-01-08 10:08:44 +08:00
|
|
|
if (!cpumask_empty(cs->cpus_allowed) &&
|
|
|
|
is_sched_load_balance(cs))
|
2013-01-08 00:51:07 +08:00
|
|
|
rebuild_sched_domains_locked();
|
2008-04-15 13:04:23 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-06-17 06:31:47 +08:00
|
|
|
/*
|
|
|
|
* cpuset_change_flag - make a task's spread flags the same as its cpuset's
|
|
|
|
* @tsk: task to be updated
|
|
|
|
* @scan: struct cgroup_scanner containing the cgroup of the task
|
|
|
|
*
|
|
|
|
* Called by cgroup_scan_tasks() for each task in a cgroup.
|
|
|
|
*
|
|
|
|
* We don't need to re-check for the cgroup/cpuset membership, since we're
|
2013-01-08 00:51:08 +08:00
|
|
|
* holding cpuset_mutex at this point.
|
2009-06-17 06:31:47 +08:00
|
|
|
*/
|
|
|
|
static void cpuset_change_flag(struct task_struct *tsk,
|
|
|
|
struct cgroup_scanner *scan)
|
|
|
|
{
|
|
|
|
cpuset_update_task_spread_flag(cgroup_cs(scan->cg), tsk);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* update_tasks_flags - update the spread flags of tasks in the cpuset.
|
|
|
|
* @cs: the cpuset in which each task's spread flags needs to be changed
|
|
|
|
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
|
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Called with cpuset_mutex held
|
2009-06-17 06:31:47 +08:00
|
|
|
*
|
|
|
|
* The cgroup_scan_tasks() function will scan all the tasks in a cgroup,
|
|
|
|
* calling callback functions for each.
|
|
|
|
*
|
|
|
|
* No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
|
|
|
|
* if @heap != NULL.
|
|
|
|
*/
|
|
|
|
static void update_tasks_flags(struct cpuset *cs, struct ptr_heap *heap)
|
|
|
|
{
|
|
|
|
struct cgroup_scanner scan;
|
|
|
|
|
|
|
|
scan.cg = cs->css.cgroup;
|
|
|
|
scan.test_task = NULL;
|
|
|
|
scan.process_task = cpuset_change_flag;
|
|
|
|
scan.heap = heap;
|
|
|
|
cgroup_scan_tasks(&scan);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* update_flag - read a 0 or a 1 in a file and update associated flag
|
2008-04-29 16:00:26 +08:00
|
|
|
* bit: the bit to update (see cpuset_flagbits_t)
|
|
|
|
* cs: the cpuset to update
|
|
|
|
* turning_on: whether the flag is being set or cleared
|
[PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking. Replace single semaphore with two semaphores.
The suggestion to use two locks was made by Roman Zippel.
Both locks are global. Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers. Before making actual
changes, the second semaphore, callback_sem must be acquired as well. Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.
The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.
This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention. Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would be modifiers.)
This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman). I've never
seen these races fail in any use or test.
See further the comments in the code.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31 07:02:30 +08:00
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Call with cpuset_mutex held.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
2008-04-29 16:00:00 +08:00
|
|
|
static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
|
|
|
|
int turning_on)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2009-01-08 10:08:43 +08:00
|
|
|
struct cpuset *trialcs;
|
2008-10-19 11:28:18 +08:00
|
|
|
int balance_flag_changed;
|
2009-06-17 06:31:47 +08:00
|
|
|
int spread_flag_changed;
|
|
|
|
struct ptr_heap heap;
|
|
|
|
int err;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-01-08 10:08:43 +08:00
|
|
|
trialcs = alloc_trial_cpuset(cs);
|
|
|
|
if (!trialcs)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (turning_on)
|
2009-01-08 10:08:43 +08:00
|
|
|
set_bit(bit, &trialcs->flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
else
|
2009-01-08 10:08:43 +08:00
|
|
|
clear_bit(bit, &trialcs->flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-01-08 10:08:43 +08:00
|
|
|
err = validate_change(cs, trialcs);
|
2005-06-26 05:57:34 +08:00
|
|
|
if (err < 0)
|
2009-01-08 10:08:43 +08:00
|
|
|
goto out;
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2009-06-17 06:31:47 +08:00
|
|
|
err = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
|
|
|
|
if (err < 0)
|
|
|
|
goto out;
|
|
|
|
|
2007-10-19 14:40:20 +08:00
|
|
|
balance_flag_changed = (is_sched_load_balance(cs) !=
|
2009-01-08 10:08:43 +08:00
|
|
|
is_sched_load_balance(trialcs));
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2009-06-17 06:31:47 +08:00
|
|
|
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|
|
|
|
|| (is_spread_page(cs) != is_spread_page(trialcs)));
|
|
|
|
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2009-01-08 10:08:43 +08:00
|
|
|
cs->flags = trialcs->flags;
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2005-06-26 05:57:34 +08:00
|
|
|
|
2009-01-08 10:08:44 +08:00
|
|
|
if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
|
2013-01-08 00:51:07 +08:00
|
|
|
rebuild_sched_domains_locked();
|
2007-10-19 14:40:20 +08:00
|
|
|
|
2009-06-17 06:31:47 +08:00
|
|
|
if (spread_flag_changed)
|
|
|
|
update_tasks_flags(cs, &heap);
|
|
|
|
heap_free(&heap);
|
2009-01-08 10:08:43 +08:00
|
|
|
out:
|
|
|
|
free_trial_cpuset(trialcs);
|
|
|
|
return err;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
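Both update_nodemask() and update_flag() above follow the same shape: apply the requested change to a trial copy, run validate_change() on the copy, and only then copy the result into the live cpuset under callback_mutex. A reduced userspace sketch of that shape follows. Everything prefixed `sim_` or `SIM_` is hypothetical and exists only for this illustration, the validation rule is an assumption and not the kernel's, and the locking is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct cpuset. */
struct sim_cpuset {
	unsigned long flags;		/* bit field, one bit per flag */
	unsigned long mems_allowed;	/* one bit per memory node */
};

/* Hypothetical flag bit, named only for this sketch. */
#define SIM_MEM_EXCLUSIVE 0

/*
 * Hypothetical validation rule (an assumption, not the kernel's):
 * the exclusive flag is only allowed on a set that owns some node.
 */
static int sim_validate_change(const struct sim_cpuset *trial)
{
	if ((trial->flags & (1UL << SIM_MEM_EXCLUSIVE)) &&
	    trial->mems_allowed == 0)
		return -EINVAL;
	return 0;
}

/* Apply the change to a copy, validate the copy, then commit. */
static int sim_update_flag(struct sim_cpuset *cs, int bit, int turning_on)
{
	struct sim_cpuset trial;
	int err;

	assert(cs != NULL);
	trial = *cs;

	if (turning_on)
		trial.flags |= 1UL << bit;
	else
		trial.flags &= ~(1UL << bit);

	err = sim_validate_change(&trial);
	if (err < 0)
		return err;	/* *cs was never touched */

	/* The kernel code above takes callback_mutex around this store. */
	cs->flags = trial.flags;
	return 0;
}
```

Because a rejected request returns before the commit step, the live structure never holds a state that failed validation, and readers that only take the inner lock never observe a half-applied change.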
|
|
|
|
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
|
|
|
/*
|
2006-07-01 00:27:16 +08:00
|
|
|
* Frequency meter - How fast is some event occurring?
|
2006-01-08 17:01:49 +08:00
|
|
|
*
|
|
|
|
* These routines manage a digitally filtered, constant time based,
|
|
|
|
* event frequency meter. There are four routines:
|
|
|
|
* fmeter_init() - initialize a frequency meter.
|
|
|
|
* fmeter_markevent() - called each time the event happens.
|
|
|
|
* fmeter_getrate() - returns the recent rate of such events.
|
|
|
|
* fmeter_update() - internal routine used to update fmeter.
|
|
|
|
*
|
|
|
|
* A common data structure is passed to each of these routines,
|
|
|
|
* which is used to keep track of the state required to manage the
|
|
|
|
* frequency meter and its digital filter.
|
|
|
|
*
|
|
|
|
* The filter works on the number of events marked per unit time.
|
|
|
|
* The filter is single-pole low-pass recursive (IIR). The time unit
|
|
|
|
* is 1 second. Arithmetic is done using 32-bit integers scaled to
|
|
|
|
* simulate 3 decimal digits of precision (multiplied by 1000).
|
|
|
|
*
|
|
|
|
* With an FM_COEF of 933, and a time base of 1 second, the filter
|
|
|
|
* has a half-life of 10 seconds, meaning that if the events quit
|
|
|
|
* happening, then the rate returned from the fmeter_getrate()
|
|
|
|
* will be cut in half each 10 seconds, until it converges to zero.
|
|
|
|
*
|
|
|
|
* It is not worth doing a real infinitely recursive filter. If more
|
|
|
|
* than FM_MAXTICKS ticks have elapsed since the last filter event,
|
|
|
|
* just compute FM_MAXTICKS ticks worth, by which point the level
|
|
|
|
* will be stable.
|
|
|
|
*
|
|
|
|
* Limit the count of unprocessed events to FM_MAXCNT, so as to avoid
|
|
|
|
* arithmetic overflow in the fmeter_update() routine.
|
|
|
|
*
|
|
|
|
* Given the simple 32 bit integer arithmetic used, this meter works
|
|
|
|
* best for reporting rates between one per millisecond (msec) and
|
|
|
|
* one per 32 (approx) seconds. At constant rates faster than one
|
|
|
|
* per msec it maxes out at values just under 1,000,000. At constant
|
|
|
|
* rates between one per msec, and one per second it will stabilize
|
|
|
|
* to a value N*1000, where N is the rate of events per second.
|
|
|
|
* At constant rates between one per second and one per 32 seconds,
|
|
|
|
* it will be choppy, moving up on the seconds that have an event,
|
|
|
|
* and then decaying until the next event. At rates slower than
|
|
|
|
* about one in 32 seconds, it decays all the way back to zero between
|
|
|
|
* each event.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define FM_COEF 933 /* coefficient for half-life of 10 secs */
|
|
|
|
#define FM_MAXTICKS ((time_t)99) /* useless computing more ticks than this */
|
|
|
|
#define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */
|
|
|
|
#define FM_SCALE 1000 /* faux fixed point scale */
|
|
|
|
|
|
|
|
/* Initialize a frequency meter */
|
|
|
|
static void fmeter_init(struct fmeter *fmp)
|
|
|
|
{
|
|
|
|
fmp->cnt = 0;
|
|
|
|
fmp->val = 0;
|
|
|
|
fmp->time = 0;
|
|
|
|
spin_lock_init(&fmp->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Internal meter update - process cnt events and update value */
|
|
|
|
static void fmeter_update(struct fmeter *fmp)
|
|
|
|
{
|
|
|
|
time_t now = get_seconds();
|
|
|
|
time_t ticks = now - fmp->time;
|
|
|
|
|
|
|
|
if (ticks == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ticks = min(FM_MAXTICKS, ticks);
|
|
|
|
while (ticks-- > 0)
|
|
|
|
fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
|
|
|
|
fmp->time = now;
|
|
|
|
|
|
|
|
fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
|
|
|
|
fmp->cnt = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Process any previous ticks, then bump cnt by one (times scale). */
|
|
|
|
static void fmeter_markevent(struct fmeter *fmp)
|
|
|
|
{
|
|
|
|
spin_lock(&fmp->lock);
|
|
|
|
fmeter_update(fmp);
|
|
|
|
fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE);
|
|
|
|
spin_unlock(&fmp->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Process any previous ticks, then return current value. */
|
|
|
|
static int fmeter_getrate(struct fmeter *fmp)
|
|
|
|
{
|
|
|
|
int val;
|
|
|
|
|
|
|
|
spin_lock(&fmp->lock);
|
|
|
|
fmeter_update(fmp);
|
|
|
|
val = fmp->val;
|
|
|
|
spin_unlock(&fmp->lock);
|
|
|
|
return val;
|
|
|
|
}
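The filter arithmetic in fmeter_update() can be tried outside the kernel. The sketch below copies the constants and the integer arithmetic from the routines above, but it is only an illustration: the `sim_` names are hypothetical, the current time is passed in as a parameter instead of coming from get_seconds(), and the spinlock is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_FM_COEF	933	/* same value as FM_COEF */
#define SIM_FM_MAXTICKS	99	/* same value as FM_MAXTICKS */
#define SIM_FM_MAXCNT	1000000	/* same value as FM_MAXCNT */
#define SIM_FM_SCALE	1000	/* same value as FM_SCALE */

/* Hypothetical userspace copy of the meter state, without the lock. */
struct sim_fmeter {
	int cnt;	/* unprocessed events, scaled by SIM_FM_SCALE */
	int val;	/* filtered rate, scaled by SIM_FM_SCALE */
	long time;	/* second at which val was last updated */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void sim_fmeter_update(struct sim_fmeter *f, long now)
{
	long ticks;

	assert(f != NULL);
	ticks = now - f->time;
	if (ticks == 0)
		return;
	if (ticks > SIM_FM_MAXTICKS)
		ticks = SIM_FM_MAXTICKS;
	while (ticks-- > 0)
		f->val = (SIM_FM_COEF * f->val) / SIM_FM_SCALE;
	f->time = now;

	f->val += ((SIM_FM_SCALE - SIM_FM_COEF) * f->cnt) / SIM_FM_SCALE;
	f->cnt = 0;
}

/* Record one event at second 'now'. */
static void sim_fmeter_markevent(struct sim_fmeter *f, long now)
{
	sim_fmeter_update(f, now);
	f->cnt += SIM_FM_SCALE;
	if (f->cnt > SIM_FM_MAXCNT)
		f->cnt = SIM_FM_MAXCNT;
}

/* Return the filtered rate as of second 'now'. */
static int sim_fmeter_getrate(struct sim_fmeter *f, long now)
{
	sim_fmeter_update(f, now);
	return f->val;
}
```

Starting from a value of 1000 and reading the rate ten seconds later with no events in between yields roughly half of the starting value, which matches the 10 second half-life described in the comment. Marking five events in every second for a few minutes settles just under 5000, that is, N*1000 for N events per second, slightly reduced by integer truncation.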
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
|
2012-01-31 13:47:36 +08:00
|
|
|
static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
|
2011-05-27 07:25:19 +08:00
|
|
|
{
|
2011-12-13 10:12:21 +08:00
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
2011-12-13 10:12:21 +08:00
|
|
|
struct task_struct *task;
|
|
|
|
int ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
|
|
|
|
|
|
|
ret = -ENOSPC;
|
2009-01-08 10:08:44 +08:00
|
|
|
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
|
2013-01-08 00:51:08 +08:00
|
|
|
goto out_unlock;
|
2008-06-06 03:57:11 +08:00
|
|
|
|
2011-12-13 10:12:21 +08:00
|
|
|
cgroup_taskset_for_each(task, cgrp, tset) {
|
|
|
|
/*
|
|
|
|
* Kthreads bound to specific cpus cannot be moved to a new
|
|
|
|
* cpuset; we cannot change their cpu affinity and
|
|
|
|
* isolating such threads by their set of allowed nodes is
|
|
|
|
* unnecessary. Thus, cpusets are not applicable for such
|
|
|
|
* threads. This prevents checking for success of
|
|
|
|
* set_cpus_allowed_ptr() on all attached tasks before
|
|
|
|
* cpus_allowed may be changed.
|
|
|
|
*/
|
2013-01-08 00:51:08 +08:00
|
|
|
ret = -EINVAL;
|
2011-12-13 10:12:21 +08:00
|
|
|
if (task->flags & PF_THREAD_BOUND)
|
2013-01-08 00:51:08 +08:00
|
|
|
goto out_unlock;
|
|
|
|
ret = security_task_setscheduler(task);
|
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
2011-12-13 10:12:21 +08:00
|
|
|
}
|
2011-05-27 07:25:19 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* Mark that an attach is in progress. This makes validate_change() fail
|
|
|
|
* changes which zero cpus/mems_allowed.
|
|
|
|
*/
|
|
|
|
cs->attach_in_progress++;
|
2013-01-08 00:51:08 +08:00
|
|
|
ret = 0;
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&cpuset_mutex);
|
|
|
|
return ret;
|
2007-10-19 14:39:39 +08:00
|
|
|
}
|
2011-05-27 07:25:19 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
static void cpuset_cancel_attach(struct cgroup *cgrp,
|
|
|
|
struct cgroup_taskset *tset)
|
|
|
|
{
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
2013-01-08 00:51:07 +08:00
|
|
|
cgroup_cs(cgrp)->attach_in_progress--;
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_unlock(&cpuset_mutex);
|
2007-10-19 14:39:39 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach()
|
2013-01-08 00:51:07 +08:00
|
|
|
* but we can't allocate it dynamically there. Define it global and
|
|
|
|
* allocate from cpuset_init().
|
|
|
|
*/
|
|
|
|
static cpumask_var_t cpus_attach;
|
|
|
|
|
2012-01-31 13:47:36 +08:00
|
|
|
static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
|
2007-10-19 14:39:39 +08:00
|
|
|
{
|
2013-01-08 00:51:08 +08:00
|
|
|
/* static bufs protected by cpuset_mutex */
|
2013-01-08 00:51:07 +08:00
|
|
|
static nodemask_t cpuset_attach_nodemask_from;
|
|
|
|
static nodemask_t cpuset_attach_nodemask_to;
|
2007-10-19 14:39:39 +08:00
|
|
|
struct mm_struct *mm;
|
2011-12-13 10:12:21 +08:00
|
|
|
struct task_struct *task;
|
|
|
|
struct task_struct *leader = cgroup_taskset_first(tset);
|
2011-12-13 10:12:21 +08:00
|
|
|
struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
|
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
|
|
|
struct cpuset *oldcs = cgroup_cs(oldcgrp);
|
2006-06-23 17:04:00 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* prepare for attach */
|
|
|
|
if (cs == &top_cpuset)
|
|
|
|
cpumask_copy(cpus_attach, cpu_possible_mask);
|
|
|
|
else
|
|
|
|
guarantee_online_cpus(cs, cpus_attach);
|
|
|
|
|
|
|
|
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
|
|
|
|
|
2011-12-13 10:12:21 +08:00
|
|
|
cgroup_taskset_for_each(task, cgrp, tset) {
|
|
|
|
/*
|
|
|
|
* can_attach beforehand should guarantee that this doesn't
|
|
|
|
* fail. TODO: have a better way to handle failure here
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
|
|
|
|
|
|
|
|
cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
|
|
|
|
cpuset_update_task_spread_flag(cs, task);
|
|
|
|
}
|
2006-06-23 17:04:00 +08:00
|
|
|
|
2011-05-27 07:25:19 +08:00
|
|
|
/*
|
|
|
|
* Change mm, possibly for multiple threads in a threadgroup. This is
|
|
|
|
* expensive and may sleep.
|
|
|
|
*/
|
|
|
|
cpuset_attach_nodemask_from = oldcs->mems_allowed;
|
|
|
|
cpuset_attach_nodemask_to = cs->mems_allowed;
|
2011-12-13 10:12:21 +08:00
|
|
|
mm = get_task_mm(leader);
|
[PATCH] cpuset: rebind vma mempolicies fix
Fix more of a longstanding bug in the cpuset/mempolicy interaction.
NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
to just the Memory Nodes allowed by that cpuset. The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
When a task's cpuset memory placement changes, whether because the cpuset
changed or because the task was attached to a different cpuset, the
task's mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.
An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.
This fix rebinds mempolicies attached to vma's (address ranges in a task's
address space). Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired. The task's mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpuset's mems_generation.
Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan. In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock). So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.
Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a task's mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.
When a task is moved to a different cpuset, it is easier, as there is only one
task involved. Its mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.
It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice. This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:59 +08:00
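The commit message above says rebinding preserves the cpuset-relative numbering of a policy's nodes and that a second rebind to the same nodes is a no-op. A minimal user-space sketch of that idea follows. It is not the kernel's mpol_rebind_policy(); the toy_* names, the 64-bit mask type, and the wrap-around rule (ordinal modulo the number of new nodes) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is present. */
typedef uint64_t toy_nodemask_t;

/* Count the set bits in a mask. */
static int toy_weight(toy_nodemask_t mask)
{
	int weight = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		weight++;
	}
	return weight;
}

/* Return the bit index of the n-th (0-based) set bit, or -1 if there is none. */
static int toy_nth_set(toy_nodemask_t mask, int n)
{
	for (int bit = 0; bit < 64; bit++) {
		if (!(mask & (UINT64_C(1) << bit)))
			continue;
		if (n-- == 0)
			return bit;
	}
	return -1;
}

/*
 * Map each policy node to the node holding the same ordinal position in
 * new_allowed that it held in old_allowed. The policy must be a subset
 * of old_allowed. Rebinding to an identical mask returns the policy
 * unchanged, which is the "safe no-op" case.
 */
static toy_nodemask_t toy_rebind(toy_nodemask_t policy,
				 toy_nodemask_t old_allowed,
				 toy_nodemask_t new_allowed)
{
	toy_nodemask_t rebound = 0;
	int new_weight = toy_weight(new_allowed);

	assert((policy & ~old_allowed) == 0);
	if (old_allowed == new_allowed || new_weight == 0)
		return policy;

	for (int bit = 0; bit < 64; bit++) {
		toy_nodemask_t below = (UINT64_C(1) << bit) - 1;
		int ordinal;

		if (!(policy & (UINT64_C(1) << bit)))
			continue;
		/* Position of this node among the old allowed nodes. */
		ordinal = toy_weight(old_allowed & below);
		rebound |= UINT64_C(1) << toy_nth_set(new_allowed,
						      ordinal % new_weight);
	}
	return rebound;
}
```

For example, a policy on nodes 1 and 2 of a cpuset spanning nodes 0-3 would move to nodes 5 and 6 when the cpuset is changed to nodes 4-7.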
|
|
|
if (mm) {
|
2011-05-27 07:25:19 +08:00
|
|
|
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
|
2006-03-31 18:30:51 +08:00
|
|
|
if (is_memory_migrate(cs))
|
2011-05-27 07:25:19 +08:00
|
|
|
cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
|
|
|
|
&cpuset_attach_nodemask_to);
|
2006-01-08 17:01:59 +08:00
|
|
|
mmput(mm);
|
|
|
|
}
|
2013-01-08 00:51:07 +08:00
|
|
|
|
|
|
|
cs->attach_in_progress--;
|
2013-01-08 00:51:08 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We may have raced with CPU/memory hotunplug. Trigger hotplug
|
|
|
|
* propagation if @cs doesn't have any CPU or memory. It will move
|
|
|
|
* the newly added tasks to the nearest parent which can execute.
|
|
|
|
*/
|
|
|
|
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
|
|
|
|
schedule_cpuset_propagate_hotplug(cs);
|
2013-01-08 00:51:08 +08:00
|
|
|
|
|
|
|
mutex_unlock(&cpuset_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* The various types of files and directories in a cpuset file system */
|
|
|
|
|
|
|
|
typedef enum {
|
[PATCH] cpusets: swap migration interface
Add a boolean "memory_migrate" to each cpuset, represented by a file
containing "0" or "1" in each directory below /dev/cpuset.
It defaults to false (file contains "0"). It can be set true by writing
"1" to the file.
If true, then any time a task is attached to the cpuset so marked, the
pages of that task will be moved to that cpuset, preserving, to the extent
practical, the cpuset-relative placement of the pages.
Also, any time a cpuset so marked has its memory placement changed (by
writing to its "mems" file), the tasks in that cpuset will have their pages
moved to the cpuset's new nodes, preserving, to the extent practical, the
cpuset-relative placement of the moved pages.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:00:56 +08:00
|
|
|
FILE_MEMORY_MIGRATE,
|
2005-04-17 06:20:36 +08:00
|
|
|
FILE_CPULIST,
|
|
|
|
FILE_MEMLIST,
|
|
|
|
FILE_CPU_EXCLUSIVE,
|
|
|
|
FILE_MEM_EXCLUSIVE,
|
2008-04-29 16:00:26 +08:00
|
|
|
FILE_MEM_HARDWALL,
|
2007-10-19 14:40:20 +08:00
|
|
|
FILE_SCHED_LOAD_BALANCE,
|
2008-04-15 13:04:23 +08:00
|
|
|
FILE_SCHED_RELAX_DOMAIN_LEVEL,
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
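The meter described above is a running average with a half-life of about 10 seconds, reported as events per second times 1000. The following user-space sketch shows one way such a fixed-point filter could work. The toy_* names, the explicit `now` argument in whole seconds, and the constants 933 and 1000 are assumptions for illustration (0.933 raised to the 10th power is roughly 0.5); the sketch omits the spinlock and any overflow clamping a real implementation would need.

```c
#include <assert.h>

#define TOY_COEF  933	/* per-second decay factor, scaled by TOY_SCALE */
#define TOY_SCALE 1000	/* fixed-point scale, also the "times 1000" unit */

struct toy_meter {
	long val;	/* decayed event rate: events per second * TOY_SCALE */
	long cnt;	/* events not yet folded into val, times TOY_SCALE */
	long last;	/* time of the last update, in whole seconds */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void toy_meter_update(struct toy_meter *m, long now)
{
	assert(now >= m->last);
	for (long t = m->last; t < now; t++)
		m->val = (TOY_COEF * m->val) / TOY_SCALE;
	m->last = now;
	m->val += ((TOY_SCALE - TOY_COEF) * m->cnt) / TOY_SCALE;
	m->cnt = 0;
}

/* Record one event (for example, one entry into direct reclaim). */
static void toy_meter_mark(struct toy_meter *m, long now)
{
	toy_meter_update(m, now);
	m->cnt += TOY_SCALE;
}

/* Return the current filtered rate; a single read is enough. */
static long toy_meter_read(struct toy_meter *m, long now)
{
	toy_meter_update(m, now);
	return m->val;
}
```

Because the filter decays on every read, a monitor gets a recent-rate figure from one call instead of sampling a counter over time, which is the property the commit message argues for.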
|
|
|
FILE_MEMORY_PRESSURE_ENABLED,
|
|
|
|
FILE_MEMORY_PRESSURE,
|
[PATCH] cpuset memory spread basic implementation
This patch provides the implementation and cpuset interface for an alternative
memory allocation policy that can be applied to certain kinds of memory
allocations, such as the page cache (file system buffers) and some slab caches
(such as inode caches).
The policy is called "memory spreading." If enabled, it spreads out these
kinds of memory allocations over all the nodes allowed to a task, instead of
preferring to place them on the node where the task is executing.
All other kinds of allocations, including anonymous pages for a task's stack
and data regions, are not affected by this policy choice, and continue to be
allocated preferring the node local to execution, as modified by the NUMA
mempolicy.
There are two boolean flag files per cpuset that control where the kernel
allocates pages for the file system buffers and related in-kernel data
structures. They are called 'memory_spread_page' and 'memory_spread_slab'.
If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
kernel will spread the file system buffers (page cache) evenly over all the
nodes that the faulting task is allowed to use, instead of preferring to put
those pages on the node where the task is running.
If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
kernel will spread some file system related slab caches, such as those for
inodes and dentries, evenly over all the nodes that the faulting task is allowed to
use, instead of preferring to put those pages on the node where the task is
running.
The implementation is simple. Setting the cpuset flags 'memory_spread_page'
or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
subsequently joins that cpuset. In subsequent patches, the page allocation
calls for the affected page cache and slab caches are modified to perform an
inline check for these flags, and if set, a call to a new routine
cpuset_mem_spread_node() returns the node to prefer for the allocation.
The cpuset_mem_spread_node() routine is also simple. It uses the value of a
per-task rotor cpuset_mem_spread_rotor to select the next node in the current
task's mems_allowed to prefer for the allocation.
This policy can provide substantial improvements for jobs that need to place
thread local data on the corresponding node, but that need to access large
file system data sets that need to be spread across the several nodes in the
job's cpuset in order to fit. Without this patch, especially for jobs that
might have one thread reading in the data set, the memory allocation across
the nodes in the job's cpuset can become very uneven.
A couple of Copyright year ranges are updated as well. And a couple of email
addresses that can be found in the MAINTAINERS file are removed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 19:16:03 +08:00
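The rotor described above can be pictured as a cursor that steps through the set bits of the task's allowed-node mask and wraps at the end. This is a minimal user-space sketch of that selection, not the kernel's cpuset_mem_spread_node(); the toy_* names and the 64-bit mask type are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is allowed. */
typedef uint64_t toy_nodemask_t;

/*
 * Advance *rotor to the next allowed node after its current value,
 * wrapping around, and return that node. Start with *rotor == -1.
 * Returns -1 if no node is allowed.
 */
static int toy_spread_next_node(toy_nodemask_t allowed, int *rotor)
{
	assert(*rotor >= -1 && *rotor < 64);
	if (allowed == 0)
		return -1;
	for (int step = 1; step <= 64; step++) {
		int node = (*rotor + step) % 64;

		if (allowed & (UINT64_C(1) << node)) {
			*rotor = node;
			return node;
		}
	}
	return -1;	/* not reached for a non-empty mask */
}
```

With nodes 1, 3 and 4 allowed, successive calls return 1, 3, 4, 1, and so on, so consecutive page cache or slab allocations would be spread over all allowed nodes rather than landing on the local one.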
|
|
|
FILE_SPREAD_PAGE,
|
|
|
|
FILE_SPREAD_SLAB,
|
2005-04-17 06:20:36 +08:00
|
|
|
} cpuset_filetype_t;
|
|
|
|
|
2008-04-29 16:00:00 +08:00
|
|
|
static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
|
|
|
|
{
|
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
|
|
|
cpuset_filetype_t type = cft->private;
|
2013-01-08 00:51:08 +08:00
|
|
|
int retval = -ENODEV;
|
2008-04-29 16:00:00 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
|
|
|
if (!is_cpuset_online(cs))
|
|
|
|
goto out_unlock;
|
2008-04-29 16:00:00 +08:00
|
|
|
|
|
|
|
switch (type) {
|
2005-04-17 06:20:36 +08:00
|
|
|
case FILE_CPU_EXCLUSIVE:
|
2008-04-29 16:00:00 +08:00
|
|
|
retval = update_flag(CS_CPU_EXCLUSIVE, cs, val);
|
2005-04-17 06:20:36 +08:00
|
|
|
break;
|
|
|
|
case FILE_MEM_EXCLUSIVE:
|
2008-04-29 16:00:00 +08:00
|
|
|
retval = update_flag(CS_MEM_EXCLUSIVE, cs, val);
|
2005-04-17 06:20:36 +08:00
|
|
|
break;
|
2008-04-29 16:00:26 +08:00
|
|
|
case FILE_MEM_HARDWALL:
|
|
|
|
retval = update_flag(CS_MEM_HARDWALL, cs, val);
|
|
|
|
break;
|
2007-10-19 14:40:20 +08:00
|
|
|
case FILE_SCHED_LOAD_BALANCE:
|
2008-04-29 16:00:00 +08:00
|
|
|
retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
|
2008-04-15 13:04:23 +08:00
|
|
|
break;
|
2006-01-08 17:00:56 +08:00
|
|
|
case FILE_MEMORY_MIGRATE:
|
2008-04-29 16:00:00 +08:00
|
|
|
retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
|
2006-01-08 17:00:56 +08:00
|
|
|
break;
|
2006-01-08 17:01:49 +08:00
|
|
|
case FILE_MEMORY_PRESSURE_ENABLED:
|
2008-04-29 16:00:00 +08:00
|
|
|
cpuset_memory_pressure_enabled = !!val;
|
2006-01-08 17:01:49 +08:00
|
|
|
break;
|
|
|
|
case FILE_MEMORY_PRESSURE:
|
|
|
|
retval = -EACCES;
|
|
|
|
break;
|
2006-03-24 19:16:03 +08:00
|
|
|
case FILE_SPREAD_PAGE:
|
2008-04-29 16:00:00 +08:00
|
|
|
retval = update_flag(CS_SPREAD_PAGE, cs, val);
|
2006-03-24 19:16:03 +08:00
|
|
|
break;
|
|
|
|
case FILE_SPREAD_SLAB:
|
2008-04-29 16:00:00 +08:00
|
|
|
retval = update_flag(CS_SPREAD_SLAB, cs, val);
|
2006-03-24 19:16:03 +08:00
|
|
|
break;
|
2005-04-17 06:20:36 +08:00
|
|
|
default:
|
|
|
|
retval = -EINVAL;
|
2008-04-29 16:00:00 +08:00
|
|
|
break;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2013-01-08 00:51:08 +08:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&cpuset_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2008-05-07 11:42:41 +08:00
|
|
|
static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
|
|
|
|
{
|
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
|
|
|
cpuset_filetype_t type = cft->private;
|
2013-01-08 00:51:08 +08:00
|
|
|
int retval = -ENODEV;
|
2008-05-07 11:42:41 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
|
|
|
if (!is_cpuset_online(cs))
|
|
|
|
goto out_unlock;
|
2008-07-25 16:47:02 +08:00
|
|
|
|
2008-05-07 11:42:41 +08:00
|
|
|
switch (type) {
|
|
|
|
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
|
|
|
|
retval = update_relax_domain_level(cs, val);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
retval = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2013-01-08 00:51:08 +08:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&cpuset_mutex);
|
2008-05-07 11:42:41 +08:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2008-07-25 16:47:02 +08:00
|
|
|
/*
|
|
|
|
* Common handling for a write to a "cpus" or "mems" file.
|
|
|
|
*/
|
|
|
|
static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
|
|
|
|
const char *buf)
|
|
|
|
{
|
2009-01-08 10:08:43 +08:00
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
|
|
|
struct cpuset *trialcs;
|
2013-01-08 00:51:08 +08:00
|
|
|
int retval = -ENODEV;
|
2008-07-25 16:47:02 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* CPU or memory hotunplug may leave @cs w/o any execution
|
|
|
|
* resources, in which case the hotplug code asynchronously updates
|
|
|
|
* configuration and transfers all tasks to the nearest ancestor
|
|
|
|
* which can execute.
|
|
|
|
*
|
|
|
|
* As writes to "cpus" or "mems" may restore @cs's execution
|
|
|
|
* resources, wait for the previously scheduled operations before
|
|
|
|
* proceeding, so that we don't keep removing tasks added
|
|
|
|
* after execution capability is restored.
|
2013-01-08 00:51:08 +08:00
|
|
|
*
|
|
|
|
* Flushing cpuset_hotplug_work is enough to synchronize against
|
|
|
|
* hotplug handling; however, cpuset_attach() may schedule
|
|
|
|
* propagation work directly. Flush the workqueue too.
|
2013-01-08 00:51:07 +08:00
|
|
|
*/
|
|
|
|
flush_work(&cpuset_hotplug_work);
|
2013-01-08 00:51:08 +08:00
|
|
|
flush_workqueue(cpuset_propagate_hotplug_wq);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
|
|
|
if (!is_cpuset_online(cs))
|
|
|
|
goto out_unlock;
|
2008-07-25 16:47:02 +08:00
|
|
|
|
2009-01-08 10:08:43 +08:00
|
|
|
trialcs = alloc_trial_cpuset(cs);
|
2011-03-05 09:36:21 +08:00
|
|
|
if (!trialcs) {
|
|
|
|
retval = -ENOMEM;
|
2013-01-08 00:51:08 +08:00
|
|
|
goto out_unlock;
|
2011-03-05 09:36:21 +08:00
|
|
|
}
|
2009-01-08 10:08:43 +08:00
|
|
|
|
2008-07-25 16:47:02 +08:00
|
|
|
switch (cft->private) {
|
|
|
|
case FILE_CPULIST:
|
2009-01-08 10:08:43 +08:00
|
|
|
retval = update_cpumask(cs, trialcs, buf);
|
2008-07-25 16:47:02 +08:00
|
|
|
break;
|
|
|
|
case FILE_MEMLIST:
|
2009-01-08 10:08:43 +08:00
|
|
|
retval = update_nodemask(cs, trialcs, buf);
|
2008-07-25 16:47:02 +08:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
retval = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2009-01-08 10:08:43 +08:00
|
|
|
|
|
|
|
free_trial_cpuset(trialcs);
|
2013-01-08 00:51:08 +08:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&cpuset_mutex);
|
2008-07-25 16:47:02 +08:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* These ascii lists should be read in a single call, by using a user
|
|
|
|
* buffer large enough to hold the entire map. If read in smaller
|
|
|
|
* chunks, there is no guarantee of atomicity. Since the display format
|
|
|
|
* used, list of ranges of sequential numbers, is variable length,
|
|
|
|
* and since these maps can change value dynamically, one could read
|
|
|
|
* gibberish by doing partial reads while a list was changing.
|
|
|
|
* A single large read to a buffer that crosses a page boundary is
|
|
|
|
* ok, because the result being copied to user land is not recomputed
|
|
|
|
* across a page fault.
|
|
|
|
*/
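From the user side, the advice in the comment above amounts to issuing one read() with a buffer big enough for the whole list. The following is a minimal sketch, assuming a hypothetical helper name read_list_once() and a caller-chosen buffer size. It treats a completely filled buffer as a possible truncation instead of retrying with partial reads.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Read a list file with a single read() call into buf (capacity cap).
 * Returns the number of bytes read and NUL-terminates buf, or returns -1
 * on error or when the buffer may have been too small for the whole list.
 */
static ssize_t read_list_once(const char *path, char *buf, size_t cap)
{
	assert(buf != NULL && cap > 1);

	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ssize_t n = read(fd, buf, cap - 1);	/* one call, no chunked loop */
	close(fd);

	if (n < 0 || (size_t)n == cap - 1)
		return -1;	/* error, or the list was possibly truncated */

	buf[n] = '\0';
	return n;
}
```

A caller would pass the path of a cpuset's "cpus" or "mems" file and, on a -1 result caused by a small buffer, start over with a larger buffer rather than continue reading from the current offset.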
|
|
|
|
|
2011-03-24 07:42:45 +08:00
|
|
|
static size_t cpuset_sprintf_cpulist(char *page, struct cpuset *cs)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2011-03-24 07:42:45 +08:00
|
|
|
size_t count;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2011-03-24 07:42:45 +08:00
|
|
|
count = cpulist_scnprintf(page, PAGE_SIZE, cs->cpus_allowed);
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2011-03-24 07:42:45 +08:00
|
|
|
return count;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2011-03-24 07:42:45 +08:00
|
|
|
static size_t cpuset_sprintf_memlist(char *page, struct cpuset *cs)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2011-03-24 07:42:45 +08:00
|
|
|
size_t count;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2011-03-24 07:42:45 +08:00
|
|
|
count = nodelist_scnprintf(page, PAGE_SIZE, cs->mems_allowed);
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2011-03-24 07:42:45 +08:00
|
|
|
return count;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
static ssize_t cpuset_common_file_read(struct cgroup *cont,
|
|
|
|
struct cftype *cft,
|
|
|
|
struct file *file,
|
|
|
|
char __user *buf,
|
|
|
|
size_t nbytes, loff_t *ppos)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cpuset *cs = cgroup_cs(cont);
|
2005-04-17 06:20:36 +08:00
|
|
|
cpuset_filetype_t type = cft->private;
|
|
|
|
char *page;
|
|
|
|
ssize_t retval = 0;
|
|
|
|
char *s;
|
|
|
|
|
2007-10-16 16:25:52 +08:00
|
|
|
if (!(page = (char *)__get_free_page(GFP_TEMPORARY)))
|
2005-04-17 06:20:36 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
s = page;
|
|
|
|
|
|
|
|
switch (type) {
|
|
|
|
case FILE_CPULIST:
|
|
|
|
s += cpuset_sprintf_cpulist(s, cs);
|
|
|
|
break;
|
|
|
|
case FILE_MEMLIST:
|
|
|
|
s += cpuset_sprintf_memlist(s, cs);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
retval = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
*s++ = '\n';
|
|
|
|
|
2005-09-30 10:26:43 +08:00
|
|
|
retval = simple_read_from_buffer(buf, nbytes, ppos, page, s - page);
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
free_page((unsigned long)page);
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2008-04-29 16:00:00 +08:00
|
|
|
static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
|
|
|
|
{
|
|
|
|
struct cpuset *cs = cgroup_cs(cont);
|
|
|
|
cpuset_filetype_t type = cft->private;
|
|
|
|
switch (type) {
|
|
|
|
case FILE_CPU_EXCLUSIVE:
|
|
|
|
return is_cpu_exclusive(cs);
|
|
|
|
case FILE_MEM_EXCLUSIVE:
|
|
|
|
return is_mem_exclusive(cs);
|
2008-04-29 16:00:26 +08:00
|
|
|
case FILE_MEM_HARDWALL:
|
|
|
|
return is_mem_hardwall(cs);
|
2008-04-29 16:00:00 +08:00
|
|
|
case FILE_SCHED_LOAD_BALANCE:
|
|
|
|
return is_sched_load_balance(cs);
|
|
|
|
case FILE_MEMORY_MIGRATE:
|
|
|
|
return is_memory_migrate(cs);
|
|
|
|
case FILE_MEMORY_PRESSURE_ENABLED:
|
|
|
|
return cpuset_memory_pressure_enabled;
|
|
|
|
case FILE_MEMORY_PRESSURE:
|
|
|
|
return fmeter_getrate(&cs->fmeter);
|
|
|
|
case FILE_SPREAD_PAGE:
|
|
|
|
return is_spread_page(cs);
|
|
|
|
case FILE_SPREAD_SLAB:
|
|
|
|
return is_spread_slab(cs);
|
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all misformatted comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested it with the suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12 05:33:53 +08:00
|
|
|
|
|
|
|
/* Unreachable but makes gcc happy */
|
|
|
|
return 0;
|
2008-04-29 16:00:00 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-05-07 11:42:41 +08:00
|
|
|
static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft)
|
|
|
|
{
|
|
|
|
struct cpuset *cs = cgroup_cs(cont);
|
|
|
|
cpuset_filetype_t type = cft->private;
|
|
|
|
switch (type) {
|
|
|
|
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
|
|
|
|
return cs->relax_domain_level;
|
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
2008-08-12 05:33:53 +08:00
|
|
|
|
|
|
|
/* Unreachable but makes gcc happy */
|
|
|
|
return 0;
|
2008-05-07 11:42:41 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* for the common functions, 'private' gives the type of file
|
|
|
|
*/
|
|
|
|
|
2008-04-29 16:00:26 +08:00
|
|
|
static struct cftype files[] = {
|
|
|
|
{
|
|
|
|
.name = "cpus",
|
|
|
|
.read = cpuset_common_file_read,
|
2008-07-25 16:47:02 +08:00
|
|
|
.write_string = cpuset_write_resmask,
|
|
|
|
.max_write_len = (100U + 6 * NR_CPUS),
|
2008-04-29 16:00:26 +08:00
|
|
|
.private = FILE_CPULIST,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "mems",
|
|
|
|
.read = cpuset_common_file_read,
|
2008-07-25 16:47:02 +08:00
|
|
|
.write_string = cpuset_write_resmask,
|
|
|
|
.max_write_len = (100U + 6 * MAX_NUMNODES),
|
2008-04-29 16:00:26 +08:00
|
|
|
.private = FILE_MEMLIST,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "cpu_exclusive",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_CPU_EXCLUSIVE,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "mem_exclusive",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_MEM_EXCLUSIVE,
|
|
|
|
},
|
|
|
|
|
2008-04-29 16:00:26 +08:00
|
|
|
{
|
|
|
|
.name = "mem_hardwall",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_MEM_HARDWALL,
|
|
|
|
},
|
|
|
|
|
2008-04-29 16:00:26 +08:00
|
|
|
{
|
|
|
|
.name = "sched_load_balance",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_SCHED_LOAD_BALANCE,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "sched_relax_domain_level",
|
2008-05-07 11:42:41 +08:00
|
|
|
.read_s64 = cpuset_read_s64,
|
|
|
|
.write_s64 = cpuset_write_s64,
|
2008-04-29 16:00:26 +08:00
|
|
|
.private = FILE_SCHED_RELAX_DOMAIN_LEVEL,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "memory_migrate",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_MEMORY_MIGRATE,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "memory_pressure",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_MEMORY_PRESSURE,
|
2009-04-03 07:57:29 +08:00
|
|
|
.mode = S_IRUGO,
|
2008-04-29 16:00:26 +08:00
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "memory_spread_page",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_SPREAD_PAGE,
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
.name = "memory_spread_slab",
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_SPREAD_SLAB,
|
|
|
|
},
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
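The digital filter described above can be sketched as a small fixed-point decaying average. This is a sketch under stated assumptions, not the patch's code: the names carrying a _sketch suffix are hypothetical, time is passed in by the caller as whole seconds, and the constants are chosen so that ten one-second decay steps roughly halve the value (0.933^10 is about 0.5), which matches the 10 second half-life mentioned in the text.

```c
#include <assert.h>

#define FM_COEF_SKETCH   933		/* per-second decay, in thousandths */
#define FM_SCALE_SKETCH  1000		/* fixed-point scale: rate is events/sec times 1000 */
#define FM_MAXCNT_SKETCH 1000000	/* cap on pending events, keeps the math in int range */

struct fmeter_sketch {
	int cnt;	/* events since the last update, pre-scaled by FM_SCALE_SKETCH */
	int val;	/* filtered rate, events per second times 1000 */
	long time;	/* second at which val was last brought up to date */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void fmeter_sketch_update(struct fmeter_sketch *f, long now)
{
	assert(now >= f->time);

	for (long t = f->time; t < now && f->val > 0; t++)
		f->val = (FM_COEF_SKETCH * f->val) / FM_SCALE_SKETCH;
	f->time = now;

	f->val += ((FM_SCALE_SKETCH - FM_COEF_SKETCH) * f->cnt) / FM_SCALE_SKETCH;
	f->cnt = 0;
}

/* Record one direct-reclaim event at time now. */
static void fmeter_sketch_mark(struct fmeter_sketch *f, long now)
{
	fmeter_sketch_update(f, now);
	f->cnt += FM_SCALE_SKETCH;
	if (f->cnt > FM_MAXCNT_SKETCH)
		f->cnt = FM_MAXCNT_SKETCH;
}

/* Return the current filtered rate; a single call is enough for a monitor. */
static int fmeter_sketch_rate(struct fmeter_sketch *f, long now)
{
	fmeter_sketch_update(f, now);
	return f->val;
}
```

The state is three words per meter, and a reader gets a usable value from one call, which is the property the commit message argues for over an accumulating counter.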
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
|
|
|
|
2012-04-02 03:09:55 +08:00
|
|
|
{
|
|
|
|
.name = "memory_pressure_enabled",
|
|
|
|
.flags = CFTYPE_ONLY_ON_ROOT,
|
|
|
|
.read_u64 = cpuset_read_u64,
|
|
|
|
.write_u64 = cpuset_write_u64,
|
|
|
|
.private = FILE_MEMORY_PRESSURE_ENABLED,
|
|
|
|
},
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-04-02 03:09:55 +08:00
|
|
|
{ } /* terminate */
|
|
|
|
};
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
2012-11-20 00:13:38 +08:00
|
|
|
* cpuset_css_alloc - allocate a cpuset css
|
2008-02-07 16:14:45 +08:00
|
|
|
* cont: control group that the new cpuset will be part of
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
2012-11-20 00:13:38 +08:00
|
|
|
static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cont)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2013-01-08 00:51:07 +08:00
|
|
|
struct cpuset *cs;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
if (!cont->parent)
|
2007-10-19 14:39:39 +08:00
|
|
|
return &top_cpuset.css;
|
2012-11-20 00:13:39 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!cs)
|
2007-10-19 14:39:39 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2009-01-08 10:08:44 +08:00
|
|
|
if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
|
|
|
|
kfree(cs);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2007-10-19 14:40:20 +08:00
|
|
|
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_clear(cs->cpus_allowed);
|
2008-04-05 09:11:07 +08:00
|
|
|
nodes_clear(cs->mems_allowed);
|
2006-01-08 17:01:49 +08:00
|
|
|
fmeter_init(&cs->fmeter);
|
2013-01-08 00:51:07 +08:00
|
|
|
INIT_WORK(&cs->hotplug_work, cpuset_propagate_hotplug_workfn);
|
2008-04-15 13:04:23 +08:00
|
|
|
cs->relax_domain_level = -1;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
return &cs->css;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cpuset_css_online(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
2013-01-08 00:51:08 +08:00
|
|
|
struct cpuset *parent = parent_cs(cs);
|
2013-01-08 00:51:07 +08:00
|
|
|
struct cpuset *tmp_cs;
|
|
|
|
struct cgroup *pos_cg;
|
2013-01-08 00:51:07 +08:00
|
|
|
|
|
|
|
if (!parent)
|
|
|
|
return 0;
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
set_bit(CS_ONLINE, &cs->flags);
|
2013-01-08 00:51:07 +08:00
|
|
|
if (is_spread_page(parent))
|
|
|
|
set_bit(CS_SPREAD_PAGE, &cs->flags);
|
|
|
|
if (is_spread_slab(parent))
|
|
|
|
set_bit(CS_SPREAD_SLAB, &cs->flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-01-08 17:01:57 +08:00
|
|
|
number_of_cpusets++;
|
2012-11-20 00:13:39 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags))
|
2013-01-08 00:51:08 +08:00
|
|
|
goto out_unlock;
|
2012-11-20 00:13:39 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is
|
|
|
|
* set. This flag handling is implemented in cgroup core for
|
|
|
|
* historical reasons - the flag may be specified during mount.
|
|
|
|
*
|
|
|
|
* Currently, if any sibling cpusets have exclusive cpus or mem, we
|
|
|
|
* refuse to clone the configuration - thereby refusing the task to
|
|
|
|
* be entered, and as a result refusing the sys_unshare() or
|
|
|
|
* clone() which initiated it. If this becomes a problem for some
|
|
|
|
* users who wish to allow that scenario, then this could be
|
|
|
|
* changed to grant parent->cpus_allowed-sibling_cpus_exclusive
|
|
|
|
* (and likewise for mems) to the new cgroup.
|
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
cpuset_for_each_child(tmp_cs, pos_cg, parent) {
|
|
|
|
if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) {
|
|
|
|
rcu_read_unlock();
|
2013-01-08 00:51:08 +08:00
|
|
|
goto out_unlock;
|
2013-01-08 00:51:07 +08:00
|
|
|
}
|
2012-11-20 00:13:39 +08:00
|
|
|
}
|
2013-01-08 00:51:07 +08:00
|
|
|
rcu_read_unlock();
|
2012-11-20 00:13:39 +08:00
|
|
|
|
|
|
|
mutex_lock(&callback_mutex);
|
|
|
|
cs->mems_allowed = parent->mems_allowed;
|
|
|
|
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
|
|
|
|
mutex_unlock(&callback_mutex);
|
2013-01-08 00:51:08 +08:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&cpuset_mutex);
|
2013-01-08 00:51:07 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void cpuset_css_offline(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cpuset *cs = cgroup_cs(cgrp);
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
|
|
|
if (is_sched_load_balance(cs))
|
|
|
|
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
|
|
|
|
|
|
|
|
number_of_cpusets--;
|
2013-01-08 00:51:07 +08:00
|
|
|
clear_bit(CS_ONLINE, &cs->flags);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_unlock(&cpuset_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2007-10-19 14:40:20 +08:00
|
|
|
/*
|
|
|
|
* If the cpuset being removed has its flag 'sched_load_balance'
|
|
|
|
* enabled, then simulate turning sched_load_balance off, which
|
2013-01-08 00:51:07 +08:00
|
|
|
* will call rebuild_sched_domains_locked().
|
2007-10-19 14:40:20 +08:00
|
|
|
*/
|
|
|
|
|
2012-11-20 00:13:38 +08:00
|
|
|
static void cpuset_css_free(struct cgroup *cont)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cpuset *cs = cgroup_cs(cont);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-01-08 10:08:44 +08:00
|
|
|
free_cpumask_var(cs->cpus_allowed);
|
2007-10-19 14:39:39 +08:00
|
|
|
kfree(cs);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cgroup_subsys cpuset_subsys = {
|
|
|
|
.name = "cpuset",
|
2012-11-20 00:13:38 +08:00
|
|
|
.css_alloc = cpuset_css_alloc,
|
2013-01-08 00:51:07 +08:00
|
|
|
.css_online = cpuset_css_online,
|
|
|
|
.css_offline = cpuset_css_offline,
|
2012-11-20 00:13:38 +08:00
|
|
|
.css_free = cpuset_css_free,
|
2007-10-19 14:39:39 +08:00
|
|
|
.can_attach = cpuset_can_attach,
|
2013-01-08 00:51:07 +08:00
|
|
|
.cancel_attach = cpuset_cancel_attach,
|
2007-10-19 14:39:39 +08:00
|
|
|
.attach = cpuset_attach,
|
|
|
|
.subsys_id = cpuset_subsys_id,
|
2012-04-02 03:09:55 +08:00
|
|
|
.base_cftypes = files,
|
2007-10-19 14:39:39 +08:00
|
|
|
.early_init = 1,
|
|
|
|
};
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
|
|
|
* cpuset_init - initialize cpusets at system boot
|
|
|
|
*
|
|
|
|
* Description: Initialize top_cpuset and the cpuset internal file system.
|
|
|
|
**/
|
|
|
|
|
|
|
|
int __init cpuset_init(void)
|
|
|
|
{
|
2007-10-19 14:39:39 +08:00
|
|
|
int err = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the memory
policy code, because the memory policy was originally applied only in the
process's own context. After applying this patch, one task directly
manipulates another's mems_allowed, and we use alloc_lock in the
task_struct to protect the mems_allowed and memory policy of the task.
In the fast path, however, we don't take the lock to protect them, because
adding a lock may lead to a performance regression. But without a lock, the
task might see no nodes when its cpuset's mems_allowed changes to some
non-overlapping set. To avoid that, we first set all newly allowed nodes,
then clear the newly disallowed ones.
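The set-then-clear ordering can be illustrated with a plain bitmask and C11 atomics. This is a sketch of the ordering idea only, under stated assumptions: the 64-bit mask and the function name change_allowed_sketch() are hypothetical, and the real patch operates on the task's nodemask and memory policy rather than on a single atomic word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Move *mask from its old value to newmask in two steps so that a
 * lockless reader never observes an empty mask, even when the old and
 * new sets do not overlap. Returns the intermediate value, which is
 * what a reader could see between the two steps.
 */
static uint64_t change_allowed_sketch(_Atomic uint64_t *mask, uint64_t newmask)
{
	assert(newmask != 0);

	/* Step 1: set every newly allowed node; the old nodes stay set. */
	uint64_t before = atomic_fetch_or(mask, newmask);
	uint64_t intermediate = before | newmask;

	/* Step 2: clear the nodes that are no longer allowed. */
	atomic_fetch_and(mask, newmask);

	return intermediate;
}
```

Doing the two steps in the opposite order would briefly leave the mask empty when the old and new sets are disjoint, which is the failure the text describes.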
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mempolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
|
|
|
if (!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL))
|
|
|
|
BUG();
|
|
|
|
|
2009-01-08 10:08:44 +08:00
|
|
|
cpumask_setall(top_cpuset.cpus_allowed);
|
2008-04-05 09:11:07 +08:00
|
|
|
nodes_setall(top_cpuset.mems_allowed);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
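The filter described above can be sketched in user space as follows. This is
only an illustration under stated assumptions, not the kernel code: the names
fmeter_sketch, FM_COEF, FM_SCALE and FM_SKETCH_MAXTICKS are hypothetical, time
is passed in as whole seconds, and the spinlock is left out. A per-second decay
of 0.933 is assumed because 0.933^10 is roughly 0.5, which matches the stated
10 second half-life.

```c
#include <stdint.h>

/* Hypothetical user-space model of the per-cpuset meter; not kernel code. */
#define FM_SCALE 1000           /* fixed point: 1000 represents 1.0 */
#define FM_COEF 933             /* assumed per-second decay, 0.933^10 ~ 0.5 */
#define FM_SKETCH_MAXTICKS 99   /* assumed cap on the decay loop */

/* The "3 words of data": pending count, filtered value, last update time. */
struct fmeter_sketch {
	int64_t cnt;    /* events since the last update, times FM_SCALE */
	int64_t val;    /* filtered rate, events per second times FM_SCALE */
	int64_t time;   /* second at which val was last updated */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void fmeter_sketch_update(struct fmeter_sketch *fm, int64_t now)
{
	int64_t ticks = now - fm->time;

	if (ticks <= 0)
		return;
	if (ticks > FM_SKETCH_MAXTICKS)
		ticks = FM_SKETCH_MAXTICKS;
	while (ticks-- > 0)
		fm->val = (FM_COEF * fm->val) / FM_SCALE;
	fm->time = now;

	fm->val += ((FM_SCALE - FM_COEF) * fm->cnt) / FM_SCALE;
	fm->cnt = 0;
}

/* Record one direct-reclaim event at second 'now'. */
static void fmeter_sketch_mark(struct fmeter_sketch *fm, int64_t now)
{
	fmeter_sketch_update(fm, now);
	fm->cnt += FM_SCALE;
}

/* Return the current filtered rate, times FM_SCALE. */
static int64_t fmeter_sketch_read(struct fmeter_sketch *fm, int64_t now)
{
	fmeter_sketch_update(fm, now);
	return fm->val;
}
```

Because the value decays on its own, a monitor needs only one read to see
recent pressure; a real implementation would also bound cnt and serialize
updates with a lock.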
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
|
|
|
fmeter_init(&top_cpuset.fmeter);
|
2007-10-19 14:40:20 +08:00
|
|
|
set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
|
2008-04-15 13:04:23 +08:00
|
|
|
top_cpuset.relax_domain_level = -1;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
err = register_filesystem(&cpuset_fs_type);
|
|
|
|
if (err < 0)
|
2007-10-19 14:39:39 +08:00
|
|
|
return err;
|
|
|
|
|
2009-01-08 10:08:42 +08:00
|
|
|
if (!alloc_cpumask_var(&cpus_attach, GFP_KERNEL))
|
|
|
|
BUG();
|
|
|
|
|
2006-01-08 17:01:57 +08:00
|
|
|
number_of_cpusets = 1;
|
2007-10-19 14:39:39 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-02-07 16:14:43 +08:00
|
|
|
/**
|
|
|
|
* cpuset_do_move_task - move a given task to another cpuset
|
|
|
|
* @tsk: pointer to the task_struct of the task to move
|
|
|
|
* @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
|
|
|
|
*
|
|
|
|
* Called by cgroup_scan_tasks() for each task in a cgroup.
|
|
|
|
* Returns nothing; the walk always continues through the remaining tasks.
|
|
|
|
*/
|
2008-04-29 16:00:25 +08:00
|
|
|
static void cpuset_do_move_task(struct task_struct *tsk,
|
|
|
|
struct cgroup_scanner *scan)
|
2008-02-07 16:14:43 +08:00
|
|
|
{
|
2009-04-03 07:57:53 +08:00
|
|
|
struct cgroup *new_cgroup = scan->data;
|
2008-02-07 16:14:43 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
cgroup_lock();
|
2009-04-03 07:57:53 +08:00
|
|
|
cgroup_attach_task(new_cgroup, tsk);
|
2013-01-08 00:51:08 +08:00
|
|
|
cgroup_unlock();
|
2008-02-07 16:14:43 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* move_member_tasks_to_cpuset - move tasks from one cpuset to another
|
|
|
|
* @from: cpuset in which the tasks currently reside
|
|
|
|
* @to: cpuset to which the tasks will be moved
|
|
|
|
*
|
2013-01-08 00:51:08 +08:00
|
|
|
* Called with cpuset_mutex held
|
hotplug cpu: move tasks in empty cpusets to parent various other fixes
Various minor formatting and comment tweaks to Cliff Wickman's
[PATCH_3_of_3]_cpusets__update_cpumask_revision.patch
I had had "iff", meaning "if and only if" in a comment. However, except for
ancient mathematicians, the abbreviation "iff" was a tad too cryptic. Cliff
changed it to "if", presumably figuring that the "iff" was a typo. However,
it was the "only if" half of the conjunction that was most interesting.
Reword to emphasize the "only if" aspect.
The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
callback_mutex had to be held on entry. The opposite is true.
Several mentions of attach_task() in comments needed to be
changed to cgroup_attach_task().
A comment about notify_on_release was no longer relevant,
as the line of code it had commented, namely:
set_bit(CS_RELEASED_RESOURCE, &parent->flags);
is no longer present in that place in the cpuset.c code.
Similarly a comment about notify_on_release before the
scan_for_empty_cpusets() routine was no longer relevant.
Removed extra parentheses and unnecessary return statement.
Renamed attach_task() to cpuset_attach() in various comments.
Removed comment about not needing memory migration, as it seems the migration
is done anyway, via the cpuset_attach() callback from cgroup_attach_task().
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Cliff Wickman <cpw@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 16:14:46 +08:00
|
|
|
* callback_mutex must not be held, as cpuset_attach() will take it.
|
2008-02-07 16:14:43 +08:00
|
|
|
*
|
|
|
|
* The cgroup_scan_tasks() function will scan all the tasks in a cgroup,
|
|
|
|
* calling callback functions for each.
|
|
|
|
*/
|
|
|
|
static void move_member_tasks_to_cpuset(struct cpuset *from, struct cpuset *to)
|
|
|
|
{
|
2009-04-03 07:57:53 +08:00
|
|
|
struct cgroup_scanner scan;
|
2008-02-07 16:14:43 +08:00
|
|
|
|
2009-04-03 07:57:53 +08:00
|
|
|
scan.cg = from->css.cgroup;
|
|
|
|
scan.test_task = NULL; /* select all tasks in cgroup */
|
|
|
|
scan.process_task = cpuset_do_move_task;
|
|
|
|
scan.heap = NULL;
|
|
|
|
scan.data = to->css.cgroup;
|
2008-02-07 16:14:43 +08:00
|
|
|
|
2009-04-03 07:57:53 +08:00
|
|
|
if (cgroup_scan_tasks(&scan))
|
2008-02-07 16:14:43 +08:00
|
|
|
printk(KERN_ERR "move_member_tasks_to_cpuset: "
|
|
|
|
"cgroup_scan_tasks failed\n");
|
|
|
|
}
|
|
|
|
|
2006-09-29 17:01:17 +08:00
|
|
|
/*
|
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all misformatted comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested it with the suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12 05:33:53 +08:00
|
|
|
* If CPU and/or memory hotplug handlers, below, unplug any CPUs
|
2006-09-29 17:01:17 +08:00
|
|
|
* or memory nodes, we need to walk over the cpuset hierarchy,
|
|
|
|
* removing that CPU or node from all cpusets. If this removes the
|
2008-02-07 16:14:43 +08:00
|
|
|
* last CPU or node from a cpuset, then move the tasks in the empty
|
|
|
|
* cpuset to its next-highest non-empty parent.
|
2006-09-29 17:01:17 +08:00
|
|
|
*/
|
2008-02-07 16:14:43 +08:00
|
|
|
static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
|
|
|
|
{
|
|
|
|
struct cpuset *parent;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find its next-highest non-empty parent, (top cpuset
|
|
|
|
* has online cpus, so can't be empty).
|
|
|
|
*/
|
2013-01-08 00:51:08 +08:00
|
|
|
parent = parent_cs(cs);
|
2009-01-08 10:08:44 +08:00
|
|
|
while (cpumask_empty(parent->cpus_allowed) ||
|
2008-02-07 16:14:47 +08:00
|
|
|
nodes_empty(parent->mems_allowed))
|
2013-01-08 00:51:08 +08:00
|
|
|
parent = parent_cs(parent);
|
2008-02-07 16:14:43 +08:00
|
|
|
|
|
|
|
move_member_tasks_to_cpuset(cs, parent);
|
|
|
|
}
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/**
|
2013-01-08 00:51:07 +08:00
|
|
|
* cpuset_propagate_hotplug_workfn - propagate CPU/memory hotplug to a cpuset
|
2013-01-08 00:51:07 +08:00
|
|
|
* @cs: cpuset of interest
|
2008-02-07 16:14:43 +08:00
|
|
|
*
|
2013-01-08 00:51:07 +08:00
|
|
|
* Compare @cs's cpu and mem masks against top_cpuset and if some have gone
|
|
|
|
* offline, update @cs accordingly. If @cs ends up with no CPU or memory,
|
|
|
|
* all its tasks are moved to the nearest ancestor with both resources.
|
2012-05-24 22:16:41 +08:00
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
|
2012-05-24 22:16:41 +08:00
|
|
|
{
|
2013-01-08 00:51:07 +08:00
|
|
|
static cpumask_t off_cpus;
|
|
|
|
static nodemask_t off_mems, tmp_mems;
|
2013-01-08 00:51:07 +08:00
|
|
|
struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
|
2013-01-08 00:51:08 +08:00
|
|
|
bool is_empty;
|
2012-05-24 22:16:41 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
2012-05-24 22:16:55 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
|
|
|
|
nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
|
2012-05-24 22:16:41 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* remove offline cpus from @cs */
|
|
|
|
if (!cpumask_empty(&off_cpus)) {
|
|
|
|
mutex_lock(&callback_mutex);
|
|
|
|
cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
|
|
|
|
mutex_unlock(&callback_mutex);
|
|
|
|
update_tasks_cpumask(cs, NULL);
|
2012-05-24 22:16:41 +08:00
|
|
|
}
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* remove offline mems from @cs */
|
|
|
|
if (!nodes_empty(off_mems)) {
|
|
|
|
tmp_mems = cs->mems_allowed;
|
|
|
|
mutex_lock(&callback_mutex);
|
|
|
|
nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
|
|
|
|
mutex_unlock(&callback_mutex);
|
|
|
|
update_tasks_nodemask(cs, &tmp_mems, NULL);
|
2006-09-29 17:01:17 +08:00
|
|
|
}
|
2013-01-08 00:51:07 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
is_empty = cpumask_empty(cs->cpus_allowed) ||
|
|
|
|
nodes_empty(cs->mems_allowed);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_unlock(&cpuset_mutex);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If @cs became empty, move tasks to the nearest ancestor with
|
|
|
|
* execution resources.  This is a full cgroup operation which will
|
|
|
|
* also call back into cpuset. Should be done outside any lock.
|
|
|
|
*/
|
|
|
|
if (is_empty)
|
|
|
|
remove_tasks_in_empty_cpuset(cs);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
|
|
|
/* the following may free @cs, should be the last operation */
|
|
|
|
css_put(&cs->css);
|
2012-05-24 22:16:41 +08:00
|
|
|
}
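The propagation step above can be modelled in user space with plain bitmasks.
The following is a minimal sketch, assuming 64-bit masks stand in for cpumask
and nodemask values; struct cs_sketch and both helpers are hypothetical names,
and locking, task updates and reference counting are all left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpuset: one bit per CPU or memory node. */
struct cs_sketch {
	uint64_t cpus;
	uint64_t mems;
	struct cs_sketch *parent;   /* NULL for the top cpuset */
};

/*
 * Drop from @cs every CPU and node that is no longer present in @top.
 * Returns true if @cs is left without CPUs or without memory.
 */
static bool cs_sketch_propagate(struct cs_sketch *cs,
				const struct cs_sketch *top)
{
	uint64_t off_cpus = cs->cpus & ~top->cpus;  /* like cpumask_andnot() */
	uint64_t off_mems = cs->mems & ~top->mems;  /* like nodes_andnot() */

	cs->cpus &= ~off_cpus;
	cs->mems &= ~off_mems;

	return cs->cpus == 0 || cs->mems == 0;
}

/*
 * Find the nearest ancestor of @cs that still has both CPUs and memory.
 * @cs must not be the top cpuset. The NULL check is a guard added for
 * the sketch; the top cpuset is expected never to be empty.
 */
static struct cs_sketch *cs_sketch_fallback(struct cs_sketch *cs)
{
	struct cs_sketch *p = cs->parent;

	while (p->parent && (p->cpus == 0 || p->mems == 0))
		p = p->parent;
	return p;
}
```

A caller would run cs_sketch_propagate() on each descendant and, when it
returns true, move that cpuset's tasks to whatever cs_sketch_fallback() returns.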
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/**
|
|
|
|
* schedule_cpuset_propagate_hotplug - schedule hotplug propagation to a cpuset
|
|
|
|
* @cs: cpuset of interest
|
|
|
|
*
|
|
|
|
* Schedule cpuset_propagate_hotplug_workfn() which will update CPU and
|
|
|
|
* memory masks according to top_cpuset.
|
|
|
|
*/
|
|
|
|
static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Pin @cs. The refcnt will be released when the work item
|
|
|
|
* finishes executing.
|
|
|
|
*/
|
|
|
|
if (!css_tryget(&cs->css))
|
|
|
|
return;
|
2012-05-24 22:16:41 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* Queue @cs->hotplug_work. If already pending, lose the css ref.
|
|
|
|
* cpuset_propagate_hotplug_wq is ordered and propagation will
|
|
|
|
* happen in the order this function is called.
|
|
|
|
*/
|
|
|
|
if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
|
|
|
|
css_put(&cs->css);
|
2006-09-29 17:01:17 +08:00
|
|
|
}
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/**
|
2013-01-08 00:51:07 +08:00
|
|
|
* cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
|
2008-02-07 16:14:43 +08:00
|
|
|
*
|
2013-01-08 00:51:07 +08:00
|
|
|
* This function is called after either CPU or memory configuration has
|
|
|
|
* changed and updates cpuset accordingly. The top_cpuset is always
|
|
|
|
* synchronized to cpu_active_mask and N_MEMORY, which is necessary in
|
|
|
|
* order to make cpusets transparent (of no effect) on systems that are
|
|
|
|
* actively using CPU hotplug but making no active use of cpusets.
|
2008-02-07 16:14:43 +08:00
|
|
|
*
|
2013-01-08 00:51:07 +08:00
|
|
|
* Non-root cpusets are only affected by offlining. If any CPUs or memory
|
|
|
|
* nodes have been taken down, schedule_cpuset_propagate_hotplug() is
|
|
|
|
* invoked on all descendants.
|
2008-02-07 16:14:43 +08:00
|
|
|
*
|
2013-01-08 00:51:07 +08:00
|
|
|
* Note that CPU offlining during suspend is ignored. We don't modify
|
|
|
|
* cpusets across suspend/resume cycles at all.
|
2008-02-07 16:14:43 +08:00
|
|
|
*/
|
2013-01-08 00:51:07 +08:00
|
|
|
static void cpuset_hotplug_workfn(struct work_struct *work)
|
2006-09-29 17:01:17 +08:00
|
|
|
{
|
2013-01-08 00:51:07 +08:00
|
|
|
static cpumask_t new_cpus, tmp_cpus;
|
|
|
|
static nodemask_t new_mems, tmp_mems;
|
|
|
|
bool cpus_updated, mems_updated;
|
|
|
|
bool cpus_offlined, mems_offlined;
|
2006-09-29 17:01:17 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
2008-02-07 16:14:43 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* fetch the available cpus/mems and find out which changed how */
|
|
|
|
cpumask_copy(&new_cpus, cpu_active_mask);
|
|
|
|
new_mems = node_states[N_MEMORY];
|
2012-05-24 22:16:55 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
cpus_updated = !cpumask_equal(top_cpuset.cpus_allowed, &new_cpus);
|
|
|
|
cpus_offlined = cpumask_andnot(&tmp_cpus, top_cpuset.cpus_allowed,
|
|
|
|
&new_cpus);
|
2012-05-24 22:16:55 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
mems_updated = !nodes_equal(top_cpuset.mems_allowed, new_mems);
|
|
|
|
nodes_andnot(tmp_mems, top_cpuset.mems_allowed, new_mems);
|
|
|
|
mems_offlined = !nodes_empty(tmp_mems);
|
2012-05-24 22:16:55 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* synchronize cpus_allowed to cpu_active_mask */
|
|
|
|
if (cpus_updated) {
|
|
|
|
mutex_lock(&callback_mutex);
|
|
|
|
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
|
|
|
|
mutex_unlock(&callback_mutex);
|
|
|
|
/* we don't mess with cpumasks of tasks in top_cpuset */
|
|
|
|
}
|
2008-02-07 16:14:47 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* synchronize mems_allowed to N_MEMORY */
|
|
|
|
if (mems_updated) {
|
|
|
|
tmp_mems = top_cpuset.mems_allowed;
|
|
|
|
mutex_lock(&callback_mutex);
|
|
|
|
top_cpuset.mems_allowed = new_mems;
|
|
|
|
mutex_unlock(&callback_mutex);
|
|
|
|
update_tasks_nodemask(&top_cpuset, &tmp_mems, NULL);
|
|
|
|
}
|
2008-02-07 16:14:47 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* if cpus or mems went down, we need to propagate to descendants */
|
|
|
|
if (cpus_offlined || mems_offlined) {
|
|
|
|
struct cpuset *cs;
|
2013-01-08 00:51:08 +08:00
|
|
|
struct cgroup *pos_cgrp;
|
2008-07-25 16:47:22 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
cpuset_for_each_descendant_pre(cs, pos_cgrp, &top_cpuset)
|
|
|
|
schedule_cpuset_propagate_hotplug(cs);
|
|
|
|
rcu_read_unlock();
|
2013-01-08 00:51:07 +08:00
|
|
|
}
|
2012-05-24 22:16:55 +08:00
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_unlock(&cpuset_mutex);
|
2008-02-07 16:14:47 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* wait for propagations to finish */
|
|
|
|
flush_workqueue(cpuset_propagate_hotplug_wq);
|
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
/* rebuild sched domains if cpus_allowed has changed */
|
|
|
|
if (cpus_updated) {
|
|
|
|
struct sched_domain_attr *attr;
|
|
|
|
cpumask_var_t *doms;
|
|
|
|
int ndoms;
|
|
|
|
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_lock(&cpuset_mutex);
|
2013-01-08 00:51:07 +08:00
|
|
|
ndoms = generate_sched_domains(&doms, &attr);
|
2013-01-08 00:51:08 +08:00
|
|
|
mutex_unlock(&cpuset_mutex);
|
2013-01-08 00:51:07 +08:00
|
|
|
|
|
|
|
partition_sched_domains(ndoms, doms, attr);
|
2006-09-29 17:01:17 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-05-24 22:16:55 +08:00
|
|
|
void cpuset_update_active_cpus(bool cpu_online)
|
[PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map
Change the list of cpus allowed to tasks in the top (root) cpuset to
dynamically track what cpus are online, using a CPU hotplug notifier. Make
this top cpus file read-only.
On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.
If that system does support CPU hotplug, then these tasks cannot make use
of CPUs that are added after system boot, because the CPUs are not allowed
in the top cpuset. This is a surprising regression over earlier kernels
that didn't have cpusets enabled.
In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'cpus' file in the top (root) cpuset, making it read
only, and making it automatically track the value of cpu_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
by their cpuset.
Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
driving the fix, and earlier versions of this patch.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 16:23:51 +08:00
|
|
|
{
|
2013-01-08 00:51:07 +08:00
|
|
|
/*
|
|
|
|
* We're inside the CPU hotplug critical region which usually nests
|
|
|
|
* inside cgroup synchronization. Bounce actual hotplug processing
|
|
|
|
* to a work item to avoid reverse locking order.
|
|
|
|
*
|
|
|
|
* We still need to do partition_sched_domains() synchronously;
|
|
|
|
* otherwise, the scheduler will get confused and put tasks on the
|
|
|
|
* dead CPU. Fall back to the default single domain.
|
|
|
|
* cpuset_hotplug_workfn() will rebuild it as necessary.
|
|
|
|
*/
|
|
|
|
partition_sched_domains(1, NULL, NULL);
|
|
|
|
schedule_work(&cpuset_hotplug_work);
|
2006-08-27 16:23:51 +08:00
|
|
|
}
|
|
|
|
|
2006-09-29 17:01:17 +08:00
|
|
|
#ifdef CONFIG_MEMORY_HOTPLUG
|
[PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map
Change the list of memory nodes allowed to tasks in the top (root) cpuset
to dynamically track what memory nodes are online, using a call to a cpuset
hook from the memory hotplug code. Make this top 'mems' file read-only.
On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.
If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset. This is a surprising regression
over earlier kernels that didn't have cpusets enabled.
One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.
This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets. The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no effect, even though the change
(the write system call) appeared to succeed. With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes. Thus no one should be misled
into thinking they've changed the top_cpuset's 'mems' when in effect they
haven't.
In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.
[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 17:01:16 +08:00
|
|
|
/*
|
2012-12-13 05:51:24 +08:00
|
|
|
* Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
|
|
|
|
* Call this routine anytime after node_states[N_MEMORY] changes.
|
2012-05-24 22:17:03 +08:00
|
|
|
* See cpuset_update_active_cpus() for CPU hotplug handling.
|
2006-09-29 17:01:16 +08:00
|
|
|
*/
|
2008-11-20 07:36:30 +08:00
|
|
|
static int cpuset_track_online_nodes(struct notifier_block *self,
|
|
|
|
unsigned long action, void *arg)
|
2006-09-29 17:01:16 +08:00
|
|
|
{
|
2013-01-08 00:51:07 +08:00
|
|
|
schedule_work(&cpuset_hotplug_work);
|
2008-11-20 07:36:30 +08:00
|
|
|
return NOTIFY_OK;
|
2006-09-29 17:01:16 +08:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
|
|
|
* cpuset_init_smp - initialize cpus_allowed
|
|
|
|
*
|
|
|
|
* Description: Finish top cpuset after cpu, node maps are initialized
|
|
|
|
**/
|
|
|
|
|
|
|
|
void __init cpuset_init_smp(void)
|
|
|
|
{
|
2009-11-25 20:31:39 +08:00
|
|
|
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
|
2012-12-13 05:51:24 +08:00
|
|
|
top_cpuset.mems_allowed = node_states[N_MEMORY];
|
2006-08-27 16:23:51 +08:00
|
|
|
|
2008-11-20 07:36:30 +08:00
|
|
|
hotplug_memory_notifier(cpuset_track_online_nodes, 10);
|
2009-01-16 10:24:10 +08:00
|
|
|
|
2013-01-08 00:51:07 +08:00
|
|
|
cpuset_propagate_hotplug_wq =
|
|
|
|
alloc_ordered_workqueue("cpuset_hotplug", 0);
|
|
|
|
BUG_ON(!cpuset_propagate_hotplug_wq);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cpuset_cpus_allowed - return cpus_allowed mask from a task's cpuset.
|
|
|
|
* @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
|
2009-01-08 10:08:45 +08:00
|
|
|
* @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2009-01-08 10:08:44 +08:00
|
|
|
* Description: Returns the cpumask_var_t cpus_allowed of the cpuset
|
2005-04-17 06:20:36 +08:00
|
|
|
* attached to the specified @tsk. Guaranteed to return some non-empty
|
2012-03-29 13:08:31 +08:00
|
|
|
* subset of cpu_online_mask, even if this means going outside the
|
2005-04-17 06:20:36 +08:00
|
|
|
* task's cpuset.
|
|
|
|
**/
|
|
|
|
|
2009-01-08 10:08:45 +08:00
|
|
|
void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2006-01-08 17:01:55 +08:00
|
|
|
task_lock(tsk);
|
2008-04-05 09:11:07 +08:00
|
|
|
guarantee_online_cpus(task_cs(tsk), pmask);
|
2006-01-08 17:01:55 +08:00
|
|
|
task_unlock(tsk);
|
2010-03-15 17:10:03 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
sched: Fix select_fallback_rq() vs cpu_active/cpu_online
Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
supposed to finally sort the cpu_active mess, instead uncovered more.
Since CPU_STARTING is run before setting the cpu online, there's a
(small) window where the cpu has active,!online.
If during this time there's a wakeup of a task that used to reside on
that cpu select_task_rq() will use select_fallback_rq() to compute an
alternative cpu to run on since we find !online.
select_fallback_rq() however will compute the new cpu against
cpu_active, this means that it can return the same cpu it started out
with, the !online one, since that cpu is in fact marked active.
This results in us trying to schedule a task on an offline cpu and
triggering a WARN in the IPI code.
The solution proposed by Chuansheng Liu of setting cpu_active in
set_cpu_online() is buggy, firstly not all archs actually use
set_cpu_online(), secondly, not all archs call set_cpu_online() with
IRQs disabled, this means we would introduce either the same race or
the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
wrong CPU") -- albeit much narrower.
[ By setting online first and active later we have a window of
online,!active, fresh and bound kthreads have task_cpu() of 0 and
since cpu0 isn't in tsk_cpus_allowed() we end up in
select_fallback_rq() which excludes !active, resulting in a reset
of ->cpus_allowed and the thread running all over the place. ]
The solution is to re-work select_fallback_rq() to require active
_and_ online. This makes the active,!online case work as expected,
OTOH archs running CPU_STARTING after setting online are now
vulnerable to the issue from fd8a7de17 -- these are alpha and
blackfin.
Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-20 22:57:01 +08:00
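As a rough illustration of the fix described in the commit message above, the sketch below picks a fallback CPU only from those that are allowed, online and active at the same time. It is a user-space toy, not the scheduler's code: the 64-bit masks and the names `toy_cpu_bit` and `toy_select_fallback_cpu` are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the bit for one CPU in a 64-bit toy mask. */
static uint64_t toy_cpu_bit(int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return UINT64_C(1) << cpu;
}

/*
 * Pick the lowest-numbered CPU that is allowed, online AND active.
 * A CPU that is active but not yet online (the window described in
 * the commit message) is skipped. Returns -1 when no CPU qualifies;
 * a real caller would then have to widen the allowed mask.
 */
static int toy_select_fallback_cpu(uint64_t allowed, uint64_t online,
				   uint64_t active)
{
	uint64_t usable = allowed & online & active;
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (usable & toy_cpu_bit(cpu))
			return cpu;
	return -1;
}
```

With `allowed = 0x4`, `online = 0x3` and `active = 0x7`, CPU 2 is active but not online, so the sketch refuses it instead of handing back the CPU the task started on.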
|
|
|
void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
|
2010-03-15 17:10:27 +08:00
|
|
|
{
|
|
|
|
const struct cpuset *cs;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
cs = task_cs(tsk);
|
|
|
|
if (cs)
|
2011-05-19 14:08:58 +08:00
|
|
|
do_set_cpus_allowed(tsk, cs->cpus_allowed);
|
2010-03-15 17:10:27 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We own tsk->cpus_allowed, nobody can change it under us.
|
|
|
|
*
|
|
|
|
* But we used cs && cs->cpus_allowed lockless and thus can
|
|
|
|
* race with cgroup_attach_task() or update_cpumask() and get
|
|
|
|
* the wrong tsk->cpus_allowed. However, both cases imply the
|
|
|
|
* subsequent cpuset_change_cpumask()->set_cpus_allowed_ptr()
|
|
|
|
* which takes task_rq_lock().
|
|
|
|
*
|
|
|
|
* If we are called after it dropped the lock we must see all
|
|
|
|
* changes in task_cs()->cpus_allowed. Otherwise we can temporarily
|
|
|
|
* set any mask even if it is not right from task_cs() pov,
|
|
|
|
* the pending set_cpus_allowed_ptr() will fix things.
|
2012-03-20 22:57:01 +08:00
|
|
|
*
|
|
|
|
* select_fallback_rq() will fix things up and set cpu_possible_mask
|
|
|
|
* if required.
|
2010-03-15 17:10:27 +08:00
|
|
|
*/
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
void cpuset_init_current_mems_allowed(void)
|
|
|
|
{
|
2008-04-05 09:11:07 +08:00
|
|
|
nodes_setall(current->mems_allowed);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2006-01-08 17:01:55 +08:00
|
|
|
/**
|
|
|
|
* cpuset_mems_allowed - return mems_allowed mask from a task's cpuset.
|
|
|
|
* @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
|
|
|
|
*
|
|
|
|
* Description: Returns the nodemask_t mems_allowed of the cpuset
|
|
|
|
* attached to the specified @tsk. Guaranteed to return some non-empty
|
2012-12-13 05:51:24 +08:00
|
|
|
* subset of node_states[N_MEMORY], even if this means going outside the
|
2006-01-08 17:01:55 +08:00
|
|
|
* task's cpuset.
|
|
|
|
**/
|
|
|
|
|
|
|
|
nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
nodemask_t mask;
|
|
|
|
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
2006-01-08 17:01:55 +08:00
|
|
|
task_lock(tsk);
|
2007-10-19 14:39:39 +08:00
|
|
|
guarantee_online_mems(task_cs(tsk), &mask);
|
2006-01-08 17:01:55 +08:00
|
|
|
task_unlock(tsk);
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2006-01-08 17:01:55 +08:00
|
|
|
|
|
|
|
return mask;
|
|
|
|
}
|
|
|
|
|
2005-07-28 02:45:11 +08:00
|
|
|
/**
|
2008-04-28 17:12:18 +08:00
|
|
|
* cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
|
|
|
|
* @nodemask: the nodemask to be checked
|
2005-07-28 02:45:11 +08:00
|
|
|
*
|
2008-04-28 17:12:18 +08:00
|
|
|
* Are any of the nodes in the nodemask allowed in current->mems_allowed?
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2008-04-28 17:12:18 +08:00
|
|
|
int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2008-04-28 17:12:18 +08:00
|
|
|
return nodes_intersects(*nodemask, current->mems_allowed);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
[PATCH] cpusets: formalize intermediate GFP_KERNEL containment
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution. With this patch, there are now the following four layers of
memory placement available:
1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current task's cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.
These nest - each layer is a subset (same or within) of the previous.
Layer (2) above is new, with this patch. The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed. The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.
GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.
The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such. Swapper and oom_kill activity is also constrained to Layer (2). A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset. Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.
This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.
Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 06:18:12 +08:00
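The layered placement described in the commit message above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel implementation: `struct toy_cpuset`, the 64-bit node mask, and both function names are hypothetical, and only layers (1) to (3) are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct cpuset. */
struct toy_cpuset {
	const struct toy_cpuset *parent; /* NULL for the root cpuset */
	bool mem_exclusive;              /* models the 'mem_exclusive' flag */
	uint64_t mems_allowed;           /* bit n set => memory node n allowed */
};

/* Walk up until a mem_exclusive cpuset or the root is reached. */
static const struct toy_cpuset *
toy_nearest_exclusive_ancestor(const struct toy_cpuset *cs)
{
	while (!cs->mem_exclusive && cs->parent)
		cs = cs->parent;
	return cs;
}

/*
 * Layered check: interrupt context may use any node (layer 1), the
 * task's own cpuset is always acceptable (layer 3), and a request
 * without the hardwall restriction may also use the nearest enclosing
 * mem_exclusive cpuset (layer 2).
 */
static bool toy_node_allowed(const struct toy_cpuset *task_cs, int node,
			     bool in_interrupt, bool hardwall)
{
	uint64_t bit;

	assert(node >= 0 && node < 64);
	bit = UINT64_C(1) << node;

	if (in_interrupt)
		return true;
	if (task_cs->mems_allowed & bit)
		return true;
	if (hardwall)
		return false; /* GFP_USER-style request stops here */
	return (toy_nearest_exclusive_ancestor(task_cs)->mems_allowed & bit) != 0;
}
```

For a job cpuset holding node 0 inside a mem_exclusive batch cpuset holding nodes 0 and 1, a hardwall request for node 1 is refused, while a request without the hardwall restriction is allowed to escape to the batch cpuset.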
|
|
|
/*
|
2008-04-29 16:00:26 +08:00
|
|
|
* nearest_hardwall_ancestor() - Returns the nearest mem_exclusive or
|
|
|
|
* mem_hardwall ancestor to the specified cpuset. Call holding
|
|
|
|
* callback_mutex. If no ancestor is mem_exclusive or mem_hardwall
|
|
|
|
* (an unusual configuration), then returns the root cpuset.
|
2005-09-07 06:18:12 +08:00
|
|
|
*/
|
2008-04-29 16:00:26 +08:00
|
|
|
static const struct cpuset *nearest_hardwall_ancestor(const struct cpuset *cs)
|
2005-09-07 06:18:12 +08:00
|
|
|
{
|
2013-01-08 00:51:08 +08:00
|
|
|
while (!(is_mem_exclusive(cs) || is_mem_hardwall(cs)) && parent_cs(cs))
|
|
|
|
cs = parent_cs(cs);
|
2005-09-07 06:18:12 +08:00
|
|
|
return cs;
|
|
|
|
}
|
|
|
|
|
2005-07-28 02:45:11 +08:00
|
|
|
/**
|
2009-04-03 07:57:54 +08:00
|
|
|
* cpuset_node_allowed_softwall - Can we allocate on a memory node?
|
|
|
|
* @node: is this an allowed node?
|
[PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:
cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()
Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.
If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.
Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occasion.
The hardwall version requires that the current task's mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)
The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclosing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)
This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.
If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.
This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.
For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.
Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 16:34:25 +08:00
|
|
|
* @gfp_mask: memory allocation flags
|
2005-07-28 02:45:11 +08:00
|
|
|
*
|
2009-04-03 07:57:54 +08:00
|
|
|
* If we're in interrupt, yes, we can always allocate. If __GFP_THISNODE is
|
|
|
|
* set, yes, we can always allocate. If node is in our task's mems_allowed,
|
|
|
|
* yes. If it's not a __GFP_HARDWALL request and this node is in the nearest
|
|
|
|
* hardwalled cpuset ancestor to this task's cpuset, yes. If the task has been
|
|
|
|
* OOM killed and has access to memory reserves as specified by the TIF_MEMDIE
|
|
|
|
* flag, yes.
|
2005-09-07 06:18:12 +08:00
|
|
|
* Otherwise, no.
|
|
|
|
*
|
2009-04-03 07:57:54 +08:00
|
|
|
* If __GFP_HARDWALL is set, cpuset_node_allowed_softwall() reduces to
|
|
|
|
* cpuset_node_allowed_hardwall(). Otherwise, cpuset_node_allowed_softwall()
|
|
|
|
* might sleep, and might allow a node from an enclosing cpuset.
|
2006-12-13 16:34:25 +08:00
|
|
|
*
|
2009-04-03 07:57:54 +08:00
|
|
|
* cpuset_node_allowed_hardwall() only handles the simpler case of hardwall
|
|
|
|
* cpusets, and never sleeps.
|
2006-12-13 16:34:25 +08:00
|
|
|
*
|
|
|
|
* The __GFP_THISNODE placement logic is really handled elsewhere,
|
|
|
|
* by forcibly using a zonelist starting at a specified node, and by
|
|
|
|
* (in get_page_from_freelist()) refusing to consider the zones for
|
|
|
|
* any node on the zonelist except the first. By the time any such
|
|
|
|
* calls get to this routine, we should just shut up and say 'yes'.
|
|
|
|
*
|
2005-09-07 06:18:12 +08:00
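The layering described in this commit message can be sketched as a small userspace model. This is an illustration only, not the kernel's code: `struct cpuset_model`, `enum alloc_kind`, `nearest_exclusive_ancestor_model()` and `node_allowed_model()` are hypothetical names, node masks are simplified to a single 64-bit word, and layer (4) (mbind and set_mempolicy) is left out. The model answers only whether a node may be used, not which node is preferred when memory is plentiful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these are the kernel's definitions. */
typedef uint64_t nodemask_model_t;          /* bit n set => node n is allowed */

enum alloc_kind {
	ALLOC_INTERRUPT,                    /* allocation made in interrupt context */
	ALLOC_ATOMIC,                       /* stands in for GFP_ATOMIC */
	ALLOC_KERNEL,                       /* stands in for GFP_KERNEL (no hardwall bit) */
	ALLOC_USER                          /* stands in for GFP_USER (hardwall bit set) */
};

struct cpuset_model {
	const struct cpuset_model *parent;  /* NULL for the top cpuset */
	bool mem_exclusive;
	nodemask_model_t mems_allowed;
};

/* Walk up until a mem_exclusive cpuset or the top cpuset is reached. */
static const struct cpuset_model *
nearest_exclusive_ancestor_model(const struct cpuset_model *cs)
{
	while (!cs->mem_exclusive && cs->parent)
		cs = cs->parent;
	return cs;
}

/* Assumed simplification: decide only "may this node be used at all?". */
static bool node_allowed_model(const struct cpuset_model *cs,
			       enum alloc_kind kind, unsigned int node)
{
	nodemask_model_t bit;

	assert(node < 64);                  /* the model mask holds 64 nodes */
	bit = (nodemask_model_t)1 << node;

	/* Layer 1: interrupt and atomic allocations may use the whole system. */
	if (kind == ALLOC_INTERRUPT || kind == ALLOC_ATOMIC)
		return true;

	/* Layer 3: anything inside the task's own cpuset is fine. */
	if (cs->mems_allowed & bit)
		return true;

	/* GFP_USER-style requests are hardwalled and stop here. */
	if (kind == ALLOC_USER)
		return false;

	/* Layer 2: GFP_KERNEL-style requests may escape to the nearest
	 * enclosing mem_exclusive cpuset. */
	return (nearest_exclusive_ancestor_model(cs)->mems_allowed & bit) != 0;
}
```

With a job cpuset that allows only node 0, nested in a mem_exclusive batch cpuset that allows nodes 0 and 1, the model confines a GFP_USER-style request to node 0, lets a GFP_KERNEL-style request also use node 1, and lets atomic requests use any node.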
|
|
|
* GFP_USER allocations are marked with the __GFP_HARDWALL bit,
|
2007-05-07 05:49:32 +08:00
|
|
|
* and do not allow allocations outside the current task's cpuset
|
|
|
|
* unless the task has been OOM killed and is marked TIF_MEMDIE.
|
2005-09-07 06:18:12 +08:00
|
|
|
* GFP_KERNEL allocations are not so marked, so can escape to the
|
2008-04-29 16:00:26 +08:00
|
|
|
* nearest enclosing hardwalled ancestor cpuset.
|
2005-09-07 06:18:12 +08:00
|
|
|
*
|
[PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:
cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()
Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.
If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.
Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occasion.
The hardwall version requires that the current task's mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)
The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclosing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)
This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.
If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.
This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.
For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.
Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 16:34:25 +08:00
|
|
|
* Scanning up parent cpusets requires callback_mutex. The
|
|
|
|
* __alloc_pages() routine only calls here with __GFP_HARDWALL bit
|
|
|
|
* _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
|
|
|
|
* current task's mems_allowed came up empty on the first pass over
|
|
|
|
* the zonelist. So only GFP_KERNEL allocations, if all nodes in the
|
|
|
|
* cpuset are short of memory, might require taking the
|
|
|
|
* callback_mutex.
|
2005-09-07 06:18:12 +08:00
|
|
|
*
|
2006-05-21 06:00:10 +08:00
|
|
|
* The first call here from mm/page_alloc:get_page_from_freelist()
|
2006-12-13 16:34:25 +08:00
|
|
|
* has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
|
|
|
|
* so no allocation on a node outside the cpuset is allowed (unless
|
|
|
|
* in interrupt, of course).
|
2006-05-21 06:00:10 +08:00
|
|
|
*
|
|
|
|
* The second pass through get_page_from_freelist() doesn't even call
|
|
|
|
* here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
|
|
|
|
* variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
|
|
|
|
* in alloc_flags. That logic and the checks below have the combined
|
|
|
|
* effect that:
|
2005-09-07 06:18:12 +08:00
|
|
|
* in_interrupt - any node ok (current task context irrelevant)
|
|
|
|
* GFP_ATOMIC - any node ok
|
2007-05-07 05:49:32 +08:00
|
|
|
* TIF_MEMDIE - any node ok
|
2008-04-29 16:00:26 +08:00
|
|
|
* GFP_KERNEL - any node in enclosing hardwalled cpuset ok
|
2005-09-07 06:18:12 +08:00
|
|
|
* GFP_USER - only nodes in current task's mems_allowed ok.
|
2006-05-21 06:00:10 +08:00
|
|
|
*
|
|
|
|
* Rule:
|
2009-04-03 07:57:54 +08:00
|
|
|
* Don't call cpuset_node_allowed_softwall if you can't sleep, unless you
|
2006-05-21 06:00:10 +08:00
|
|
|
* pass in the __GFP_HARDWALL flag set in gfp_mask, which disables
|
|
|
|
* the code that might scan up ancestor cpusets and sleep.
|
2006-12-13 16:34:25 +08:00
|
|
|
*/
|
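The rule at the end of the comment above can be restated as a tiny predicate. The sketch below is a userspace illustration, not kernel code: `softwall_caller_must_be_sleepable()` and the `MODEL_GFP_*` bit values are hypothetical, and the logic mirrors only the early return and the `might_sleep_if()` condition visible at the top of `__cpuset_node_allowed_softwall()`.

```c
#include <stdbool.h>

/* Hypothetical bit values for this sketch; the kernel's own values differ. */
#define MODEL_GFP_HARDWALL 0x1u
#define MODEL_GFP_THISNODE 0x2u

/*
 * Returns true when a caller of the softwall check must be in a context
 * that is allowed to sleep, because the check may go on to scan ancestor
 * cpusets under callback_mutex.
 */
static bool softwall_caller_must_be_sleepable(bool in_interrupt,
					      unsigned int gfp_mask)
{
	/* Interrupt context and __GFP_THISNODE get an immediate "yes",
	 * so the hierarchy is never scanned. */
	if (in_interrupt || (gfp_mask & MODEL_GFP_THISNODE))
		return false;

	/* With the hardwall bit set the ancestor scan is disabled;
	 * without it the caller has to tolerate sleeping. */
	return !(gfp_mask & MODEL_GFP_HARDWALL);
}
```

A caller in process context that holds a spinlock is the case the rule guards against: it is not in interrupt, so unless it passes the hardwall bit the predicate reports that sleeping must be possible.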
2009-04-03 07:57:54 +08:00
|
|
|
int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2005-09-07 06:18:12 +08:00
|
|
|
const struct cpuset *cs; /* current cpuset ancestors */
|
2006-03-24 19:16:12 +08:00
|
|
|
int allowed; /* is allocation on this node allowed? */
|
2005-09-07 06:18:12 +08:00
|
|
|
|
2006-09-26 14:31:40 +08:00
|
|
|
if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
|
2005-09-07 06:18:12 +08:00
|
|
|
return 1;
|
2006-05-21 06:00:11 +08:00
|
|
|
might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
|
[PATCH] cpusets: formalize intermediate GFP_KERNEL containment
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution. With this patch, there are now the following four layers of
memory placement available:
1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current tasks cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.
These nest - each layer is a subset (same or within) of the previous.
Layer (2) above is new, with this patch. The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed. The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.
GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.
The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such. Swapper and oom_kill activity is also constrained to Layer (2). A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset. Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.
This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.
Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 06:18:12 +08:00
|
|
|
if (node_isset(node, current->mems_allowed))
|
|
|
|
return 1;
|
2007-05-07 05:49:32 +08:00
|
|
|
/*
|
|
|
|
* Allow tasks that have access to memory reserves because they have
|
|
|
|
* been OOM killed to get memory anywhere.
|
|
|
|
*/
|
|
|
|
if (unlikely(test_thread_flag(TIF_MEMDIE)))
|
|
|
|
return 1;
|
2005-09-07 06:18:12 +08:00
|
|
|
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
|
|
|
|
return 0;
|
|
|
|
|
2005-11-14 08:06:35 +08:00
|
|
|
if (current->flags & PF_EXITING) /* Let dying task have memory */
|
|
|
|
return 1;
|
|
|
|
|
2005-09-07 06:18:12 +08:00
|
|
|
/* Not hardwall and node outside mems_allowed: scan up cpusets */
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_lock(&callback_mutex);
|
[PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking. Replace single semaphore with two semaphores.
The suggestion to use two locks was made by Roman Zippel.
Both locks are global. Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers. Before making actual
changes, the second semaphore, callback_sem must be acquired as well. Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.
The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.
This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention. Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would-be modifiers).
This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman). I've never
seen these races fail in any use or test.
See further the comments in the code.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31 07:02:30 +08:00
|
|
|
|
|
|
|
task_lock(current);
|
2008-04-29 16:00:26 +08:00
|
|
|
cs = nearest_hardwall_ancestor(task_cs(current));
|
2005-10-31 07:02:30 +08:00
|
|
|
task_unlock(current);
|
|
|
|
|
2005-09-07 06:18:12 +08:00
|
|
|
allowed = node_isset(node, cs->mems_allowed);
|
2006-03-23 19:00:18 +08:00
|
|
|
mutex_unlock(&callback_mutex);
|
2005-09-07 06:18:12 +08:00
|
|
|
return allowed;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
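The order of checks in the softwall routine that ends here can be modelled in user space. The following is a minimal sketch, not the kernel code: every `model_*` and `MODEL_*` name is hypothetical, node sets are reduced to a bitmask in an `unsigned long`, and the first check (interrupt context or __GFP_THISNODE) is assumed to match the one in the hardwall variant. Locking is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's gfp bits; the values are arbitrary. */
#define MODEL_GFP_HARDWALL 0x1u
#define MODEL_GFP_THISNODE 0x2u

/* Hypothetical, simplified cpuset: bit n of mems_allowed means node n is allowed. */
struct model_cpuset {
    unsigned long mems_allowed;
    bool mem_exclusive;              /* marks a layer (2) boundary */
    struct model_cpuset *parent;     /* NULL for the top cpuset */
};

/* Hypothetical, simplified task state. */
struct model_task {
    struct model_cpuset *cs;
    unsigned long mems_allowed;      /* the task's own node mask, layer (3) */
    bool in_interrupt;
    bool memdie;                     /* models TIF_MEMDIE */
    bool exiting;                    /* models PF_EXITING */
};

/* Nonzero if 'node' is set in 'mask'; node must be below the bit width of long. */
static inline int model_node_isset(int node, unsigned long mask)
{
    return (int)((mask >> node) & 1UL);
}

/* Walk up until a mem_exclusive cpuset or the top cpuset is reached. */
const struct model_cpuset *
model_nearest_exclusive_ancestor(const struct model_cpuset *cs)
{
    while (!cs->mem_exclusive && cs->parent)
        cs = cs->parent;
    return cs;
}

/* Same order of checks as the softwall routine, minus locking. */
int model_node_allowed_softwall(const struct model_task *t, int node,
                                unsigned int gfp)
{
    if (t->in_interrupt || (gfp & MODEL_GFP_THISNODE))
        return 1;
    if (model_node_isset(node, t->mems_allowed))
        return 1;
    if (t->memdie)                   /* OOM-killed task may allocate anywhere */
        return 1;
    if (gfp & MODEL_GFP_HARDWALL)    /* hardwall request: stop here */
        return 0;
    if (t->exiting)                  /* let a dying task have memory */
        return 1;
    /* Not hardwall and node outside mems_allowed: scan up the hierarchy. */
    return model_node_isset(node,
            model_nearest_exclusive_ancestor(t->cs)->mems_allowed);
}
```

In this model a request without the hardwall bit can escape from the task's own mask to the nearest enclosing mem_exclusive cpuset, while the same request with the bit set is refused as soon as the task's own mask rejects the node.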
|
|
|
|
|
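The callback_mutex taken in the softwall routine follows the two-lock scheme described in the "dual semaphore locking overhaul" commit message: modifiers take an outer lock first and nest the inner lock only around the actual change, while queries take only the inner lock. A minimal user-space sketch of that pattern with POSIX mutexes follows. All `model_*` names are hypothetical, and the validation rule (rejecting an empty mask) is only a placeholder for the real checks.

```c
#include <pthread.h>

/* Hypothetical user-space stand-ins for manage_sem and callback_sem. */
static pthread_mutex_t model_manage_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t model_callback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared state standing in for one cpuset's mems_allowed bitmask. */
static unsigned long model_mems = 0x1UL;

/* Query path: only the inner (callback) lock is taken. */
unsigned long model_query_mems(void)
{
    unsigned long snapshot;

    pthread_mutex_lock(&model_callback_lock);
    snapshot = model_mems;
    pthread_mutex_unlock(&model_callback_lock);
    return snapshot;
}

/*
 * Modify path: take the outer (manage) lock first, which holds off other
 * would-be modifiers but not queries, then nest the inner lock only
 * around the write itself.
 */
int model_update_mems(unsigned long new_mask)
{
    int ret = 0;

    pthread_mutex_lock(&model_manage_lock);
    if (new_mask == 0) {
        /* Placeholder validation: readers are not blocked while we check. */
        ret = -1;
    } else {
        pthread_mutex_lock(&model_callback_lock);
        model_mems = new_mask;
        pthread_mutex_unlock(&model_callback_lock);
    }
    pthread_mutex_unlock(&model_manage_lock);
    return ret;
}
```

Because the locks are always acquired in the same order (outer, then inner) and the query path never touches the outer lock, a query issued while a modifier is still validating cannot deadlock against it.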
[PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:
cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()
Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.
If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.
Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occasion.
The hardwall version requires that the current task's mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)
The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclosing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)
This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.
If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.
This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.
For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.
Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 16:34:25 +08:00
|
|
|
/*
|
2009-04-03 07:57:54 +08:00
|
|
|
* cpuset_node_allowed_hardwall - Can we allocate on a memory node?
|
|
|
|
* @node: is this an allowed node?
|
2006-12-13 16:34:25 +08:00
|
|
|
* @gfp_mask: memory allocation flags
|
|
|
|
*
|
2009-04-03 07:57:54 +08:00
|
|
|
* If we're in interrupt, yes, we can always allocate. If __GFP_THISNODE is
|
|
|
|
* set, yes, we can always allocate. If node is in our task's mems_allowed,
|
|
|
|
* yes. If the task has been OOM killed and has access to memory reserves as
|
|
|
|
* specified by the TIF_MEMDIE flag, yes.
|
|
|
|
* Otherwise, no.
|
2006-12-13 16:34:25 +08:00
|
|
|
*
|
|
|
|
* The __GFP_THISNODE placement logic is really handled elsewhere,
|
|
|
|
* by forcibly using a zonelist starting at a specified node, and by
|
|
|
|
* (in get_page_from_freelist()) refusing to consider the zones for
|
|
|
|
* any node on the zonelist except the first. By the time any such
|
|
|
|
* calls get to this routine, we should just shut up and say 'yes'.
|
|
|
|
*
|
2009-04-03 07:57:54 +08:00
|
|
|
* Unlike the cpuset_node_allowed_softwall() variant, above,
|
|
|
|
* this variant requires that the node be in the current task's
|
2006-12-13 16:34:25 +08:00
|
|
|
* mems_allowed or that we're in interrupt. It does not scan up the
|
|
|
|
* cpuset hierarchy for the nearest enclosing mem_exclusive cpuset.
|
|
|
|
* It never sleeps.
|
|
|
|
*/
|
2009-04-03 07:57:54 +08:00
|
|
|
int __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
|
2006-12-13 16:34:25 +08:00
|
|
|
{
|
|
|
|
if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
|
|
|
|
return 1;
|
|
|
|
if (node_isset(node, current->mems_allowed))
|
|
|
|
return 1;
|
2007-10-18 18:06:04 +08:00
|
|
|
/*
|
|
|
|
* Allow tasks that have access to memory reserves because they have
|
|
|
|
* been OOM killed to get memory anywhere.
|
|
|
|
*/
|
|
|
|
if (unlikely(test_thread_flag(TIF_MEMDIE)))
|
|
|
|
return 1;
|
2006-12-13 16:34:25 +08:00
|
|
|
return 0;
|
|
|
|
}
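The hardwall/softwall distinction described in the commit message above can be modelled in user space. This is only a toy sketch under stated assumptions, not the kernel routine: every `sketch_*` name is hypothetical, node masks are plain 64-bit integers, the GFP flags and interrupt state are reduced to booleans, and the TIF_MEMDIE and single-cpuset shortcuts are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpuset: one bit per memory node. */
struct sketch_cpuset {
	const struct sketch_cpuset *parent;	/* NULL for the root cpuset */
	uint64_t mems_allowed;
	bool mem_exclusive;
};

/* Hypothetical stand-in for the allocation context. */
struct sketch_ctx {
	bool in_interrupt;
	bool gfp_thisnode;	/* models __GFP_THISNODE */
	bool gfp_hardwall;	/* models __GFP_HARDWALL */
	uint64_t task_mems;	/* the task's mems_allowed */
	const struct sketch_cpuset *cs;	/* the task's cpuset */
};

static bool sketch_node_isset(int node, uint64_t mask)
{
	assert(node >= 0 && node < 64);
	return (mask >> node) & 1;
}

/* Hardwall: never looks beyond the task's own mems_allowed. */
static bool sketch_allowed_hardwall(const struct sketch_ctx *c, int node)
{
	if (c->in_interrupt || c->gfp_thisnode)
		return true;
	return sketch_node_isset(node, c->task_mems);
}

/* Softwall: may fall back to the nearest mem_exclusive ancestor. */
static bool sketch_allowed_softwall(const struct sketch_ctx *c, int node)
{
	const struct sketch_cpuset *cs = c->cs;

	if (sketch_allowed_hardwall(c, node))
		return true;
	if (c->gfp_hardwall)
		return false;
	/* Per the commit message, the real walk takes callback_mutex,
	 * which is why only the softwall flavor might sleep. */
	while (cs->parent && !cs->mem_exclusive)
		cs = cs->parent;
	return sketch_node_isset(node, cs->mems_allowed);
}
```

In this model a caller that cannot sleep would use only `sketch_allowed_hardwall()`, which mirrors the explicit choice the reworked API forces on callers.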
|
|
|
|
|
[PATCH] cpuset memory spread basic implementation
This patch provides the implementation and cpuset interface for an alternative
memory allocation policy that can be applied to certain kinds of memory
allocations, such as the page cache (file system buffers) and some slab caches
(such as inode caches).
The policy is called "memory spreading." If enabled, it spreads out these
kinds of memory allocations over all the nodes allowed to a task, instead of
preferring to place them on the node where the task is executing.
All other kinds of allocations, including anonymous pages for a task's stack
and data regions, are not affected by this policy choice, and continue to be
allocated preferring the node local to execution, as modified by the NUMA
mempolicy.
There are two boolean flag files per cpuset that control where the kernel
allocates pages for the file system buffers and related in kernel data
structures. They are called 'memory_spread_page' and 'memory_spread_slab'.
If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
kernel will spread the file system buffers (page cache) evenly over all the
nodes that the faulting task is allowed to use, instead of preferring to put
those pages on the node where the task is running.
If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
kernel will spread some file system related slab caches, such as for inodes
and dentries evenly over all the nodes that the faulting task is allowed to
use, instead of preferring to put those pages on the node where the task is
running.
The implementation is simple. Setting the cpuset flags 'memory_spread_page'
or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
subsequently joins that cpuset. In subsequent patches, the page allocation
calls for the affected page cache and slab caches are modified to perform an
inline check for these flags, and if set, a call to a new routine
cpuset_mem_spread_node() returns the node to prefer for the allocation.
The cpuset_mem_spread_node() routine is also simple. It uses the value of a
per-task rotor cpuset_mem_spread_rotor to select the next node in the current
task's mems_allowed to prefer for the allocation.
This policy can provide substantial improvements for jobs that need to place
thread local data on the corresponding node, but that need to access large
file system data sets that need to be spread across the several nodes in the
job's cpuset in order to fit. Without this patch, especially for jobs that
might have one thread reading in the data set, the memory allocation across
the nodes in the job's cpuset can become very uneven.
A couple of Copyright year ranges are updated as well. And a couple of email
addresses that can be found in the MAINTAINERS file are removed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 19:16:03 +08:00
|
|
|
/**
|
2010-05-27 05:42:49 +08:00
|
|
|
* cpuset_mem_spread_node() - On which node to begin search for a file page
|
|
|
|
* cpuset_slab_spread_node() - On which node to begin search for a slab page
|
[PATCH] cpuset memory spread basic implementation
2006-03-24 19:16:03 +08:00
|
|
|
*
|
|
|
|
* If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
|
|
|
|
* tasks in a cpuset with is_spread_page or is_spread_slab set),
|
|
|
|
* and if the memory allocation used cpuset_mem_spread_node()
|
|
|
|
* to determine on which node to start looking, as it will for
|
|
|
|
* certain page cache or slab cache pages such as used for file
|
|
|
|
* system buffers and inode caches, then instead of starting on the
|
|
|
|
* local node to look for a free page, rather spread the starting
|
|
|
|
* node around the task's mems_allowed nodes.
|
|
|
|
*
|
|
|
|
* We don't have to worry about the returned node being offline
|
|
|
|
* because "it can't happen", and even if it did, it would be ok.
|
|
|
|
*
|
|
|
|
* The routines calling guarantee_online_mems() are careful to
|
|
|
|
* only set nodes in task->mems_allowed that are online. So it
|
|
|
|
* should not be possible for the following code to return an
|
|
|
|
* offline node. But if it did, that would be ok, as this routine
|
|
|
|
* is not returning the node where the allocation must be, only
|
|
|
|
* the node where the search should start. The zonelist passed to
|
|
|
|
* __alloc_pages() will include all nodes. If the slab allocator
|
|
|
|
* is passed an offline node, it will fall back to the local node.
|
|
|
|
* See kmem_cache_alloc_node().
|
|
|
|
*/
|
|
|
|
|
2010-05-27 05:42:49 +08:00
|
|
|
static int cpuset_spread_node(int *rotor)
|
[PATCH] cpuset memory spread basic implementation
2006-03-24 19:16:03 +08:00
|
|
|
{
|
|
|
|
int node;
|
|
|
|
|
2010-05-27 05:42:49 +08:00
|
|
|
node = next_node(*rotor, current->mems_allowed);
|
[PATCH] cpuset memory spread basic implementation
2006-03-24 19:16:03 +08:00
|
|
|
if (node == MAX_NUMNODES)
|
|
|
|
node = first_node(current->mems_allowed);
|
2010-05-27 05:42:49 +08:00
|
|
|
*rotor = node;
|
[PATCH] cpuset memory spread basic implementation
2006-03-24 19:16:03 +08:00
|
|
|
return node;
|
|
|
|
}
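The rotor logic of cpuset_spread_node() above can be tried out in user space. This is a minimal sketch, assuming a 64-bit integer in place of the kernel's nodemask_t; the `sketch_*` names and `SKETCH_MAX_NODES` are hypothetical stand-ins for next_node(), first_node() and MAX_NUMNODES, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 64	/* hypothetical stand-in for MAX_NUMNODES */

/* Lowest allowed node strictly above prev, or SKETCH_MAX_NODES if none. */
static int sketch_next_node(int prev, uint64_t allowed)
{
	for (int n = prev + 1; n < SKETCH_MAX_NODES; n++)
		if (allowed & (UINT64_C(1) << n))
			return n;
	return SKETCH_MAX_NODES;
}

/* Lowest allowed node, or SKETCH_MAX_NODES if the mask is empty. */
static int sketch_first_node(uint64_t allowed)
{
	return sketch_next_node(-1, allowed);
}

/* Advance *rotor to the next allowed node, wrapping at the end. */
static int sketch_spread_node(int *rotor, uint64_t allowed)
{
	int node;

	assert(allowed != 0);	/* the sketch requires a non-empty mask */
	node = sketch_next_node(*rotor, allowed);
	if (node == SKETCH_MAX_NODES)
		node = sketch_first_node(allowed);
	*rotor = node;
	return node;
}
```

With nodes 1, 3 and 4 allowed and the rotor starting at 1, successive calls return 3, 4, then wrap to 1, so the starting node for each allocation cycles over the whole allowed set.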
|
2010-05-27 05:42:49 +08:00
|
|
|
|
|
|
|
int cpuset_mem_spread_node(void)
|
|
|
|
{
|
2011-07-27 07:08:30 +08:00
|
|
|
if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
|
|
|
|
current->cpuset_mem_spread_rotor =
|
|
|
|
node_random(&current->mems_allowed);
|
|
|
|
|
2010-05-27 05:42:49 +08:00
|
|
|
return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
|
|
|
|
}
|
|
|
|
|
|
|
|
int cpuset_slab_spread_node(void)
|
|
|
|
{
|
2011-07-27 07:08:30 +08:00
|
|
|
if (current->cpuset_slab_spread_rotor == NUMA_NO_NODE)
|
|
|
|
current->cpuset_slab_spread_rotor =
|
|
|
|
node_random(&current->mems_allowed);
|
|
|
|
|
2010-05-27 05:42:49 +08:00
|
|
|
return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
|
|
|
|
}
|
|
|
|
|
[PATCH] cpuset memory spread basic implementation
2006-03-24 19:16:03 +08:00
|
|
|
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
|
|
|
|
|
2005-09-07 06:18:13 +08:00
|
|
|
/**
|
2007-10-17 14:25:58 +08:00
|
|
|
* cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
|
|
|
|
* @tsk1: pointer to task_struct of some task.
|
|
|
|
* @tsk2: pointer to task_struct of some other task.
|
|
|
|
*
|
|
|
|
* Description: Return true if @tsk1's mems_allowed intersects the
|
|
|
|
* mems_allowed of @tsk2. Used by the OOM killer to determine if
|
|
|
|
* one of the task's memory usage might impact the memory available
|
|
|
|
* to the other.
|
2005-09-07 06:18:13 +08:00
|
|
|
**/
|
|
|
|
|
2007-10-17 14:25:58 +08:00
|
|
|
int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
|
|
|
|
const struct task_struct *tsk2)
|
2005-09-07 06:18:13 +08:00
|
|
|
{
|
2007-10-17 14:25:58 +08:00
|
|
|
return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
|
2005-09-07 06:18:13 +08:00
|
|
|
}
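The intersection test above reduces to a bitwise AND when a node mask fits in one machine word. A minimal sketch, assuming a 64-bit mask instead of nodemask_t; `struct sketch_task` and `sketch_mems_intersect()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of task_struct used here. */
struct sketch_task {
	const char *comm;
	uint64_t mems_allowed;	/* one bit per memory node */
};

/* True if the two tasks share at least one allowed memory node. */
static bool sketch_mems_intersect(const struct sketch_task *a,
				  const struct sketch_task *b)
{
	assert(a && b);
	return (a->mems_allowed & b->mems_allowed) != 0;
}
```

Two tasks with disjoint masks cannot take memory from each other's nodes, which is the property the description says the OOM killer is checking for.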
|
|
|
|
|
2009-01-07 06:39:01 +08:00
|
|
|
/**
|
|
|
|
* cpuset_print_task_mems_allowed - prints task's cpuset and mems_allowed
|
|
|
|
* @tsk: pointer to task_struct of some task.
|
|
|
|
*
|
|
|
|
* Description: Prints @tsk's name, cpuset name, and cached copy of its
|
|
|
|
* mems_allowed to the kernel log. Must hold task_lock(tsk) to allow
|
|
|
|
* dereferencing task_cs(tsk).
|
|
|
|
*/
|
|
|
|
void cpuset_print_task_mems_allowed(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct dentry *dentry;
|
|
|
|
|
|
|
|
dentry = task_cs(tsk)->css.cgroup->dentry;
|
|
|
|
spin_lock(&cpuset_buffer_lock);
|
2013-01-25 16:08:01 +08:00
|
|
|
|
|
|
|
if (!dentry) {
|
|
|
|
strcpy(cpuset_name, "/");
|
|
|
|
} else {
|
|
|
|
spin_lock(&dentry->d_lock);
|
|
|
|
strlcpy(cpuset_name, (const char *)dentry->d_name.name,
|
|
|
|
CPUSET_NAME_LEN);
|
|
|
|
spin_unlock(&dentry->d_lock);
|
|
|
|
}
|
|
|
|
|
2009-01-07 06:39:01 +08:00
|
|
|
nodelist_scnprintf(cpuset_nodelist, CPUSET_NODELIST_LEN,
|
|
|
|
tsk->mems_allowed);
|
|
|
|
printk(KERN_INFO "%s cpuset=%s mems_allowed=%s\n",
|
|
|
|
tsk->comm, cpuset_name, cpuset_nodelist);
|
|
|
|
spin_unlock(&cpuset_buffer_lock);
|
|
|
|
}
|
|
|
|
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
|
|
|
/*
|
|
|
|
* Collection of memory_pressure is suppressed unless
|
|
|
|
* this flag is enabled by writing "1" to the special
|
|
|
|
* cpuset file 'memory_pressure_enabled' in the root cpuset.
|
|
|
|
*/
|
|
|
|
|
2006-01-08 17:01:51 +08:00
|
|
|
int cpuset_memory_pressure_enabled __read_mostly;
|
[PATCH] cpuset: memory pressure meter
2006-01-08 17:01:49 +08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.
|
|
|
|
*
|
|
|
|
* Keep a running average of the rate of synchronous (direct)
|
|
|
|
* page reclaim efforts initiated by tasks in each cpuset.
|
|
|
|
*
|
|
|
|
* This represents the rate at which some task in the cpuset
|
|
|
|
* ran low on memory on all nodes it was allowed to use, and
|
|
|
|
* had to enter the kernel's page reclaim code in an effort to
|
|
|
|
* create more free memory by tossing clean pages or swapping
|
|
|
|
* or writing dirty pages.
|
|
|
|
*
|
|
|
|
* Display to user space in the per-cpuset read-only file
|
|
|
|
* "memory_pressure". Value displayed is an integer
|
|
|
|
* representing the recent rate of entry into the synchronous
|
|
|
|
* (direct) page reclaim by any task attached to the cpuset.
|
|
|
|
**/
|
|
|
|
|
|
|
|
void __cpuset_memory_pressure_bump(void)
|
|
|
|
{
|
|
|
|
task_lock(current);
|
2007-10-19 14:39:39 +08:00
|
|
|
fmeter_markevent(&task_cs(current)->fmeter);
|
[PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
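The filter described above can be sketched in user space as follows. This is a minimal illustration under stated assumptions, not the patch's code: the names `pressure_meter`, `meter_mark_event` and `meter_read`, the explicit `now` argument (in seconds), and the constants are all hypothetical. The decay coefficient 933/1000 is chosen because 0.933^10 is roughly 0.5, which gives the 10 second half-life, and the fixed-point scale of 1000 gives the "times 1000" units. The spinlock is omitted here, so callers are assumed to serialize access.

```c
/* Hypothetical constants; see the assumptions stated above. */
#define METER_SCALE    1000     /* fixed-point scale: result is rate * 1000 */
#define METER_COEF     933      /* per-second decay; 0.933^10 ~= 0.5 */
#define METER_MAXTICKS 99       /* bound on the work done by the decay loop */
#define METER_MAXCNT   1000000  /* cap on unprocessed (scaled) events */

/* Three words of state, as in the description above. */
struct pressure_meter {
	int  cnt;   /* events since the last update, scaled by METER_SCALE */
	int  val;   /* filtered rate: events per second, times 1000 */
	long time;  /* second at which val was last updated */
};

void meter_init(struct pressure_meter *m, long now)
{
	m->cnt = 0;
	m->val = 0;
	m->time = now;
}

/* Decay val once per elapsed second, then fold in the pending events. */
void meter_update(struct pressure_meter *m, long now)
{
	long ticks = now - m->time;

	if (ticks <= 0)
		return;
	if (ticks > METER_MAXTICKS)
		ticks = METER_MAXTICKS;
	while (ticks-- > 0)
		m->val = (METER_COEF * m->val) / METER_SCALE;
	m->time = now;

	m->val += ((METER_SCALE - METER_COEF) * m->cnt) / METER_SCALE;
	m->cnt = 0;
}

/* Called on each direct-reclaim attempt by a task in the cpuset. */
void meter_mark_event(struct pressure_meter *m, long now)
{
	meter_update(m, now);
	m->cnt += METER_SCALE;
	if (m->cnt > METER_MAXCNT)
		m->cnt = METER_MAXCNT;
}

/* Called when the per-cpuset file is read: a single read suffices. */
int meter_read(struct pressure_meter *m, long now)
{
	meter_update(m, now);
	return m->val;
}
```

With a steady rate of r events per second, the value in this sketch converges toward r * 1000, and after the events stop it falls by about half every 10 seconds.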
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:49 +08:00
|
|
|
task_unlock(current);
|
|
|
|
}
|
|
|
|
|
2007-10-19 14:39:39 +08:00
|
|
|
#ifdef CONFIG_PROC_PID_CPUSET
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* proc_cpuset_show()
|
|
|
|
* - Print task's cpuset path into seq_file.
|
|
|
|
* - Used for /proc/<pid>/cpuset.
|
[PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking. Replace single semaphore with two semaphores.
The suggestion to use two locks was made by Roman Zippel.
Both locks are global. Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers. Before making actual
changes, the second semaphore, callback_sem must be acquired as well. Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.
The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.
This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access without such intention. Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would-be modifiers).
This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman). I've never
seen these races fail in any use or test.
See further the comments in the code.
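The locking order described above can be sketched in user space with two mutexes. This is only an illustration under stated assumptions: `manage_lock` and `callback_lock` stand in for manage_sem and callback_sem, and `cpus_allowed`, `query_cpus` and `set_cpus` are hypothetical names that do not appear in the patch.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the two global semaphores. */
static pthread_mutex_t manage_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t callback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state standing in for cpuset data. */
static unsigned long cpus_allowed = 0x1;

/* Query path: takes only the inner lock, so it never waits on a
 * modifier that is still in its validation phase. */
unsigned long query_cpus(void)
{
	unsigned long v;

	pthread_mutex_lock(&callback_lock);
	v = cpus_allowed;
	pthread_mutex_unlock(&callback_lock);
	return v;
}

/* Modify path: the outer lock holds off other would-be modifiers;
 * the inner lock is nested only around the actual change. */
int set_cpus(unsigned long new_mask)
{
	int ret = 0;

	pthread_mutex_lock(&manage_lock);

	/* Validation: read-only access, queries may still proceed. */
	if (new_mask == 0) {
		ret = -1;
		goto out;
	}

	pthread_mutex_lock(&callback_lock);
	cpus_allowed = new_mask;
	pthread_mutex_unlock(&callback_lock);
out:
	pthread_mutex_unlock(&manage_lock);
	return ret;
}
```

The key property is the fixed nesting order: the outer lock is always taken before the inner one, and query-only code never takes the outer lock at all.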
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31 07:02:30 +08:00
|
|
|
* - No need to task_lock(tsk) on this tsk->cpuset reference, as it
|
|
|
|
* doesn't really matter if tsk->cpuset changes after we read it,
|
2013-01-08 00:51:08 +08:00
|
|
|
* and we take cpuset_mutex, keeping cpuset_attach() from changing it
|
2008-02-07 16:14:45 +08:00
|
|
|
* anyway.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2007-10-19 14:40:20 +08:00
|
|
|
static int proc_cpuset_show(struct seq_file *m, void *unused_v)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2006-06-26 15:25:56 +08:00
|
|
|
struct pid *pid;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct task_struct *tsk;
|
|
|
|
char *buf;
|
2007-10-19 14:39:39 +08:00
|
|
|
struct cgroup_subsys_state *css;
|
2006-06-26 15:25:55 +08:00
|
|
|
int retval;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-06-26 15:25:55 +08:00
|
|
|
retval = -ENOMEM;
|
2005-04-17 06:20:36 +08:00
|
|
|
buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
|
|
|
|
if (!buf)
|
2006-06-26 15:25:55 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
retval = -ESRCH;
|
2006-06-26 15:25:56 +08:00
|
|
|
pid = m->private;
|
|
|
|
tsk = get_pid_task(pid, PIDTYPE_PID);
|
2006-06-26 15:25:55 +08:00
|
|
|
if (!tsk)
|
|
|
|
goto out_free;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-15 14:10:57 +08:00
|
|
|
rcu_read_lock();
|
2007-10-19 14:39:39 +08:00
|
|
|
css = task_subsys_state(tsk, cpuset_subsys_id);
|
|
|
|
retval = cgroup_path(css->cgroup, buf, PAGE_SIZE);
|
2013-01-15 14:10:57 +08:00
|
|
|
rcu_read_unlock();
|
2005-04-17 06:20:36 +08:00
|
|
|
if (retval < 0)
|
2013-01-15 14:10:57 +08:00
|
|
|
goto out_put_task;
|
2005-04-17 06:20:36 +08:00
|
|
|
seq_puts(m, buf);
|
|
|
|
seq_putc(m, '\n');
|
2013-01-15 14:10:57 +08:00
|
|
|
out_put_task:
|
2006-06-26 15:25:55 +08:00
|
|
|
put_task_struct(tsk);
|
|
|
|
out_free:
|
2005-04-17 06:20:36 +08:00
|
|
|
kfree(buf);
|
2006-06-26 15:25:55 +08:00
|
|
|
out:
|
2005-04-17 06:20:36 +08:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cpuset_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
2006-06-26 15:25:56 +08:00
|
|
|
struct pid *pid = PROC_I(inode)->pid;
|
|
|
|
return single_open(file, proc_cpuset_show, pid);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2007-02-12 16:55:35 +08:00
|
|
|
const struct file_operations proc_cpuset_operations = {
|
2005-04-17 06:20:36 +08:00
|
|
|
.open = cpuset_open,
|
|
|
|
.read = seq_read,
|
|
|
|
.llseek = seq_lseek,
|
|
|
|
.release = single_release,
|
|
|
|
};
|
2007-10-19 14:39:39 +08:00
|
|
|
#endif /* CONFIG_PROC_PID_CPUSET */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-09-21 17:06:27 +08:00
|
|
|
/* Display task mems_allowed in /proc/<pid>/status file. */
|
2008-02-08 20:18:33 +08:00
|
|
|
void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
|
|
|
|
{
|
|
|
|
seq_printf(m, "Mems_allowed:\t");
|
2008-10-19 11:28:20 +08:00
|
|
|
seq_nodemask(m, &task->mems_allowed);
|
2008-02-08 20:18:33 +08:00
|
|
|
seq_printf(m, "\n");
|
2008-04-09 02:43:03 +08:00
|
|
|
seq_printf(m, "Mems_allowed_list:\t");
|
2008-10-19 11:28:20 +08:00
|
|
|
seq_nodemask_list(m, &task->mems_allowed);
|
2008-04-09 02:43:03 +08:00
|
|
|
seq_printf(m, "\n");
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
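For illustration, a user-space program could pick out the "Mems_allowed_list" line that the function above emits into /proc/<pid>/status. The sketch below is an assumption-laden example, not part of the kernel source: the helper names `parse_mems_allowed_list` and `read_own_mems_allowed_list` are hypothetical, and the parser relies only on the "Mems_allowed_list:\t" prefix and trailing newline shown in the seq_printf() calls.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the value of a "Mems_allowed_list:\t<list>\n" line into out.
 * Returns 0 on success, -1 if the line is a different field or the
 * value does not fit in out.
 */
int parse_mems_allowed_list(const char *line, char *out, size_t outlen)
{
	static const char key[] = "Mems_allowed_list:\t";
	size_t klen = sizeof(key) - 1;
	size_t vlen;

	if (strncmp(line, key, klen) != 0)
		return -1;
	line += klen;
	vlen = strcspn(line, "\n");
	if (vlen >= outlen)
		return -1;
	memcpy(out, line, vlen);
	out[vlen] = '\0';
	return 0;
}

/* Hypothetical helper: scan the calling process's own status file. */
int read_own_mems_allowed_list(char *out, size_t outlen)
{
	char line[4096];
	FILE *f = fopen("/proc/self/status", "r");
	int ret = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (parse_mems_allowed_list(line, out, outlen) == 0) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

The "Mems_allowed:" mask line is skipped by the prefix check, because the prefix requires "_list" before the colon.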