2005-04-17 06:20:36 +08:00
|
|
|
#ifndef _LINUX_MEMPOLICY_H
|
|
|
|
#define _LINUX_MEMPOLICY_H 1
|
|
|
|
|
|
|
|
#include <linux/errno.h>
|
|
|
|
|
|
|
|
/*
|
|
|
|
* NUMA memory policies for Linux.
|
|
|
|
* Copyright 2003,2004 Andi Kleen SuSE Labs
|
|
|
|
*/
|
|
|
|
|
2008-04-28 17:12:25 +08:00
|
|
|
/*
|
|
|
|
* Both the MPOL_* mempolicy mode and the MPOL_F_* optional mode flags are
|
|
|
|
* passed by the user to either set_mempolicy() or mbind() in an 'int' actual.
|
|
|
|
* The MPOL_MODE_FLAGS macro determines the legal set of optional mode flags.
|
|
|
|
*/
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Policies */
|
mempolicy: convert MPOL constants to enum
The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
MPOL_INTERLEAVE, are better declared as part of an enum since they are
sequentially numbered and cannot be combined.
The policy member of struct mempolicy is also converted from type short to
type unsigned short. A negative policy does not have any legitimate meaning,
so it is possible to change its type in preparation for adding optional mode
flags later.
The equivalent member of struct shmem_sb_info is also changed from int to
unsigned short.
For compatibility, the policy formal to get_mempolicy() remains as a pointer
to an int:
int get_mempolicy(int *policy, unsigned long *nmask,
unsigned long maxnode, unsigned long addr,
unsigned long flags);
although the only possible values is the range of type unsigned short.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:23 +08:00
|
|
|
enum {
|
|
|
|
MPOL_DEFAULT,
|
|
|
|
MPOL_PREFERRED,
|
|
|
|
MPOL_BIND,
|
|
|
|
MPOL_INTERLEAVE,
|
|
|
|
MPOL_MAX, /* always last member of enum */
|
|
|
|
};
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-04-28 17:12:25 +08:00
|
|
|
/* Flags for set_mempolicy */
|
mempolicy: add MPOL_F_STATIC_NODES flag
Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
node remap when the policy is rebound.
Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
a union with cpuset_mems_allowed:
struct mempolicy {
...
union {
nodemask_t cpuset_mems_allowed;
nodemask_t user_nodemask;
} w;
}
that stores the the nodemask that the user passed when he or she created the
mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
which is passed with any mempolicy mode, the user's passed nodemask
intersected with the VMA or task's allowed nodes is always used when
determining the preferred node, setting the MPOL_BIND zonelist, or creating
the interleave nodemask. This happens whenever the policy is rebound,
including when a task's cpuset assignment changes or the cpuset's mems are
changed.
This creates an interesting side-effect in that it allows the mempolicy
"intent" to lie dormant and uneffected until it has access to the node(s) that
it desires. For example, if you currently ask for an interleaved policy over
a set of nodes that you do not have access to, the mempolicy is not created
and the task continues to use the previous policy. With this change, however,
it is possible to create the same mempolicy; it is only effected when access
to nodes in the nodemask is acquired.
It is also possible to mount tmpfs with the static nodemask behavior when
specifying a node or nodemask. To do this, simply add "=static" immediately
following the mempolicy mode at mount time:
mount -o remount mpol=interleave=static:1-3
Also removes mpol_check_policy() and folds its logic into mpol_new() since it
is now obsoleted. The unused vma_mpol_equal() is also removed.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:27 +08:00
|
|
|
#define MPOL_F_STATIC_NODES (1 << 15)
|
mempolicy: add MPOL_F_RELATIVE_NODES flag
Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
nodemasks passed via set_mempolicy() or mbind() should be considered relative
to the current task's mems_allowed.
When the mempolicy is created, the passed nodemask is folded and mapped onto
the current task's mems_allowed. For example, consider a task using
set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is
5-7 (the second, third, and fourth node of mems_allowed).
If the same task is attached to a cpuset, the mempolicy nodemask is rebound
each time the mems are changed. Some possible rebinds and results are:
mems result
1-3 1-3
1-7 2-4
1,5-6 1,5-6
1,5-7 5-7
Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
to the resultant nodemask from the relative remap.
In the MPOL_PREFERRED case, the preferred node is remapped from the currently
effected nodemask to the relative nodemask.
This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:30 +08:00
|
|
|
#define MPOL_F_RELATIVE_NODES (1 << 14)
|
mempolicy: add MPOL_F_STATIC_NODES flag
Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
node remap when the policy is rebound.
Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
a union with cpuset_mems_allowed:
struct mempolicy {
...
union {
nodemask_t cpuset_mems_allowed;
nodemask_t user_nodemask;
} w;
}
that stores the the nodemask that the user passed when he or she created the
mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
which is passed with any mempolicy mode, the user's passed nodemask
intersected with the VMA or task's allowed nodes is always used when
determining the preferred node, setting the MPOL_BIND zonelist, or creating
the interleave nodemask. This happens whenever the policy is rebound,
including when a task's cpuset assignment changes or the cpuset's mems are
changed.
This creates an interesting side-effect in that it allows the mempolicy
"intent" to lie dormant and uneffected until it has access to the node(s) that
it desires. For example, if you currently ask for an interleaved policy over
a set of nodes that you do not have access to, the mempolicy is not created
and the task continues to use the previous policy. With this change, however,
it is possible to create the same mempolicy; it is only effected when access
to nodes in the nodemask is acquired.
It is also possible to mount tmpfs with the static nodemask behavior when
specifying a node or nodemask. To do this, simply add "=static" immediately
following the mempolicy mode at mount time:
mount -o remount mpol=interleave=static:1-3
Also removes mpol_check_policy() and folds its logic into mpol_new() since it
is now obsoleted. The unused vma_mpol_equal() is also removed.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:27 +08:00
|
|
|
|
2008-04-28 17:12:25 +08:00
|
|
|
/*
|
|
|
|
* MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
|
|
|
|
* either set_mempolicy() or mbind().
|
|
|
|
*/
|
mempolicy: add MPOL_F_RELATIVE_NODES flag
Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
nodemasks passed via set_mempolicy() or mbind() should be considered relative
to the current task's mems_allowed.
When the mempolicy is created, the passed nodemask is folded and mapped onto
the current task's mems_allowed. For example, consider a task using
set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is
5-7 (the second, third, and fourth node of mems_allowed).
If the same task is attached to a cpuset, the mempolicy nodemask is rebound
each time the mems are changed. Some possible rebinds and results are:
mems result
1-3 1-3
1-7 2-4
1,5-6 1,5-6
1,5-7 5-7
Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
to the resultant nodemask from the relative remap.
In the MPOL_PREFERRED case, the preferred node is remapped from the currently
effected nodemask to the relative nodemask.
This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:30 +08:00
|
|
|
#define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
|
2008-04-28 17:12:25 +08:00
|
|
|
|
|
|
|
/* Flags for get_mempolicy */
|
2005-04-17 06:20:36 +08:00
|
|
|
#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
|
|
|
|
#define MPOL_F_ADDR (1<<1) /* look up vma using address */
|
2007-10-16 16:24:51 +08:00
|
|
|
#define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* Flags for mbind */
|
|
|
|
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
|
2006-01-08 17:00:50 +08:00
|
|
|
#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
|
|
|
|
#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
|
|
|
|
#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#ifdef __KERNEL__
|
|
|
|
|
|
|
|
#include <linux/mmzone.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/rbtree.h>
|
|
|
|
#include <linux/spinlock.h>
|
2005-10-30 09:15:48 +08:00
|
|
|
#include <linux/nodemask.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-06-08 15:43:41 +08:00
|
|
|
struct mm_struct;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Describe a memory policy.
|
|
|
|
*
|
|
|
|
* A mempolicy can be either associated with a process or with a VMA.
|
|
|
|
* For VMA related allocations the VMA policy is preferred, otherwise
|
|
|
|
* the process policy is used. Interrupts ignore the memory policy
|
|
|
|
* of the current process.
|
|
|
|
*
|
|
|
|
* Locking policy for interlave:
|
|
|
|
* In process context there is no locking because only the process accesses
|
|
|
|
* its own state. All vma manipulation is somewhat protected by a down_read on
|
2005-10-30 09:16:41 +08:00
|
|
|
* mmap_sem.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* Freeing policy:
|
2008-04-28 17:12:18 +08:00
|
|
|
* Mempolicy objects are reference counted. A mempolicy will be freed when
|
|
|
|
* mpol_free() decrements the reference count to zero.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* Copying policy objects:
|
2008-04-28 17:12:18 +08:00
|
|
|
* mpol_copy() allocates a new mempolicy and copies the specified mempolicy
|
|
|
|
* to the new storage. The reference count of the new object is initialized
|
|
|
|
* to 1, representing the caller of mpol_copy().
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
struct mempolicy {
|
|
|
|
atomic_t refcnt;
|
mempolicy: convert MPOL constants to enum
The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
MPOL_INTERLEAVE, are better declared as part of an enum since they are
sequentially numbered and cannot be combined.
The policy member of struct mempolicy is also converted from type short to
type unsigned short. A negative policy does not have any legitimate meaning,
so it is possible to change its type in preparation for adding optional mode
flags later.
The equivalent member of struct shmem_sb_info is also changed from int to
unsigned short.
For compatibility, the policy formal to get_mempolicy() remains as a pointer
to an int:
int get_mempolicy(int *policy, unsigned long *nmask,
unsigned long maxnode, unsigned long addr,
unsigned long flags);
although the only possible values is the range of type unsigned short.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:23 +08:00
|
|
|
unsigned short policy; /* See MPOL_* above */
|
2008-04-28 17:12:25 +08:00
|
|
|
unsigned short flags; /* See set_mempolicy() MPOL_F_* above */
|
2005-04-17 06:20:36 +08:00
|
|
|
union {
|
|
|
|
short preferred_node; /* preferred */
|
2008-04-28 17:12:18 +08:00
|
|
|
nodemask_t nodes; /* interleave/bind */
|
2005-04-17 06:20:36 +08:00
|
|
|
/* undefined for default */
|
|
|
|
} v;
|
mempolicy: add MPOL_F_STATIC_NODES flag
Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
node remap when the policy is rebound.
Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
a union with cpuset_mems_allowed:
struct mempolicy {
...
union {
nodemask_t cpuset_mems_allowed;
nodemask_t user_nodemask;
} w;
}
that stores the the nodemask that the user passed when he or she created the
mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
which is passed with any mempolicy mode, the user's passed nodemask
intersected with the VMA or task's allowed nodes is always used when
determining the preferred node, setting the MPOL_BIND zonelist, or creating
the interleave nodemask. This happens whenever the policy is rebound,
including when a task's cpuset assignment changes or the cpuset's mems are
changed.
This creates an interesting side-effect in that it allows the mempolicy
"intent" to lie dormant and uneffected until it has access to the node(s) that
it desires. For example, if you currently ask for an interleaved policy over
a set of nodes that you do not have access to, the mempolicy is not created
and the task continues to use the previous policy. With this change, however,
it is possible to create the same mempolicy; it is only effected when access
to nodes in the nodemask is acquired.
It is also possible to mount tmpfs with the static nodemask behavior when
specifying a node or nodemask. To do this, simply add "=static" immediately
following the mempolicy mode at mount time:
mount -o remount mpol=interleave=static:1-3
Also removes mpol_check_policy() and folds its logic into mpol_new() since it
is now obsoleted. The unused vma_mpol_equal() is also removed.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:27 +08:00
|
|
|
union {
|
|
|
|
nodemask_t cpuset_mems_allowed; /* relative to these nodes */
|
|
|
|
nodemask_t user_nodemask; /* nodemask passed by user */
|
|
|
|
} w;
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Support for managing mempolicy data objects (clone, copy, destroy)
|
|
|
|
* The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
|
|
|
|
*/
|
|
|
|
|
|
|
|
extern void __mpol_free(struct mempolicy *pol);
|
|
|
|
static inline void mpol_free(struct mempolicy *pol)
|
|
|
|
{
|
|
|
|
if (pol)
|
|
|
|
__mpol_free(pol);
|
|
|
|
}
|
|
|
|
|
|
|
|
extern struct mempolicy *__mpol_copy(struct mempolicy *pol);
|
|
|
|
static inline struct mempolicy *mpol_copy(struct mempolicy *pol)
|
|
|
|
{
|
|
|
|
if (pol)
|
|
|
|
pol = __mpol_copy(pol);
|
|
|
|
return pol;
|
|
|
|
}
|
|
|
|
|
|
|
|
#define vma_policy(vma) ((vma)->vm_policy)
|
|
|
|
#define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
|
|
|
|
|
|
|
|
static inline void mpol_get(struct mempolicy *pol)
|
|
|
|
{
|
|
|
|
if (pol)
|
|
|
|
atomic_inc(&pol->refcnt);
|
|
|
|
}
|
|
|
|
|
|
|
|
extern int __mpol_equal(struct mempolicy *a, struct mempolicy *b);
|
|
|
|
static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
|
|
|
|
{
|
|
|
|
if (a == b)
|
|
|
|
return 1;
|
|
|
|
return __mpol_equal(a, b);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Tree of shared policies for a shared memory region.
|
|
|
|
* Maintain the policies in a pseudo mm that contains vmas. The vmas
|
|
|
|
* carry the policy. As a special twist the pseudo mm is indexed in pages, not
|
|
|
|
* bytes, so that we can work with shared memory segments bigger than
|
|
|
|
* unsigned long.
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct sp_node {
|
|
|
|
struct rb_node nd;
|
|
|
|
unsigned long start, end;
|
|
|
|
struct mempolicy *policy;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct shared_policy {
|
|
|
|
struct rb_root root;
|
|
|
|
spinlock_t lock;
|
|
|
|
};
|
|
|
|
|
mempolicy: convert MPOL constants to enum
The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
MPOL_INTERLEAVE, are better declared as part of an enum since they are
sequentially numbered and cannot be combined.
The policy member of struct mempolicy is also converted from type short to
type unsigned short. A negative policy does not have any legitimate meaning,
so it is possible to change its type in preparation for adding optional mode
flags later.
The equivalent member of struct shmem_sb_info is also changed from int to
unsigned short.
For compatibility, the policy formal to get_mempolicy() remains as a pointer
to an int:
int get_mempolicy(int *policy, unsigned long *nmask,
unsigned long maxnode, unsigned long addr,
unsigned long flags);
although the only possible values is the range of type unsigned short.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:12:23 +08:00
|
|
|
void mpol_shared_policy_init(struct shared_policy *info, unsigned short policy,
|
2008-04-28 17:12:25 +08:00
|
|
|
unsigned short flags, nodemask_t *nodes);
|
2005-04-17 06:20:36 +08:00
|
|
|
int mpol_set_shared_policy(struct shared_policy *info,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
struct mempolicy *new);
|
|
|
|
void mpol_free_shared_policy(struct shared_policy *p);
|
|
|
|
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
|
|
|
|
unsigned long idx);
|
|
|
|
|
|
|
|
extern void numa_default_policy(void);
|
|
|
|
extern void numa_policy_init(void);
|
[PATCH] cpuset: numa_policy_rebind cleanup
Cleanup, reorganize and make more robust the mempolicy.c code to rebind
mempolicies relative to the containing cpuset after a tasks memory placement
changes.
The real motivator for this cleanup patch is to lay more groundwork for the
upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
after the containing cpuset memory placement changes.
NUMA mempolicies are constrained by the cpuset their task is a member of.
When either (1) a task is moved to a different cpuset, or (2) the 'mems'
mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
be recalculated, relative to their new cpuset placement.
The old code used an unreliable method of determining what was the old
mems_allowed constraining the mempolicy. It just looked at the tasks
mems_allowed value. This sort of worked with the present code, that just
rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
referring to the old nodes. But in an upcoming patch, the vma mempolicies
will be rebound as well. Then the order in which the various task and vma
mempolicies are updated will no longer be deterministic, and one can no longer
count on the task->mems_allowed holding the old value for as long as needed.
It's not even clear if the current code was guaranteed to work reliably for
task mempolicies.
So I added a mems_allowed field to each mempolicy, stating exactly what
mems_allowed the policy is relative to, and updated synchronously and reliably
anytime that the mempolicy is rebound.
Also removed a useless wrapper routine, numa_policy_rebind(), and had its
caller, cpuset_update_task_memory_state(), call directly to the rewritten
policy_rebind() routine, and made that rebind routine extern instead of
static, and added a "mpol_" prefix to its name, making it
mpol_rebind_policy().
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:56 +08:00
|
|
|
extern void mpol_rebind_task(struct task_struct *tsk,
|
|
|
|
const nodemask_t *new);
|
[PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.
NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset. The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.
An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.
This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired. The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.
Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan. In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock). So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.
Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.
When a task is moved to a different cpuset, it is easier, as there is only one
task involved. It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.
It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice. This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:59 +08:00
|
|
|
extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
|
2006-03-24 19:16:08 +08:00
|
|
|
extern void mpol_fix_fork_child_flag(struct task_struct *p);
|
[PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.
NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset. The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.
An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.
This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired. The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.
Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan. In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock). So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.
Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.
When a task is moved to a different cpuset, it is easier, as there is only one
task involved. It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.
It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice. This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:59 +08:00
|
|
|
|
2006-01-06 16:10:46 +08:00
|
|
|
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
|
2008-04-28 17:12:18 +08:00
|
|
|
unsigned long addr, gfp_t gfp_flags,
|
|
|
|
struct mempolicy **mpol, nodemask_t **nodemask);
|
2006-01-19 09:42:36 +08:00
|
|
|
extern unsigned slab_node(struct mempolicy *policy);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-09-26 14:31:18 +08:00
|
|
|
extern enum zone_type policy_zone;
|
2006-01-06 16:11:17 +08:00
|
|
|
|
2006-09-26 14:31:18 +08:00
|
|
|
static inline void check_highest_zone(enum zone_type k)
|
2006-01-06 16:11:17 +08:00
|
|
|
{
|
2007-08-23 05:02:05 +08:00
|
|
|
if (k > policy_zone && k != ZONE_MOVABLE)
|
2006-01-06 16:11:17 +08:00
|
|
|
policy_zone = k;
|
|
|
|
}
|
|
|
|
|
2006-01-08 17:00:51 +08:00
|
|
|
int do_migrate_pages(struct mm_struct *mm,
|
|
|
|
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#else
|
|
|
|
|
|
|
|
struct mempolicy {};
|
|
|
|
|
|
|
|
static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mpol_free(struct mempolicy *p)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mpol_get(struct mempolicy *pol)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct mempolicy *mpol_copy(struct mempolicy *old)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct shared_policy {};
|
|
|
|
|
|
|
|
static inline int mpol_set_shared_policy(struct shared_policy *info,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
struct mempolicy *new)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2006-01-15 05:20:48 +08:00
|
|
|
static inline void mpol_shared_policy_init(struct shared_policy *info,
|
2008-04-28 17:12:25 +08:00
|
|
|
unsigned short policy, unsigned short flags, nodemask_t *nodes)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mpol_free_shared_policy(struct shared_policy *p)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct mempolicy *
|
|
|
|
mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
#define vma_policy(vma) NULL
|
|
|
|
#define vma_set_policy(vma, pol) do {} while(0)
|
|
|
|
|
|
|
|
static inline void numa_policy_init(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void numa_default_policy(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
[PATCH] cpuset: numa_policy_rebind cleanup
Cleanup, reorganize and make more robust the mempolicy.c code to rebind
mempolicies relative to the containing cpuset after a tasks memory placement
changes.
The real motivator for this cleanup patch is to lay more groundwork for the
upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
after the containing cpuset memory placement changes.
NUMA mempolicies are constrained by the cpuset their task is a member of.
When either (1) a task is moved to a different cpuset, or (2) the 'mems'
mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
be recalculated, relative to their new cpuset placement.
The old code used an unreliable method of determining what was the old
mems_allowed constraining the mempolicy. It just looked at the tasks
mems_allowed value. This sort of worked with the present code, that just
rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
referring to the old nodes. But in an upcoming patch, the vma mempolicies
will be rebound as well. Then the order in which the various task and vma
mempolicies are updated will no longer be deterministic, and one can no longer
count on the task->mems_allowed holding the old value for as long as needed.
It's not even clear if the current code was guaranteed to work reliably for
task mempolicies.
So I added a mems_allowed field to each mempolicy, stating exactly what
mems_allowed the policy is relative to, and updated synchronously and reliably
anytime that the mempolicy is rebound.
Also removed a useless wrapper routine, numa_policy_rebind(), and had its
caller, cpuset_update_task_memory_state(), call directly to the rewritten
policy_rebind() routine, and made that rebind routine extern instead of
static, and added a "mpol_" prefix to its name, making it
mpol_rebind_policy().
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:56 +08:00
|
|
|
static inline void mpol_rebind_task(struct task_struct *tsk,
|
[PATCH] cpusets: automatic numa mempolicy rebinding
This patch automatically updates a tasks NUMA mempolicy when its cpuset
memory placement changes. It does so within the context of the task,
without any need to support low level external mempolicy manipulation.
If a system is not using cpusets, or if running on a system with just the
root (all-encompassing) cpuset, then this remap is a no-op. Only when a
task is moved between cpusets, or a cpusets memory placement is changed
does the following apply. Otherwise, the main routine below,
rebind_policy() is not even called.
When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a jobs processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
memory placement within its allowed cpuset.
However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system wide numbering, not cpuset relative numbering.
This is ok until a job is migrated to a different cpuset, or what's the
same, a jobs cpuset is moved to different CPUs and Memory Nodes.
Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated, to preserve their cpuset-relative position. This can
be done for CPU affinity using sched_setaffinity() from user code, as one
task can modify anothers CPU affinity. This cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.
However, it easy enough to remap a tasks NUMA mempolicy automatically when
a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a tasks memory placement after its cpuset has changed. All that
is needed is the old and new nodemask, and notice to the task that it needs
to rebind its mempolicy. The tasks mems_allowed has the old mask, the
tasks cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice. The
bitmap/cpumask/nodemask remap operators provide the cpuset relative
calculations.
This patch leaves open a couple of issues:
1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
These mempolicies may reference nodes outside of those allowed to
the current task by its cpuset. Tasks are migrated as part of jobs,
which reside on what might be several cpusets in a subtree. When such
a job is migrated, all NUMA memory policy references to nodes within
that cpuset subtree should be translated, and references to any nodes
outside that subtree should be left untouched. A future patch will
provide the cpuset mechanism needed to mark such subtrees. With that
patch, we will be able to correctly migrate these other memory policies
across a job migration.
2) Updating cpuset, affinity and memory policies in user space:
This is harder. Any placement state stored in user space using
system-wide numbering will be invalidated across a migration. More
work will be required to provide user code with a migration-safe means
to manage its cpuset relative placement, while preserving the current
API's that pass system wide numbers, not cpuset relative numbers across
the kernel-user boundary.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31 07:02:36 +08:00
|
|
|
const nodemask_t *new)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
[PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.
NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset. The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.
An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.
This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired. The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.
Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan. In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock). So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.
Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.
When a task is moved to a different cpuset, it is easier, as there is only one
task involved. It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.
It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice. This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:01:59 +08:00
|
|
|
static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2006-03-24 19:16:08 +08:00
|
|
|
static inline void mpol_fix_fork_child_flag(struct task_struct *p)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2006-01-06 16:10:46 +08:00
|
|
|
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
|
2008-04-28 17:12:18 +08:00
|
|
|
unsigned long addr, gfp_t gfp_flags,
|
|
|
|
struct mempolicy **mpol, nodemask_t **nodemask)
|
2006-01-06 16:10:46 +08:00
|
|
|
{
|
2008-04-28 17:12:18 +08:00
|
|
|
*mpol = NULL;
|
|
|
|
*nodemask = NULL;
|
2008-04-28 17:12:14 +08:00
|
|
|
return node_zonelist(0, gfp_flags);
|
2006-01-06 16:10:46 +08:00
|
|
|
}
|
|
|
|
|
[PATCH] cpusets: swap migration interface
Add a boolean "memory_migrate" to each cpuset, represented by a file
containing "0" or "1" in each directory below /dev/cpuset.
It defaults to false (file contains "0"). It can be set true by writing
"1" to the file.
If true, then anytime that a task is attached to the cpuset so marked, the
pages of that task will be moved to that cpuset, preserving, to the extent
practical, the cpuset-relative placement of the pages.
Also anytime that a cpuset so marked has its memory placement changed (by
writing to its "mems" file), the tasks in that cpuset will have their pages
moved to the cpusets new nodes, preserving, to the extent practical, the
cpuset-relative placement of the moved pages.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:00:56 +08:00
|
|
|
static inline int do_migrate_pages(struct mm_struct *mm,
|
|
|
|
const nodemask_t *from_nodes,
|
|
|
|
const nodemask_t *to_nodes, int flags)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-01-06 16:11:17 +08:00
|
|
|
static inline void check_highest_zone(int k)
|
|
|
|
{
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* CONFIG_NUMA */
|
|
|
|
#endif /* __KERNEL__ */
|
|
|
|
|
|
|
|
#endif
|