This introduces a limit parameter to the core bootmem allocator; The new
parameter indicates that physical memory allocated by the bootmem
allocator should be within the requested limit.
We also introduce alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit,
alloc_bootmem_low_pages_node_limit apis, but alloc_bootmem_low_pages_limit
is the only api used for swiotlb.
The existing alloc_bootmem_low_pages() api could instead have been
changed and made to pass right limit to the core allocator. But that
would make the patch more intrusive for 2.6.14, as other arches use
alloc_bootmem_low_pages(). We may be done that post 2.6.14 as a
cleanup.
With this, swiotlb gets memory within 4G for both x86_64 and ia64
arches.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
forgets that: crashes and misaccounting ensue.
copy_hugetlb_page_range better grab the src page_table_lock since we don't
want to guess what happens if concurrently truncated. unmap_hugepage_range
rss accounting must not assume the full range was mapped. follow_hugetlb_page
must guard with page_table_lock and be prepared to exit early.
Restyle copy_hugetlb_page_range with a for loop like the others there.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The hugetlb pages are currently pre-faulted. At the time of mmap of
hugepages, we populate the new PTEs. It is possible that HW has already
cached some of the unused PTEs internally. These stale entries never
get a chance to be purged in existing control flow.
This patch extends the check in page fault code for hugepages. Check if
a faulted address falls with in size for the hugetlb file backing it.
We return VM_FAULT_MINOR for these cases (assuming that the arch
specific page-faulting code purges the stale entry for the archs that
need it).
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
[ This is apparently arguably an ia64 port bug. But the code won't
hurt, and for now it fixes a real problem on some ia64 machines ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As noticed by Nick Piggin, we need to make sure that we check the page
count before we check for PageDirty, since the dirty check is only valid
if the count implies that we're the only possible ones holding the page.
We always did do this, but the code needs a read-memory-barrier to make
sure that the orderign is also honored by the CPU.
(The writer side is ordered due to the atomic decrement and test on the
page count, see the discussion on linux-kernel)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Refuse to install a page into a mapping if the mapping count is already
ridiculously large.
You probably cannot trigger this on 32-bit architectures, but on a
64-bit setup we should protect against it.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Revert this recent correctness change: Douglas Crosher <dcrosher@scieneer.com>
reported that it broke an existing application, and that madvise() works
without error on anonymous mappings on Solaris.
This means that madvise() will remain non-standards-compliant: we should
return -EBADF for all requests against non-file-backed vma's, but Linux only
does this for MADV_WILLNEED requests.
Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As requested by Thomas Gleixner <tglx@linutronix.de>:
"5d3d0f7704ed0bc7eaca0501eeae3e5da1ea6c87 breaks a couple of ARM
boards, which depend on the historical bootmem allocation order.
There is a cleaner solution around to remove the pgdat list
completely, but this is a topic for post 2.6.14
Andi signalled ACK already."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In kmalloc_node we are checking if the allocation is for the same node when
interrupts are "on". This may lead to an allocation on another node than
intended.
This patch just shifts the check for the current node in __cache_alloc_node
when interrupts are disabled.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the ZERO_PAGE remapping complexity to the move_pte macro in
asm-generic, have it conditionally depend on
__HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS.
For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes
a noop.
From: Hugh Dickins <hugh@veritas.com>
Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch.
The "pte" at that point may be a swap entry or a pte_file entry: we must
check pte_present before perhaps corrupting such an entry.
Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's
mm/mremap.c, and more dangerous there since it's affecting all arches: I
think the safest course is to send Nick's patch and Yoichi's build fix and
this fix (build tested) on to Linus - so only MIPS can be affected.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As davem points out, this wasn't such a great idea. There may be some code
which does:
size = 1024*1024;
while (kmalloc(size, ...) == 0)
size /= 2;
which will now explode.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Problem: In some circumstances, bd_claim() is returning the wrong error
code.
If we try to swapon an unused block device that isn't swap formatted, we
get -EINVAL. But if that same block device is already mounted, we instead
get -EBUSY, even though it still isn't a valid swap device.
This issue came up on the busybox list trying to get the error message
from "swapon -a" right. If a swap device is already enabled, we get -EBUSY,
and we shouldn't report this as an error. But we can't distinguish the two
-EBUSY conditions, which are very different errors.
In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy
means "somebody other than sys_swapon has already claimed this", and
_that_ means this block device can't be a valid swap device. So return
-EINVAL there.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I had an issue on ia64 where I got a bug in kernel/workqueue because
kzalloc returned a NULL pointer due to the task structure getting too big
for the slab allocator. Usually these cases are caught by the kmalloc
macro in include/linux/slab.h.
Compilation will fail if a too big value is passed to kmalloc.
However, kzalloc uses __kmalloc which has no check for that. This patch
makes __kmalloc bug if a too large entity is requested.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The numa slab allocator may allocate pages from foreign nodes onto the
lists for a particular node if a node runs out of memory. Inspecting the
slab->nodeid field will not reflect that the page is now in use for the
slabs of another node.
This patch fixes that issue by adding a node field to free_block so that
the caller can indicate which node currently uses a slab.
Also removes the check for the current node from kmalloc_cache_node since
the process may shift later to another node which may lead to an allocation
on another node than intended.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is essential that index_of() be inlined. But alpha undoes the gcc
inlining hackery and index_of() ends up out-of-line. So fiddle with things
to make that function inline again.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh made me note this line for permission checking in mprotect():
if ((newflags & ~(newflags >> 4)) & 0xf) {
after figuring out what's that about, I decided it's nasty enough. Btw
Hugh itself didn't like the 0xf.
We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change
VM_SHARED, so no need to check that.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
That comment is plain wrong (we even take the pagetable lock inside
unmap_region()).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With the new changes that we made in the initialization of the slab
allocator, we first setup the cache from which array caches are allocated,
and then the cache, from which kmem_list3's are allocated.
Now if the array cache comes from a cache in which objsize > 32, (in this
instance size-64) then, first size-64 cache will be allocated and then the
size-128 (if this is the cache from which kmem_list3's are going to be
allocated).
So with these new changes, we are not guaranteed that we will be
initializing the malloc_sizes array in a serialized order. Thus there is
a bug in __find_general_cachep, as we are checking whether the first
cache_sizes ptr is NULL.
This is replaced by checking whether the array-cache cache is initialized.
Attached is a patch which does that. Boots fine on a x86-64, with
DEBUG_SPIN, DEBUG_SLAB, and preempt.
Attached is a patch which does that. Boots fine on a x86-64, with
DEBUG_SPIN, DEBUG_SLAB, and preempt.Thanks & Regards, Alok
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhitdayal.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pavel Emelianov and Kirill Korotaev observe that fs and arch users of
security_vm_enough_memory tend to forget to vm_unacct_memory when a
failure occurs further down (typically in setup_arg_pages variants).
These are all users of insert_vm_struct, and that reservation will only
be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some
cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't.
So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into
Committed_AS each time they're run. But don't add VM_ACCOUNT to them,
it's inappropriate to reserve against the very unlikely case that gdb
be used to COW a vDSO page - we ought to do something about that in
do_wp_page, but there are yet other inconsistencies to be resolved.
The safe and economical way to fix this is to let insert_vm_struct do
the security_vm_enough_memory check when it finds VM_ACCOUNT is set.
And the MIPS irix_brk has been calling security_vm_enough_memory before
calling do_brk which repeats it, doubly accounting and so also leaking.
Remove that, and all the fs and arch calls to security_vm_enough_memory:
give it a less misleading name later on.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the add_taint() interface for setting tainted bit flags instead of
doing it manually.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There was a pretty bad bug in there that the code would always check the full
VMA, not the range the user requested.
When the VMA to be checked was merged with the previous VMA this could lead to
spurious failures.
Signed-off-by: "Andi Kleen" <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the pgdat pointer we've already defined in wakeup_kswapd
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This leads to bootmem allocating first from node 0 instead
of from the last node. This avoids swiotlb allocating on the last node, which
doesn't really work on a machine with >4GB.
Note: there is a better patch around from someone else that gets
rid of the pgdat list completely.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move call to get_mm_counter() in update_mem_hiwater() to be
inside the check for tsk->mm being null. Otherwise you can be
following a null pointer here. This patch submitted by
Javier Herrero <jherrero@hvsistemas.es>.
Modify the end check for munmap regions to allow for the
legacy behavior of 0 being valid. Pretty much all current
uClinux system libc malloc's pass in 0 as the end point.
A hard check will fail on these, so change the check so
that if it is non-zero it must be valid otherwise it fails.
A passed in value will always succeed (as it used too).
Also export a few more mm system functions - to be consistent
with the VM code exports.
Signed-off-by: Greg Ungerer <gerg@uclinux.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the sparse warning "implicit cast to nocast type"
Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the sparse warning "implicit cast to nocast type"
Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With Nick Piggin <npiggin@suse.de>
Give some things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up timer initialization by introducing DEFINE_TIMER a'la
DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been
been in the -RT tree for some time.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch clarifies NULL handling of kfree() and vfree(). I addition,
wording of calling context restriction for vfree() and vunmap() are changed
from "may not" to "must not."
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NUMA API change that introduced kmalloc_node was accepted for
2.6.12-rc3. Now it is possible to do slab allocations on a node to
localize memory structures. This API was used by the pageset localization
patch and the block layer localization patch now in mm. The existing
kmalloc_node is slow since it simply searches through all pages of the slab
to find a page that is on the node requested. The two patches do a one
time allocation of slab structures at initialization and therefore the
speed of kmalloc node does not matter.
This patch allows kmalloc_node to be as fast as kmalloc by introducing node
specific page lists for partial, free and full slabs. Slab allocation
improves in a NUMA system so that we are seeing a performance gain in AIM7
of about 5% with this patch alone.
More NUMA localizations are possible if kmalloc_node operates in an fast
way like kmalloc.
Test run on a 32p systems with 32G Ram.
w/o patch
Tasks jobs/min jti jobs/min/task real cpu
1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005
100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005
200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005
300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005
400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005
500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005
600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005
700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005
800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005
900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005
1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005
w/patch
Tasks jobs/min jti jobs/min/task real cpu
1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005
100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005
200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005
300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005
400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005
500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005
600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005
700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005
800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005
900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005
1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005
These are measurement taken directly after boot and show a greater
improvement than 5%. However, the performance improvements become less
over time if the AIM7 runs are repeated and settle down at around 5%.
Links to earlier discussions:
http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2
Changelog V4-V5:
- alloc_arraycache and alloc_aliencache take node parameter instead of cpu
- fix initialization so that nodes without cpus are properly handled.
- simplify code in kmem_cache_init
- patch against Andrews temp mm3 release
- Add Shai to credits
- fallback to __cache_alloc from __cache_alloc_node if the node's cache
is not available yet.
Changelog V3-V4:
- Patch against 2.6.12-rc5-mm1
- Cleanup patch integrated
- More and better use of for_each_node and for_each_cpu
- GCC 2.95 fix (do not use [] use [0])
- Correct determination of INDEX_AC
- Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes.
- Remove list3_data and list3_data_ptr macros for better readability
Changelog V2-V3:
- Made to patch against 2.6.12-rc4-mm1
- Revised bootstrap mechanism so that larger size kmem_list3 structs can be
supported. Do a generic solution so that the right slab can be found
for the internal structs.
- use for_each_online_node
Changelog V1-V2:
- Batching for freeing of wrong-node objects (alien caches)
- Locking changes and NUMA #ifdefs as requested by Manfred
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch modifies tmpfs to call the inode_init_security LSM hook to set
up the incore inode security state for new inodes before the inode becomes
accessible via the dcache.
As there is no underlying storage of security xattrs in this case, it is
not necessary for the hook to return the (name, value, len) triple to the
tmpfs code, so this patch also modifies the SELinux hook function to
correctly handle the case where the (name, value, len) pointers are NULL.
The hook call is needed in tmpfs in order to support proper security
labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With
this change in place, we should then be able to remove the
security_inode_post_create/mkdir/... hooks safely.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update the file systems in fs/ implementing a delete_inode() callback to
call truncate_inode_pages(). One implementation note: In developing this
patch I put the calls to truncate_inode_pages() at the very top of those
filesystems delete_inode() callbacks in order to retain the previous
behavior. I'm guessing that some of those could probably be optimized.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Run PCI driver initialization on local node
Instead of adding messy kmalloc_node()s everywhere run the
PCI driver probe on the node local to the device.
This would not have helped for IDE, but should for
other more clean drivers that do more initialization in probe().
It won't help for drivers that do most of the work
on first open (like many network drivers)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch introduces a kzalloc wrapper and converts kernel/ to use it. It
saves a little program text.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now the real motivation for this cpuset mem_exclusive patch series seems
trivial.
This patch keeps a task in or under one mem_exclusive cpuset from provoking an
oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
containment, there is little to gain from oom killing a task under a
non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
allocation must come from disjoint memory nodes.
This patch enables configuring a system so that a runaway job under one
mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
that might be using very high compute and memory resources for a prolonged
time.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution. With this patch, there are now the following four layers of
memory placement available:
1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current tasks cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.
These nest - each layer is a subset (same or within) of the previous.
Layer (2) above is new, with this patch. The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed. The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.
GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.
The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such. Swapper and oom_kill activity is also constrained to Layer (2). A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset. Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.
This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.
Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch series extends the use of the cpuset attribute 'mem_exclusive'
to support cpuset configurations that:
1) allow GFP_KERNEL allocations to come from a potentially larger
set of memory nodes than GFP_USER allocations, and
2) can constrain the oom killer to tasks running in cpusets in
a specified subtree of the cpuset hierarchy.
Here's an example usage scenario. For a few hours or more, a large NUMA
system at a University is to be divided in two halves, with a bunch of student
jobs running in half the system under some form of batch manager, and with a
big research project running in the other half. Each of the student jobs is
placed in a small cpuset, but should share the classic Unix time share
facilities, such as buffered pages of files in /bin and /usr/lib. The big
research project wants no interference whatsoever from the student jobs, and
has highly tuned, unusual memory and i/o patterns that intend to make full use
of all the main memory on the nodes available to it.
In this example, we have two big sibling cpusets, one of which is further
divided into a more dynamic set of child cpusets.
We want kernel memory allocations constrained by the two big cpusets, and user
allocations constrained by the smaller child cpusets where present. And we
require that the oom killer not operate across the two halves of this system,
or else the first time a student job runs amuck, the big research project will
likely be first inline to get shot.
Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project
really does run amuck allocating memory, it should be shot, not some other
task outside the research projects mem_exclusive cpuset.
I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage
such scenarios. Let memory allocations for user space (GFP_USER) be
constrained by a tasks current cpuset, but memory allocations for kernel space
(GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the
current cpuset, even though kernel space allocations will still _prefer_ to
remain within the current tasks cpuset, if memory is easily available.
Let the oom killer be constrained to consider only tasks that are in
overlapping mem_exclusive cpusets (it won't help much to kill a task that
normally cannot allocate memory on any of the same nodes as the ones on which
the current task can allocate.)
The current constraints imposed on setting mem_exclusive are unchanged. A
cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a
mem_exclusive cpuset may not overlap any of its siblings memory nodes.
This patch was presented on linux-mm in early July 2005, though did not
generate much feedback at that time. It has been built for a variety of
arch's using cross tools, and built, booted and tested for function on SN2
(ia64).
There are 4 patches in this set:
1) Some minor cleanup, and some improvements to the code layout
of one routine to make subsequent patches cleaner.
2) Add another GFP flag - __GFP_HARDWALL. It marks memory
requests for USER space, which are tightly confined by the
current tasks cpuset.
3) Now memory requests (such as KERNEL) that not marked HARDWALL can
if short on memory, look in the potentially larger pool of memory
defined by the nearest mem_exclusive ancestor cpuset of the current
tasks cpuset.
4) Finally, modify the oom killer to skip any task whose mem_exclusive
cpuset doesn't overlap ours.
Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32
bytes of kernel text space. Patch (2) has no affect on the size of kernel
text space (it just adds a preprocessor flag). Patches (3) and (4) added
about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which
matters only if CONFIG_CPUSET is enabled.
This patch:
This patch applies a few comment and code cleanups to mm/oom_kill.c prior to
applying a few small patches to improve cpuset management of memory placement.
The comment changed in oom_kill.c was seriously misleading. The code layout
change in select_bad_process() makes room for adding another condition on
which a process can be spared the oom killer (see the subsequent
cpuset_nodes_overlap patch for this addition).
Also a couple typos and spellos that bugged me, while I was here.
This patch should have no material affect.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark variables which are usually accessed for reads with __readmostly.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We don't reset the cache hit count until after readahead does a successful
readahead. This seems to leave a corner case open where we miss in cache,
but don't restart the readhead right away.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move some more frequently read variables that showed up during some of our
performance tests as sometimes ending up in hot cachelines to the
read_mostly section.
Fix: Move the __read_mostly from before hpet_usec_quotient to follow the
variable like the other uses of __read_mostly.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch modifies the VFS setxattr, getxattr, and listxattr code to fall
back to the security module for security xattrs if the filesystem does not
support xattrs natively. This allows security modules to export the incore
inode security label information to userspace even if the filesystem does
not provide xattr storage, and eliminates the need to individually patch
various pseudo filesystem types to provide such access. The patch removes
the existing xattr code from devpts and tmpfs as it is then no longer
needed.
The patch restructures the code flow slightly to reduce duplication between
the normal path and the fallback path, but this should only have one
user-visible side effect - a program may get -EACCES rather than
-EOPNOTSUPP if policy denied access but the filesystem didn't support the
operation anyway. Note that the post_setxattr hook call is not needed in
the fallback case, as the inode_setsecurity hook call handles the incore
inode security state update directly. In contrast, we do call fsnotify in
both cases.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add page_state info to the per-node meminfo file in sysfs. This is mostly
just for informational purposes.
The lack of this information was brought up recently during a discussion
regarding pagecache clearing, and I put this patch together to test out one
of the suggestions.
It seems like interesting info to have, so I'm submitting the patch.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Proposed by and based on a patch from Eric Dumazet <dada1@cosmosbay.com>:
This patch removes unnecessary critical section in ksize() function, as
cli/sti are rather expensive on modern CPUS.
It additionally adds a docbook entry for ksize() and further simplifies the
code.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mostobjects returned by __cache_alloc() will be written by the caller,
(but not all callers want to write all the object, but just at the
begining) prefetchw() tells the modern CPU to think about the future
writes, ie start some memory transactions in advance.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space. Removing the
locked operation should allow better pipelining of memory access in this
loop. I measured an average savings of 30-35 cycles per zap_pte_range on
the first 500 destructions on Pentium-M, but I believe the optimization
would win more on older processors which still assert the bus lock on xchg
for an exclusive cacheline.
Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
full address space destruction.
pte_clear_full is not yet used, but is provided for future optimizations
(in particular, when running inside of a hypervisor that queues page table
updates, the full hint allows us to avoid queueing unnecessary page table
update for an address space in the process of being destroyed.
This is not a huge win, but it does help a bit, and sets the stage for
further hypervisor optimization of the mm layer on all architectures.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is used only in slab.c and each architecture gets to define whcih
underlying type is to be used.
Seems a bit silly - move it to slab.c and use the same type for all
architectures: unsigned int.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial Post (Wed, 17 Aug 2005)
This patch moves the
if (! pte_none(*pte))
hugetlb_clean_stale_pgtable(pte);
logic into huge_pte_alloc() so all of its callers can be immune to the bug
described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246
> It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
> huge_pte_alloc() might return a pmd that points to a PTE page. It happens
> if the virtual address for hugetlb mmap is recycled from previously used
> normal page mmap. free_pgtables() might not scrub the pmd entry on
> munmap and hugetlb_prefault skips on any pmd presence regardless what type
> it is.
Unless I am missing something, it seems more correct to place the check inside
huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated.
It also allows checking for this condition when lazily faulting huge pages
later in the series.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Version 6 of the ARM architecture introduces the concept of 16MB pages
(supersections) and 36-bit (40-bit actually, but nobody uses this) physical
addresses. 36-bit addressed memory and I/O and ARMv6 can only be mapped
using supersections and the requirement on these is that both virtual and
physical addresses be 16MB aligned. In trying to add support for ioremap()
of 36-bit I/O, we run into the issue that get_vm_area() allows for a
maximum of 512K alignment via the IOREMAP_MAX_ORDER constant. To work
around this, we can:
- Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER))
and then align the pointer ourselves, but this ends up with 512K of
wasted VM per ioremap().
- Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit
on top of this. I did this and it works but I don't like the idea
adding another VM API just for this one case.
- My preferred solution which is to allow the architecture to override
the IOREMAP_MAX_ORDER constant with it's own version.
Signed-off-by: Deepak Saxena <dsaxena@plexity.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If !vma->vm-ops we already BUG above, so retesting it is useless. The
compiler cannot optimize this because BUG is a macro and is not thus marked
noreturn; that should possibly be fixed.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Either shmem_getpage returns a failure, or it found a page, or it was told
it couldn't do any I/O. So it's useless to check nonblock in the else
branch. We could add a BUG() there but I preferred to comment the
offending function.
This was taken out from one Ingo Molnar's old patch I'm resurrecting.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a small spelling mistake. subtile->subtle
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Better late than never, I've at last reviewed the madvise vma merging
going into 2.6.13. Remove a pointless check and fix two little bugs -
a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed
both mismerges in practice: though being madvise, neither was disastrous.
1. Correct placement of the success label in madvise_behavior: as in
mprotect_fixup and mlock_fixup, it is necessary to update vm_flags
when vma_merge succeeds (to handle the exceptional Case 8 noted in
the comments above vma_merge itself).
2. Correct initial value of prev when starting part way into a vma: as
in sys_mprotect and do_mlock, it needs to be set to vma in this case
(vma_merge handles only that minimum of cases shown in its comments).
3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next,
so it's pointless to make that same assignment again in sys_madvise.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter and Marcelo Tosatti asked to get rid of the
atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a capability check to sys_set_zone_reclaim(). This syscall is not
something that should be available to a user.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This bitop does not need to be atomic because it is performed when there will
be no references to the page (ie. the page is being freed).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it
has no data page to map there: then if the hole in the file is later filled,
__xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that
offset by the newly allocated file page, so that established mappings will see
the newly written data.
However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs,
chosen for coloring by user virtual address; and if mremap has meanwhile been
used to move a mapping containing a ZERO_PAGE, it will generally not match the
ZERO_PAGE(address) __xip_unmap is looking for.
To maintain XIP's established mappings correctly on MIPS, we need Nick's fix
to mremap's move_one_page (originally presented as an optimization), to
replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE
appropriate to the new address.
(But when I first saw this, I was thinking the ZERO_PAGEs themselves would get
corrupted, very bad. Now I think it's the other way round, that the
established mappings will fail to see the newly written data: incorrect, but
not corrupting everything else. Whether filemap_xip's technique is generally
safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried
to do that in tmpfs.)
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thanks to Bill Irwin for pointing this out.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Microoptimise page_add_anon_rmap. Although these expressions are used only in
the taken branch of the if() statement, the compiler can't reorder them inside
because atomic_inc_and_test is a barrier.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Just be clear that VM_RESERVED pages here are a bug, and the test is not there
because they are expected.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch was recently discussed on linux-mm:
http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2
I inherited a large code base from Ray for page migration. There was a
small patch in there that I find to be very useful since it allows the
display of the locality of the pages in use by a process. I reworked that
patch and came up with a /proc/<pid>/numa_maps that gives more information
about the vma's of a process. numa_maps is indexes by the start address
found in /proc/<pid>/maps. F.e. with this patch you can see the page use
of the "getty" process:
margin:/proc/12008 # cat maps
00000000-00004000 r--p 00000000 00:00 0
2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so
2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so
2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE
2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache
2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty
6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty
6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap]
60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack]
a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso]
cat numa_maps
2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3
2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1
4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
getty uses ld.so. The first vma is the code segment which is used by 43
other processes and the pages are evenly distributed over the 4 nodes.
The second vma is the process specific data portion for ld.so. This is
only one page.
The display format is:
<startaddress> Links to information in /proc/<pid>/map
<memory policy> This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}"
MaxRef= <maximum reference to a page in this vma>
Pages= <Nr of pages in use>
Mapped= <Nr of pages with mapcount >
Anon= <nr of anonymous pages>
Nx= <Nr of pages on Node x>
The content of the proc-file is self-evident. If this would be tied into
the sparsemem system then the contents of this file would not be too
useful.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a
time when testing rss was important to avoid a particular race between
dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo-
optimization, made even more obscure by the get_mm_counter macro.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Three of the four BUG_ONs in delete_from_swap_cache are immediately
repeated in __delete_from_swap_cache: delete those and add the one. But
perhaps mm/ is altogether overprovisioned with historic BUGs?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The idea of a swap_device_lock per device, and a swap_list_lock over them all,
is appealing; but in practice almost every holder of swap_device_lock must
already hold swap_list_lock, which defeats the purpose of the split.
The only exceptions have been swap_duplicate, valid_swaphandles and an
untrodden path in try_to_unuse (plus a few places added in this series).
valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
demand attention. However, with the hold time in get_swap_pages so much
reduced, I've not yet found a load and set of swap device priorities to show
even swap_duplicate benefitting from the split. Certainly the split is mere
overhead in the common case of a single swap device.
So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
(generally we seem to prefer an _ in the name, and not hide in a macro).
If someone can show a regression in swap_duplicate, then probably we should
add a hashlock for the swap_map entries alone (shorts being anatomic), so as
to help the case of the single swap device too.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The get_swap_page/scan_swap_map latency can be so bad that even those without
preemption configured deserve relief: periodically cond_resched.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
get_swap_page has often shown up on latency traces, doing lengthy scans while
holding two spinlocks. swap_list_lock is already dropped, now scan_swap_map
drop swap_device_lock before scanning the swap_map.
While scanning for an empty cluster, don't worry that racing tasks may
allocate what was free and free what was allocated; but when allocating an
entry, check it's still free after retaking the lock. Avoid dropping the lock
in the expected common path. No barriers beyond the locks, just let the
cookie crumble; highest_bit limit is volatile, but benign.
Guard against swapoff: must check SWP_WRITEOK before allocating, must raise
SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to
fall - just use schedule_timeout, we don't want to burden scan_swap_map
itself, and it's very unlikely that anyone can really still be in
scan_swap_map once swapoff gets this far.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rewrite scan_swap_map to allocate in just the same way as before (taking the
next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly
empty cluster, falling back to lowest entry if none), but with a view towards
dropping the lock in the next patch.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rewrite get_swap_page to allocate in just the same sequence as before, but
without holding swap_list_lock across its scan_swap_map. Decrement
nr_swap_pages and update swap_list.next in advance, while still holding
swap_list_lock. Skip full devices by testing highest_bit. Swapoff hold
swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. Reduces lock
contention when there are parallel swap devices of the same priority.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes negligible difference in practice: but swap_list.next should not be
updated to a higher prio in the general helper swap_info_get, but rather in
swap_entry_free; and then only in the case when entry is actually freed.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The swap header's unsigned int last_page determines the range of swap pages,
but swap_info has been using int or unsigned long in some cases: use unsigned
int throughout (except, in several places a local unsigned long is useful to
avoid overflows when adding).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The "Adding %dk swap" message shows the number of swap extents, as a guide to
how fragmented the swapfile may be. But a useful further guide is what total
extent they span across (sometimes scarily large).
And there's no need to keep nr_extents in swap_info: it's unused after the
initial message, so save a little space by keeping it on stack.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are several comments that swap's extent_list.prev points to the lowest
extent: that's not so, it's extent_list.next which points to it, as you'd
expect. And a couple of loops in add_swap_extent which go all the way through
the list, when they should just add to the other end.
Fix those up, and let map_swap_page search the list forwards: profiles shows
it to be twice as quick that way - because prefetch works better on how the
structs are typically kmalloc'ed? or because usually more is written to than
read from swap, and swap is allocated ascendingly?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_swapon's call to destroy_swap_extents on failure is made after the final
swap_list_unlock, which is faintly unsafe: another sys_swapon might already be
setting up that swap_info_struct. Calling it earlier, before taking
swap_list_lock, is safe. sys_swapoff's call to destroy_swap_extents was safe,
but likewise move it earlier, before taking the locks (once try_to_unuse has
completed, nothing can be needing the swap extents).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a regular swapfile lies on a filesystem whose blocksize is less than
PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap
pages; but sys_swapon's nr_good_pages was not expecting that. Also,
setup_swap_extents takes no account of badpages listed in the swap header: not
worth doing so, but ensure nr_badpages is 0 for a regular swapfile.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This splits up sparse_index_alloc() into two pieces. This is needed
because we'll allocate the memory for the second level in a different place
from where we actually consume it to keep the allocation from happening
underneath a lock
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With cleanups from Dave Hansen <haveblue@us.ibm.com>
SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to
mem_sections. This two level layout scheme is able to achieve smaller
memory requirements for SPARSEMEM with the tradeoff of an additional shift
and load when fetching the memory section. The current SPARSEMEM
implementation is a one dimensional array of mem_sections which is the
default SPARSEMEM configuration. The patch attempts isolates the
implementation details of the physical layout of the sparsemem section
array.
SPARSEMEM_EXTREME requires bootmem to be functioning at the time of
memory_present() calls. This is not always feasible, so architectures
which do not need it may allocate everything statically by using
SPARSEMEM_STATIC.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A new option for SPARSEMEM is ARCH_SPARSEMEM_EXTREME. Architecture
platforms with a very sparse physical address space would likely want to
select this option. For those architecture platforms that don't select the
option, the code generated is equivalent to SPARSEMEM currently in -mm.
I'll be posting a patch on ia64 ml which uses this new SPARSEMEM feature.
ARCH_SPARSEMEM_EXTREME makes mem_section a one dimensional array of
pointers to mem_sections. This two level layout scheme is able to achieve
smaller memory requirements for SPARSEMEM with the tradeoff of an
additional shift and load when fetching the memory section. The current
SPARSEMEM -mm implementation is a one dimensional array of mem_sections
which is the default SPARSEMEM configuration. The patch attempts isolates
the implementation details of the physical layout of the sparsemem section
array.
ARCH_SPARSEMEM_EXTREME depends on 64BIT and is by default boolean false.
I've boot tested under aim load ia64 configured for ARCH_SPARSEMEM_EXTREME.
I've also boot tested a 4 way Opteron machine with !ARCH_SPARSEMEM_EXTREME
and tested with aim.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Defer copying of ptes until fault time when it is possible to reconstruct
the pte from backing store. Idea from Andi Kleen and Nick Piggin.
Thanks to input from Rik van Riel and Linus and to Hugh for correcting
my blundering.
Ray Fucillo <fucillo@intersystems.com> reports:
"I applied this latest patch to a 2.6.12 kernel and found that it does
resolve the problem. Prior to the patch on this machine, I was
seeing about 23ms spent in fork for ever 100MB of shared memory
segment.
After applying the patch, fork is taking about 1ms regardless of the
shared memory size."
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This bug could cause oopses and page state corruption, because ncpfs
used the generic page-cache symlink handlign functions. But those
functions only work if the page cache is guaranteed to be "stable", ie a
page that was installed when the symlink walk was started has to still
be installed in the page cache at the end of the walk.
We could have fixed ncpfs to not use the generic helper routines, but it
is in many ways much cleaner to instead improve on the symlink walking
helper routines so that they don't require that absolute stability.
We do this by allowing "follow_link()" to return a error-pointer as a
cookie, which is fed back to the cleanup "put_link()" routine. This
also simplifies NFS symlink handling.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a crash in the hugepage code. unmap_hugepage_area() was
assuming that (due to prefault) PTEs must exist for all the area in
question. However, this may not be the case, if mmap() encounters an error
before the prefault and calls unmap_region() to clean up any partial
mapping.
Depending on the hugepage configuration, this crash can be triggered by an
unpriveleged user.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have found what seems to be a small bug in __vm_enough_memory() when
sysctl_overcommit_memory is set to OVERCOMMIT_NEVER.
When this bug occurs the systems fails to boot, with /sbin/init whining
about fork() returning ENOMEM.
We hunted down the problem to this:
The deferred update mecanism used in vm_acct_memory(), on a SMP system,
allows the vm_committed_space counter to have a negative value.
This should not be a problem since this counter is known to be inaccurate.
But in __vm_enough_memory() this counter is compared to the `allowed'
variable, which is an unsigned long. This comparison is broken since it
will consider the negative values of vm_committed_space to be huge positive
values, resulting in a memory allocation failure.
Signed-off-by: <Jean-Marc.Saffroy@ext.bull.net>
Signed-off-by: <Simon.Derr@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mremap's move_vma is applying __vm_stat_account to the old vma which may
have already been freed: move it to just before the do_munmap.
mremapping to and fro with CONFIG_DEBUG_SLAB=y showed /proc/<pid>/status
VmSize and VmData wrapping just like in kernel bugzilla #4842, and fixed by
this patch - worth including in 2.6.13, though not yet confirmed that it
fixes that specific report from Frank van Maarseveen.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and
has to be treated as such by get_user_pages().
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Checking pte_dirty instead of pte_write in __follow_page is problematic
for s390, and for copy_one_pte which leaves dirty when clearing write.
So revert __follow_page to check pte_write as before, and make
do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has
done its full job: once get_user_pages receives this value, it no longer
requires pte_write in __follow_page.
But most callers of handle_mm_fault, in the various architectures, have
switch statements which do not expect this new case. To avoid changing
them all in a hurry, make an inline wrapper function (using the old
name) that masks off the new bit, and use the extended interface with
double underscores.
Yes, we do have a call to do_wp_page from do_swap_page, but no need to
change that: in rare case it's needed, another do_wp_page will follow.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
[ Cleanups by Nick Piggin ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A kernel BUG() is triggered by a call to set_mempolicy() with a negative
first argument. This is because the mode is declared as an int, and the
validity check doesnt check < 0 values. Alternatively, mode could be
declared as unsigned int or unsigned long.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 has a large sparse gate area between VSYSCALL_START and
VSYSCALL_END, not all of it presently backed by pmds. Alexander Nyberg has
found that in some circumstances gdb may try to ptrace here, and hit
get_user_pages BUG_ON. It seems odd that gdb should be accessing here, but
it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT. Fixes
kernel bugzilla #4801.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There's no real guarantee that handle_mm_fault() will always be able to
break a COW situation - if an update from another thread ends up
modifying the page table some way, handle_mm_fault() may end up
requiring us to re-try the operation.
That's normally fine, but get_user_pages() ended up re-trying it as a
read, and thus a write access could in theory end up losing the dirty
bit or be done on a page that had not been properly COW'ed.
This makes get_user_pages() always retry write accesses as write
accesses by making "follow_page()" require that a writable follow has
the dirty bit set. That simplifies the code and solves the race: if the
COW break fails for some reason, we'll just loop around and try again.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are iterating over all nodes in nr_free_zone_pages(). Because the
fallback zonelists contain all nodes in the system, and we walk all the
zonelists, we're counting memory multiple times (once for each node). This
caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes
all the dirty ratio calculations, etc incorrect.
There's still a further bug to fix from e820 holes causing overestimation
as well, but this fix is separate, and good as is, and fixes one class of
problems. Problem found by Badari, and tested by Ram Pai - thanks!
Signed-off-by: Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by: Matt Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Originally __free_pages_bulk used the relative page number within a zone to
define its buddies. This meant that to maintain the "maximally aligned"
requirements (that an allocation of size N will be aligned at least to N
physically) zones had to also be aligned to 1<<MAX_ORDER pages. When
__free_pages_bulk was updated to use the relative page frame numbers of the
free'd pages to pair buddies this released the alignment constraint on the
'left' edge of the zone. This allows _either_ edge of the zone to contain
partial MAX_ORDER sized buddies. These simply never will have matching
buddies and thus will never make it to the 'top' of the pyramid.
The patch below removes a now redundant check ensuring that the mem_map was
aligned to MAX_ORDER.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The madvise() system call returns -EBADF for areas which does not map to
files, only for *behaviour* request MADV_WILLNEED.
According to man pages, madvise returns :
EBADF - the map exists, but the area maps something that isn't a file.
Fixes bug 2995.
Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix bug identifued by Richard Purdie <rpurdie@rpsys.net>.
oprofile calls check_user_page_readable() from interrupt context, so we
deadlock over various VFS locks.
But check_user_page_readable() doesn't imply either a read or a write of the
page's contents. Change __follow_page() so that check_user_page_readable()
can tell __follow_page() that we're not accessing the page's contents, and use
that info to avoid the troublesome lock-takings.
Also, make follow_page() inline for the single callsite in memory.c to save a
bit of stack space.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch includes feedback from Andrew and Christoph. Thanks for
taking time to review.
Use of empty_zero_page was eliminated to fix compilation for architectures
that don't have it.
This patch removes setting pages up-to-date in ext2_get_xip_page and all
bug checks to verify that the page is indeed up to date. Setting the page
state on mapping to userland is bogus. None of the code patchs involved
with these pages in mm cares about the page state.
still on my ToDo list: identify a place outside second extended where
__inode_direct_access should reside
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mm/filemap_xip.c: In function `__xip_unmap':
mm/filemap_xip.c:194: request for member `pte' in something not a structure or union
Apparently pte_pfn() takes a pte_t, not a pointer to a pte_t. From looking
at asm/page.h, it seems to be the same on ia32 or ppc (iff
STRICT_MM_TYPECHECKS is enabled, which is disabled by default on ppc).
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We now print statistics when invoking the OOM killer, however this
information is not rate limited and you can get into situations where the
console is continually spammed.
For example, when a task is exiting the OOM killer will simply return
(waiting for that task to exit and clear up memory). If the VM continually
calls back into the OOM killer we get thousands of copies of show_mem() on
the console.
Use printk_ratelimit() to quieten it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove completly bogus comment from did_some_progress != 0 handling (that
same comment is a few lines below on did_some_progress = 0 case, where it
belongs).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch used to be in Andrew's tree before the NUMA slab allocator went
in. Either this patch or the NUMA slab allocator is needed in order for
kmalloc_node to work correctly.
pcibus_to_node may be used to generate the node information passed to
kmalloc_node. pcibus_to_node returns -1 if it was not able to determine
on which node a pcibus is located. For that case kmalloc_node must
work like kmalloc.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I spotted this issue while in memmap_init last week. I can't say the
change has any test coverage by me. start_pfn was formerly used in main
"for" loop. The fix is replace start_pfn with pfn.
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes use of ALIGN() to remove duplicate round-up code.
Signed-off-by: Nick Wilson <njw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch retrieves the max_pfn being used by previous kernel and stores it
in a safe location (saved_max_pfn) before it is overwritten due to user
defined memory map. This pfn is used to make sure that user does not try to
read the physical memory beyond saved_max_pfn.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here is the fix for the problem described in
http://bugzilla.kernel.org/show_bug.cgi?id=4721
Basically, problem is generic_file_buffered_write() is accessing beyond end
of the iov[] vector after handling the last vector. If we happen to cross
page boundary, we get a fault.
I think this simple patch is good enough. If we really don't want to
depend on the "count", then we need pass nr_segs to
filemap_set_next_iovec() and decrement it and check it.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
CONFIG_PM_DISK is long gone, but it still managed to survived at few
places.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Out-of-tree user of remap_pfn_range hit kernel BUG at mm/memory.c:1112! It
passes an unrounded size to remap_pfn_range, which was okay before 2.6.12,
but misses remap_pte_range's new end condition. An audit of all the other
ptwalks confirms that this is the only one so exposed.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a bug on error handling in the direct I/O function.
Currently, if a file is opened with the O_DIRECT|O_SYNC flag, the write()
syscall cannot receive the EIO error after an I/O error (SCSI cable is
disconnected etc.).
Return values of other points that call generic_osync_inode() are treated
appropriately.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch reworks filemap_xip.c with the goal to reduce code duplication
from mm/filemap.c. It applies agains 2.6.12-rc6-mm1. Instead of
implementing the aio functions, this one implements the synchronous
read/write functions only. For readv and writev, the generic fallback is
used. For aio, we rely on the application doing the fallback. Since our
"synchronous" function does memcpy immediately anyway, there is no
performance difference between using the fallbacks or implementing each
operation.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- generic_file* file operations do no longer have a xip/non-xip split
- filemap_xip.c implements a new set of fops that require get_xip_page
aop to work proper. all new fops are exported GPL-only (don't like to
see whatever code use those except GPL modules)
- __xip_unmap now uses page_check_address, which is no longer static
in rmap.c, and defined in linux/rmap.h
- mm/filemap.h is now much more clean, plainly having just Linus'
inline funcs moved here from filemap.c
- fix includes in filemap_xip to make it build cleanly on i386
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch updates some comments to match code changes.
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch removes the f_error field and all checks of f_error.
Trond said:
f_error was introduced for NFS, and made sense when we were guaranteed
always to have a file pointer around when write errors occurred. Since
then, we have (for various reasons) had to introduce the nfs_open_context in
order to track the file read/write state, and it made sense to move our
f_error tracking there too.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's a small patch to improve the performance of mempool_alloc by only
initializing the wait queue when we're about to wait.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes redundant VM_ClearReadHint from mm/madvice.c which was
left there by Prasanna's patch.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.
Most of the changes come from the sound and net subsystems. The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.
I left UML alone for now because I would need more time to read the code
carefully before making changes there.
Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch to allocate the control structures for for ide devices on the node of
the device itself (for NUMA systems). The patch depends on the Slab API
change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
posted today.
Does some realignment too.
Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org>
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pravin Shelar <pravin@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sparse's initalization be accessible at runtime. This allows sparse
mappings to be created after boot in a hotplug situation.
This patch is separated from the previous one just to give an indication how
much of the sparse infrastructure is *just* for hotplug memory.
The section_mem_map doesn't really store a pointer. It stores something that
is convenient to do some math against to get a pointer. It isn't valid to
just do *section_mem_map, so I don't think it should be stored as a pointer.
There are a couple of things I'd like to store about a section. First of all,
the fact that it is !NULL does not mean that it is present. There could be
such a combination where section_mem_map *is* NULL, but the math gets you
properly to a real mem_map. So, I don't think that check is safe.
Since we're storing 32-bit-aligned structures, we have a few bits in the
bottom of the pointer to play with. Use one bit to encode whether there's
really a mem_map there, and the other one to tell whether there's a valid
section there. We need to distinguish between the two because sometimes
there's a gap between when a section is discovered to be present and when we
can get the mem_map for it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The part of the sparsemem patch which modifies memmap_init_zone() has recently
become a problem. It changes behavior so that there is a call to
pfn_to_page() for each individual page inside of a node's range:
node_start_pfn through node_end_pfn. It used to simply do this once, at the
beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
of a node made it necessary to change.
Mike Kravetz recently wrote a patch which made the NUMA code accept some new
kinds of layouts. The system's memory was laid out like this, with node 0's
memory in two pieces: one before and one after node 1's memory:
Node 0: +++++ +++++
Node 1: +++++
Previous behavior before Mike's patch was to assign nodes like this:
Node 0: 00000 XXXXX
Node 1: 11111
Where the 'X' areas were simply thrown away. The new behavior was to make the
pg_data_t span node 0 across all of its areas, including areas that are really
node 1's: Node 0: 000000000000000 Node 1: 11111
This wastes a little bit of mem_map space, but ends up being OK, and more
fully utilizes the system's memory. memmap_init_zone() initializes all of the
"struct page"s for node 0, even for the "hole", but those never get used,
because there is no pfn_to_page() that resolves to those pages. However, only
calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
allocated for node0->node_mem_map because:
struct page *start = pfn_to_page(start_pfn);
// effectively start = &node->node_mem_map[0]
for (page = start; page < (start + size); page++) {
init_page_here();...
page++;
}
Slow, and wasteful, but generally harmless.
But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
does):
for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
page = pfn_to_page(pfn);
}
And you end up trying to initialize node 1's pages too early, along with bogus
data from node 0. This patch checks for those weird layouts and declines to
touch the pages, making the more frequent pfn_to_page() calls OK to do.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow architectures to indicate that they will be providing hooks to indice
installed memory areas, memory_present(). Provide prototypes for the i386
implementation.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This gives DISCONTIGMEM a bit more help text to explain what it does, not just
when to choose it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I got some feedback from users who think that the new "Memory Model" menu is a
little invasive. This patch will hide that menu, except when
CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it.
An individual arch may want to enable it because they've removed their
arch-specific DISCONTIG prompt in favor of the mm/Kconfig one.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor
user interaction issue, and an early reference to SPARSEMEM.
This "choice" menu would always default to FLATMEM, as it was listed first.
Move it to the end so that the other defaults have a chance first.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is some confusion that arose when working on SPARSEMEM patch between
what is needed for DISCONTIG vs. NUMA.
Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
All of the current NUMA implementations require an implementation of
DISCONTIG. Because of this, quite a lot of code which is really needed for
NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
case.
Introducing this new NEED_MULTIPLE_NODES config option allows code that is
needed for both NUMA or DISCONTIG to be separated out from code that is
specific to DISCONTIG.
One great advantage of this approach is that it doesn't require every
architecture to be converted over. All of the current implementations
should "just work", only the ones implementing SPARSEMEM will have to be
fixed up.
The change to free_area_init() makes it work inside, or out of the new
config option.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With sparsemem being introduced, we need a central place for new
memory-related .config options: mm/Kconfig. This allows us to remove many
of the duplicated arch-specific options.
The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and
DISCONTIGMEM. This is a requirement for sparsemem because sparsemem uses
the NUMA code without the presence of DISCONTIGMEM. The sparsemem patches
use CONFIG_FLATMEM in generic code, so this patch is a requirement before
applying them.
Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use
'#ifdef CONFIG_FLATMEM' instead.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Generify the value fields in the page_flags. The aim is to allow the location
and size of these fields to be varied. Additionally we want to move away from
fixed allocations per field whilst still enforcing the overall bit utilisation
limits. We rely on the compiler to spot and optimise the accessor functions.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a simple allocator for the NUMA remap space. This space is very
scarce, used for structures which are best allocated node local.
This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
the pgdat->node_mem_map initialized in a consistent place for all
architectures.
Issues:
o alloc_remap takes a node_id where we might expect a pgdat which was intended
to allow us to allocate the pgdat's using this mechanism; which we do not yet
do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot_pageset needs to be preserved for hotplugging and for off line
processors and nodes. Otherwise pointers will point into memory that has
now a different use. /proc/zoneinfo is currently showing strange results
if processors / nodes are not present.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's common practice to msync a large address range regularly, in which
often only a few ptes have actually been dirtied since the previous pass.
sync_pte_range then goes much faster if it tests whether pte is dirty
before locating and accessing each struct page cacheline; and it is hardly
slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
when every pte actually is dirty.
But beware, s390's pte_dirty always says false, since its dirty bit is kept
in the storage key, located via the struct page address. So skip this
optimization in its case: use a pte_maybe_dirty macro which just says true
if page_test_and_clear_dirty is implemented.
Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remember that ironic get_user_pages race? when the raised page_count on a
page swapped out led do_wp_page to decide that it had to copy on write, so
substituted a different page into userspace. 2.6.7 onwards have Andrea's
solution, where try_to_unmap_one backs out if it finds page_count raised.
Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
and was found a few months ago to hang an intensive page migration test. A
year ago I was hesitant to engage page_mapcount, now it seems the right fix.
So remove the page_count hack from try_to_unmap_one; and use activate_page in
unuse_mm when dropping lock, to replace its secondary effect of helping
swapoff to make progress in that case.
Simplify can_share_swap_page (now called only on anonymous pages) to check
page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
their (pessimistic) sum, but does not need swapper_space.tree_lock for that.
In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
keep sum on the high side, and correct when can_share_swap_page called.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A small optimization to do_wp_page's check for whether to avoid copy by
reusing the page already mapped. It can never share a cached file page,
nor can it share a reserved page (often the empty zero page), so it's a
waste of time to lock and unlock in those cases. Which nowadays can both
be neatly excluded by a preliminary PageAnon test.
Christoph has reported that a preliminary page_count test proved valuable
for scalability here, but PageAnon covers more common cases all at once.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since its birth, get_user_pages has been calling a misguided get_page_map
function. follow_page has already returned NULL if the pfn is invalid, we
cannot reach an invalid pfn from a validated struct page.
Remove get_page_map, and the messy rewind in get_user_pages to cope with
its failure. Oh, and could we please call that "struct page *page" like
everywhere else, instead of "struct page *map"?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Strict mbind's check for currently mapped pages being on node has been
using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry:
replace that by a standard four-level page table walk like others in mm.
Since mmap_sem is held for writing, page_table_lock can be taken at the
inner level to limit latency.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Strict mbind's check that pages already mapped are on right node has been
using pte_page without checking if pfn_valid, and without page_table_lock
to prevent spurious failures when try_to_unmap_one intervenes between the
pte_present and the pte_page.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To improve shmem scalability, we allowed tmpfs instances which don't need
their blocks or inodes limited not to count them, and not to allocate any
sbinfo. Which was okay when the only use for the sbinfo was accounting
blocks and inodes; but since then a couple of unrelated projects extending
tmpfs want to store other data in the sbinfo. Whether either extension
reaches mainline is beside the point: I'm guilty of a bad design decision,
and should restore sbinfo to make any such future extensions easier.
So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
inodes. Brent Casavant verified (many months ago) that this does not
perceptibly impact the scalability (since the unlimited sbinfo cacheline is
repeatedly accessed but only once dirtied).
And merge shmem_set_size into its sole caller shmem_remount_fs.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset. Use one specially
configured pageset per cpu for all zones and nodes during bootup.
- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
in free_hot_cold_page (otherwise we would have to hold a page
in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
for the boot pageset. A boot pageset is not necessary for cpu hotplug.
Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The pageset array can potentially acquire a huge amount of memory on large
NUMA systems. F.e. on a system with 512 processors and 256 nodes there
will be 256*512 pagesets. If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets. The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets. Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.
The slab allocator flushes its per cpu caches every 2 seconds. The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.
The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides more debug info when the system is OOM. It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.
Thanks to Dave Jones for coming up with the idea.
Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument. This saves 1 instruction on x86-64.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument. This saves 1 instruction on x86-64.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0. The following patch removes the argument and updates all the
calls to try_to_free_pages.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The topdown changes in 2.6.12-rc1 can cause large allocations with large
stack limit to fail, despite there being space available. The
mmap_base-len is only valid when len >= mmap_base. However, nothing in
topdown allocator checks this. It's only (now) caught at higher level,
which will cause allocation to simply fail. The following change restores
the fallback to bottom-up path, which will allow large allocations with
large stack limit to potentially still succeed.
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>