OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Christoph Lameter	8695949a1d	[PATCH] Thin out scan_control: remove nr_to_scan and priority Make nr_to_scan and priority a parameter instead of putting it into scan control. This allows various small optimizations and IMHO makes the code easier to read. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:59 -08:00
Andrew Morton	a07fa3944b	[PATCH] slab: use on_each_cpu() Slab duplicates on_each_cpu(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:59 -08:00
Christoph Lameter	ac2b898ca6	[PATCH] slab: Remove SLAB_NO_REAP option SLAB_NO_REAP is documented as an option that will cause this slab not to be reaped under memory pressure. However, that is not what happens. The only thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused slab elements that were allocated in batch in cache_reap(). Cache_reap() is run every few seconds independently of memory pressure. Could we remove the whole thing? Its only used by three slabs anyways and I cannot find a reason for having this option. There is an additional problem with SLAB_NO_REAP. If set then the recovery of objects from alien caches is switched off. Objects not freed on the same node where they were initially allocated will only be reused if a certain amount of objects accumulates from one alien node (not very likely) or if the cache is explicitly shrunk. (Strangely __cache_shrink does not check for SLAB_NO_REAP) Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:59 -08:00
Randy Dunlap	911851e6ee	[PATCH] slab: fix kernel-doc warnings Fix kernel-doc warnings in mm/slab.c. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:59 -08:00
Pekka Enberg	fcc234f888	[PATCH] mm: kill kmem_cache_t usage We have struct kmem_cache now so use it instead of the old typedef. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Ravikiran G Thirumalai	b5d8ca7c50	[PATCH] slab: remove cachep->spinlock Remove cachep->spinlock. Locking has moved to the kmem_list3 and most of the structures protected earlier by cachep->spinlock is now protected by the l3->list_lock. slab cache tunables like batchcount are accessed always with the cache_chain_mutex held. Patch tested on SMP and NUMA kernels with dbench processes running, constant onlining/offlining, and constant cache tuning, all at the same time. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Andrew Morton	a737b3e2fc	[PATCH] slab cleanup slab.c has become a bit revolting again. Try to repair it. - Coding style fixes - Don't do assignments-in-if-statements. - Don't typecast assignments to/from void* Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Pekka Enberg	f30cf7d13e	[PATCH] slab: extract setup_cpu_cache Extract setup_cpu_cache() function from kmem_cache_create() to make the latter a little less complex. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Pekka Enberg	8fea4e96a8	[PATCH] slab: object to index mapping cleanup Clean up the object to index mapping that has been spread around mm/slab.c. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Nick Piggin	a482289d46	[PATCH] hugepage allocator cleanup Insert "fresh" huge pages into the hugepage allocator by the same means as they are freed back into it. This reduces code size and allows enqueue_huge_page to be inlined into the hugepage free fastpath. Eliminate occurances of hugepages on the free list with non-zero refcount. This can allow stricter refcount checks in future. Also required for lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> "This patch also eliminates a leak "cleaned up" by re-clobbering the refcount on every allocation from the hugepage freelists. With respect to the lockless pagecache, the crucial aspect is to eliminate unconditional set_page_count() to 0 on pages with potentially nonzero refcounts, though closer inspection suggests the assignments removed are entirely spurious." Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Nick Piggin	545b1ea9bf	[PATCH] mm: cleanup bootmem The bootmem code added to page_alloc.c duplicated some page freeing code that it really doesn't need to because it is not so performance critical. While we're here, make prefetching work properly by actually prefetching the page we're about to use before prefetching ahead to the next one (ie. get the most important transaction started first). Also prefetch just a single page ahead rather than leaving a gap of 16. Jack Steiner reported no problems with SGI's ia64 simulator. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:58 -08:00
Nick Piggin	8dfcc9ba27	[PATCH] mm: split highorder pages Have an explicit mm call to split higher order pages into individual pages. Should help to avoid bugs and be more explicit about the code's intention. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:57 -08:00
Nick Piggin	7c8ee9a863	[PATCH] mm: simplify vmscan vs release refcounting The VM has an interesting race where a page refcount can drop to zero, but it is still on the LRU lists for a short time. This was solved by testing a 0->1 refcount transition when picking up pages from the LRU, and dropping the refcount in that case. Instead, use atomic_add_unless to ensure we never pick up a 0 refcount page from the LRU, thus a 0 refcount page will never have its refcount elevated until it is allocated again. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:57 -08:00
Nick Piggin	f205b2fe62	[PATCH] mm: slab less atomics Atomic operation removal from slab Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:57 -08:00
Nick Piggin	5e9dace8d3	[PATCH] mm: page_alloc less atomics More atomic operation removal from page allocator Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:57 -08:00
Nick Piggin	674539115c	[PATCH] mm: less atomic ops In the page release paths, we can be sure that nobody will mess with our page->flags because the refcount has dropped to 0. So no need for atomic operations here. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:57 -08:00
Nick Piggin	4c84cacfa4	[PATCH] mm: PageActive no testset PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear operations. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:57 -08:00
Nick Piggin	8d438f96d2	[PATCH] mm: PageLRU no testset PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear operations. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:56 -08:00
Nick Piggin	46453a6e19	[PATCH] mm: never ClearPageLRU released pages If vmscan finds a zero refcount page on the lru list, never ClearPageLRU it. This means the release code need not hold ->lru_lock to stabilise PageLRU, so that lock may be skipped entirely when releasing !PageLRU pages (because we know PageLRU won't have been temporarily cleared by vmscan, which was previously guaranteed by holding the lock to synchronise against vmscan). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:56 -08:00
Andrew Morton	b40607fc02	[PATCH] __get_page_state() cpumask cleanup and fix __get_page_state() has an open-coded for_each_cpu_mask() loop in it. Tidy that up, then notice that the code was buggy: while (cpu < NR_CPUS) { unsigned long in, out, off; if (!cpu_isset(cpu, *cpumask)) continue; an obvious infinite loop. I guess we just never call it with a holey cpu mask. Even after my cpumask size-reduction work, this patch increases code size :( Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-22 07:53:55 -08:00
Hugh Dickins	6f5e6b9e69	[PATCH] fix free swap cache latency Lee Revell reported 28ms latency when process with lots of swapped memory exits. 2.6.15 introduced a latency regression when unmapping: in accounting the zap_work latency breaker, pte_none counted 1, pte_present PAGE_SIZE, but a swap entry counted nothing at all. We think of pages present as the slow case, but Lee's trace shows that free_swap_and_cache's radix tree lookup can make a lot of work - and we could have been doing it many thousands of times without a latency break. Move the zap_work update up to account swap entries like pages present. This does account non-linear pte_file entries, and unmap_mapping_range skipping over swap entries, by the same amount even though they're quick: but neither of those cases deserves complicating the code (and they're treated no worse than they were in 2.6.14). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-17 07:51:26 -08:00
Christoph Lameter	5b40dc780e	[PATCH] fix race in pagevec_strip? We can call try_to_release_page() with PagePrivate off and a valid page->mapping This may cause all sorts of trouble for the filesystem *_releasepage() handlers. XFS bombs out in that case. Lock the page before checking for page private. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-17 07:51:25 -08:00
Christoph Lameter	90036ee593	[PATCH] page migration: Fail with error if swap not setup Currently the migration of anonymous pages will silently fail if no swap is setup. This patch makes page migration functions check for available swap and fail with -ENODEV if no swap space is available. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-17 07:51:25 -08:00
Christoph Lameter	74c0024105	[PATCH] Consistent capabilites associated with MPOL_MOVE_ALL It seems that setting scheduling policy and priorities is also the kind of thing that might be performed in apps that also use the NUMA API, so it would seem consistent to use CAP_SYS_NICE for NUMA also. So use CAP_SYS_NICE for controlling migration permissions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-14 21:43:02 -08:00
Christoph Lameter	4983da07f1	[PATCH] page migration: fail if page is in a vma flagged VM_LOCKED page migration currently simply retries a couple of times if try_to_unmap() fails without inspecting the return code. However, SWAP_FAIL indicates that the page is in a vma that has the VM_LOCKED flag set (if ignore_refs ==1). We can check for that return code and avoid retrying the migration. migrate_page_remove_references() now needs to return a reason why the failure occured. So switch migrate_page_remove_references to use -Exx style error messages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-14 21:43:02 -08:00
Christoph Lameter	8fce4d8e3b	[PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. The cache reaper currently tries to free all alien caches and all remote per cpu pages in each pass of cache_reap. For a machines with large number of nodes (such as Altix) this may lead to sporadic delays of around ~10ms. Interrupts are disabled while reclaiming creating unacceptable delays. This patch changes that behavior by adding a per cpu reap_node variable. Instead of attempting to free all caches, we free only one alien cache and the per cpu pages from one remote node. That reduces the time spend in cache_reap. However, doing so will lengthen the time it takes to completely drain all remote per cpu pagesets and all alien caches. The time needed will grow with the number of nodes in the system. All caches are drained when they overflow their respective capacity. So the drawback here is only that a bit of memory may be wasted for awhile longer. Details: 1. Rename drain_remote_pages to drain_node_pages to allow the specification of the node to drain of pcp pages. 2. Add additional functions init_reap_node, next_reap_node for NUMA that manage a per cpu reap_node counter. 3. Add a reap_alien function that reaps only from the current reap_node. For us this seems to be a critical issue. Holdoffs of an average of ~7ms cause some HPC benchmarks to slow down significantly. F.e. NAS parallel slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with the additional interrupt holdoff reductions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-09 19:47:38 -08:00
Yasunori Goto	f2937be589	[PATCH] memory hotadd: pgdat->node_present_pages fix When pages are onlined, not only zone->present_pages but also pgdat->node_present_pages should be refreshed. This parameter is used to show information at /sys/device/system/node/nodeX/meminfo via si_meminfo_node(). So, it shows strange value for MemUsed which is calculated (node_present_pages - all zones free pages). Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-09 19:47:38 -08:00
Christoph Lameter	a6bf527091	[PATCH] vmscan: no zone_reclaim if PF_MALLOC is set If the process has already set PF_MALLOC and is already using current->reclaim_state then do not try to reclaim memory from the zone. This is set by kswapd and/or synchrononous global reclaim which will not take it lightly if we zap the reclaim_state. Signed-off-by: Christoph Lameter <clameter@sig.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-09 19:47:37 -08:00
Hugh Dickins	85a6cd03a9	[PATCH] page_add_file_rmap(): remove BUG_ON()s Remove two early-development BUG_ONs from page_add_file_rmap. The pfn_valid test (originally useful for checking that nobody passed an artificial struct page) comes too late, since we already have the struct page. The PageAnon test (useful when anon was first distinguished from file rmap) prevents ->nopage implementations from reusing ->mapping, which would otherwise be available. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-09 19:47:36 -08:00
Jack Steiner	07ed76b2a0	[PATCH] slab: allocate larger cache_cache if order 0 fails kmem_cache_init() incorrectly assumes that the cache_cache object will fit in an order 0 allocation. On very large systems, this is not true. Change the code to try larger order allocations if order 0 fails. Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-08 14:15:04 -08:00
Andrew Morton	e2bab3d924	[PATCH] percpu_counter_sum() Implement percpu_counter_sum(). This is a more accurate but slower version of percpu_counter_read_positive(). We need this for Alex's speedup-ext3_statfs patch and for the nr_file accounting fix. Otherwise these things would be too inaccurate on large CPU counts. Cc: Ravikiran G Thirumalai <kiran@scalex86.org> Cc: Alex Tomas <alex@clusterfs.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-08 14:14:01 -08:00
Andrew Morton	7f709ed0e3	[PATCH] numa_maps-update fix Fix the mm/mempolicy.c build for !CONFIG_HUGETLB_PAGE. Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin Bligh <mbligh@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-08 14:14:00 -08:00
Linus Torvalds	f78bb8ad48	slab: fix calculate_slab_order() for SLAB_RECLAIM_ACCOUNT Instead of having a hard-to-read and confusing conditional in the caller, just make the slab order calculation handle this special case, since it's simple and obvious there. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-08 10:33:05 -08:00
Christoph Lameter	397874dfe9	[PATCH] numa_maps update Change the format of numa_maps to be more compact and contain additional information that is useful for managing and troubleshooting memory on a NUMA system. Numa_maps can now also support huge pages. Fixes: 1. More compact format. Only display fields if they contain additional information. 2. Always display information for all vmas. The old numa_maps did not display vma with no mapped entries. This was a bit confusing because page migration removes ptes for file backed vmas. After page migration a part of the vmas vanished. 3. Rename maxref to maxmap. This is the maximum mapcount of all the pages in a vma and may be used as an indicator as to how many processes may be using a certain vma. 4. Include the ability to scan over huge page vmas. New items shown: dirty Number of pages in a vma that have either the dirty bit set in the page_struct or in the pte. file=<filename> The file backing the pages if any stack Stack area heap Heap area huge Huge page area. The number of pages shows is the number of huge pages not the regular sized pages. swapcache Number of pages with swap references. Must be >0 in order to be shown. active Number of active pages. Only displayed if different from the number of pages mapped. writeback Number of pages under writeback. Only displayed if >0. Sample ouput of a process using huge pages: 00000000 default 2000000000000000 default file=/lib/ld-2.3.90.so mapped=13 mapmax=30 N0=13 2000000000044000 default file=/lib/ld-2.3.90.so anon=2 dirty=2 swapcache=2 N2=2 2000000000064000 default file=/lib/librt-2.3.90.so mapped=2 active=1 N1=1 N3=1 2000000000074000 default file=/lib/librt-2.3.90.so 2000000000080000 default file=/lib/librt-2.3.90.so anon=1 swapcache=1 N2=1 2000000000084000 default 2000000000088000 default file=/lib/libc-2.3.90.so mapped=52 mapmax=32 active=48 N0=52 20000000002bc000 default file=/lib/libc-2.3.90.so 20000000002c8000 default file=/lib/libc-2.3.90.so anon=3 dirty=2 swapcache=3 active=2 N1=1 N2=2 20000000002d4000 default anon=1 swapcache=1 N1=1 20000000002d8000 default file=/lib/libpthread-2.3.90.so mapped=8 mapmax=3 active=7 N2=2 N3=6 20000000002fc000 default file=/lib/libpthread-2.3.90.so 2000000000308000 default file=/lib/libpthread-2.3.90.so anon=1 dirty=1 swapcache=1 N1=1 200000000030c000 default anon=1 dirty=1 swapcache=1 N1=1 2000000000320000 default anon=1 dirty=1 N1=1 200000000071c000 default 2000000000720000 default anon=2 dirty=2 swapcache=1 N1=1 N2=1 2000000000f1c000 default 2000000000f20000 default anon=2 dirty=2 swapcache=1 active=1 N2=1 N3=1 200000000171c000 default 2000000001720000 default anon=1 dirty=1 swapcache=1 N1=1 2000000001b20000 default 2000000001b38000 default file=/lib/libgcc_s.so.1 mapped=2 N1=2 2000000001b48000 default file=/lib/libgcc_s.so.1 2000000001b54000 default file=/lib/libgcc_s.so.1 anon=1 dirty=1 active=0 N1=1 2000000001b58000 default file=/lib/libunwind.so.7.0.0 mapped=2 active=1 N1=2 2000000001b74000 default file=/lib/libunwind.so.7.0.0 2000000001b80000 default file=/lib/libunwind.so.7.0.0 2000000001b84000 default 4000000000000000 default file=/media/huge/test9 mapped=1 N1=1 6000000000000000 default file=/media/huge/test9 anon=1 dirty=1 active=0 N1=1 6000000000004000 default heap 607fffff7fffc000 default anon=1 dirty=1 swapcache=1 N2=1 607fffffff06c000 default stack anon=1 dirty=1 active=0 N1=1 8000000060000000 default file=/mnt/huge/test0 huge dirty=3 N1=3 8000000090000000 default file=/mnt/huge/test1 huge dirty=3 N0=1 N2=2 80000000c0000000 default file=/mnt/huge/test2 huge dirty=3 N1=1 N3=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-06 18:40:45 -08:00
Linus Torvalds	9888e6fa7b	slab: clarify and fix calculate_slab_order() If we triggered the 'offslab_limit' test, we would return with cachep->gfporder incremented once too many times. This clarifies the logic somewhat, and fixes that bug. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-06 17:44:43 -08:00
Linus Torvalds	264132bc62	Fix "check_slabp" printout size calculation We want to use the "struct slab" size, not the size of the pointer to same. As it is, we'd not print out the last <n> entry pointers in the slab (where <n> is ~10, depending on whether it's a 32-bit or 64-bit kernel). Gaah, that slab code was written by somebody who likes unreadable crud. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-06 12:10:07 -08:00
Christoph Lameter	a57ebfdb2c	[PATCH] numa_maps: Fix potential crash on non IA64 platforms numa_maps should not scan over huge vmas in order not to cause problems for non IA64 platforms that may have pte entries pointing to huge pages in a variety of ways in their page tables. Add a simple check to ignore vmas containing huge pages. Signed-off-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-02 08:33:07 -08:00
Andrew Morton	140ffcec4d	[PATCH] out_of_memory() locking fix I seem to have lost this read_unlock(). While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending(). (Again. We seem to have a habit of doing this). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-02 08:33:07 -08:00
Andrew Morton	d6713e0463	[PATCH] out_of_memory(): use of uninitialised Under some circumstances `points' can get printed before it's initialised. Spotted by Carlos Martin <carlos@cmartin.tk>. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-28 20:53:44 -08:00
Andrew Morton	f61388822a	[PATCH] nommu: implement vmalloc_node() Fix oprofile linkage. Pointed out by "Luke Yang" <luke.adi@gmail.com>. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-28 20:53:44 -08:00
Christoph Lameter	e8788c0cce	[PATCH] remove_from_swap: fix locking remove_from_swap() currently attempts to use page_lock_anon_vma to obtain an anon_vma lock. That is not working since the page may have been remapped via swap ptes in order to move the page. However, do_migrate_pages() obtain the mmap_sem lock and therefore there is a guarantee that the anonymous vma will not vanish from under us. There is therefore no need to use page_lock_anon_vma. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-28 20:53:44 -08:00
Christoph Lameter	511030bcd2	[PATCH] Fix sys_migrate_pages: Move all pages when invoked from root Currently sys_migrate_pages only moves pages belonging to a process. This is okay when invoked from a regular user. But if invoked from root it should move all pages as documented in the migrate_pages manpage. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-28 20:53:43 -08:00
Christoph Lameter	d4f7796e9b	[PATCH] vmscan: fix zone_reclaim - PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write out pages to swap. Currently RECLAIM_SWAP may not do that. - remove setting nr_reclaimed pages after slab reclaim since the slab shrinking code does not use that and the nr_reclaimed pages is just right for the intended follow up action. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-24 14:31:39 -08:00
Christoph Lameter	1e275d406b	[PATCH] page migration: Fix MPOL_INTERLEAVE behavior for migration via mbind() migrate_pages_to() allocates a list of new pages on the intended target node or with the intended policy and then uses the list of new pages as targets for the migration of a list of pages out of place. When the pages are allocated it is not clear which of the out of place pages will be moved to the new pages. So we cannot specify an address as needed by alloc_page_vma(). This causes problem for MPOL_INTERLEAVE which will currently allocate the pages on the first node of the set. If mbind is used with vma that has the policy of MPOL_INTERLEAVE then the interleaving of pages may be destroyed. This patch fixes that by generating a fake address for each alloc_page_vma which will result is a distribution of pages as prescribed by MPOL_INTERLEAVE. Lee also noted that the sequence of nodes for the new pages seems to be inverted. So we also invert the way the lists of pages for migration are build. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Looks-ok-to: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-24 14:31:38 -08:00
Hugh Dickins	b00dc3ad74	[PATCH] tmpfs: fix mount mpol nodelist parsing I've been dissatisfied with the mpol_nodelist mount option which was added to tmpfs earlier in -rc. Replace it by mpol=policy:nodelist. And it was broken: a nodelist is a comma-separated list of numbers and ranges; the mount options are a comma-separated list of token=values. Whoops, blindly strsep'ing on commas doesn't work so well: since we've no numeric tokens, and unlikely to add them, use that to distinguish. Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject all its options as invalid if not NUMA. /proc shows MPOL_PREFERRED as "prefer", so use that name for the policy instead of "preferred". Enforce that mpol=default has no nodelist; that mpol=prefer has one node only; that mpol=bind has a nodelist; but let mpol=interleave use node_online_map if no nodelist given. Describe this in tmpfs.txt. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Robin Holt <holt@sgi.com> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-21 17:10:15 -08:00
Alexey Dobriyan	fcab6f3513	[PATCH] mm/mempolicy.c: fix 'if ();' typo [akpm; it happens that the code was still correct, only inefficient ] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-20 20:00:11 -08:00
Luke Yang	7a9166e3b0	[PATCH] Fix undefined symbols for nommu architecture Signed-off-by: Luke Yang <luke.adi@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-20 20:00:11 -08:00
Andi Kleen	a9c930bac1	[PATCH] Fix units in mbind check maxnode is a bit index and can't be directly compared against a byte length like PAGE_SIZE Signed-off-by: Andi Kleen <ak@suse.de> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-20 20:00:10 -08:00
Christoph Lameter	9b0f8b040a	[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-20 20:00:09 -08:00
Kurt Garloff	9827b781f2	[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children / list_for_each(tsk, &p->children) { struct task_struct chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-20 20:00:09 -08:00
Chris Wright	636f13c174	[PATCH] sys_mbind sanity checking Make sure maxnodes is safe size before calculating nlongs in get_nodes(). Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-17 14:09:22 -08:00
Linus Torvalds	4cf808eb44	[PATCH] Handle holes in node mask in node fallback list setup Change the find_next_best_node algorithm to correctly skip over holes in the node online mask. Previously it would not handle missing nodes correctly and cause crashes at boot. [Written by Linus, tested by AK] Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-17 13:27:06 -08:00
Andi Kleen	dd942ae331	[PATCH] Handle all and empty zones when setting up custom zonelists for mbind The memory allocator doesn't like empty zones (which have an uninitialized freelist), so a x86-64 system with a node fully in GFP_DMA32 only would crash on mbind. Fix that up by putting all possible zones as fallback into the zonelist and skipping the empty ones. In fact the code always enough allocated space for all zones, but only used it for the highest. This change just uses all the memory that was allocated before. This should work fine for now, but whoever implements node hot removal needs to fix this somewhere else too (or make sure zone datastructures by itself never go away, only their memory) Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-17 08:18:14 -08:00
Andi Kleen	a62eaf151d	[PATCH] x86_64: Add boot option to disable randomized mappings and cleanup AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution installation it's easiest if it's a boot time option. Also I moved the variable to a more appropiate place and make it independent from sysctl And marked __read_mostly which it is. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-17 08:00:40 -08:00
Michael S. Tsirkin	f822566165	[PATCH] madvise MADV_DONTFORK/MADV_DOFORK Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages). This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMA's into this page after the COW. In case of mlock'd memory, the parent is not getting the realtime/security benefits of mlock. In particular, this affects the Infiniband modules which do DMA from and into user pages all the time. This patch adds madvise options to control whether memory range is inherited across fork. Useful e.g. for when hardware is doing DMA from/into these pages. Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-14 16:09:34 -08:00
Hugh Dickins	d98c7a0984	[PATCH] compound page: default destructor Somehow I imagined that calling a NULL destructor would free a compound page rather than oopsing. No, we must supply a default destructor, __free_pages_ok using the order noted by prep_compound_page. hugetlb can still replace this as before with its own free_huge_page pointer. The case that needs this is not common: rarely does put_compound_page's put_page_testzero bring the count down to 0. But if get_user_pages is applied to some part of a compound page, without immediate release (e.g. AIO or Infiniband), then it's possible for its put_page to come after the containing vma has been unmapped and the driver done its free_pages. That's just the kind of case compound pages are supposed to be guarding against (but Nick points out, nor did PageReserved handle this right). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-14 16:09:33 -08:00
Hugh Dickins	41d78ba550	[PATCH] compound page: use page[1].lru If a compound page has its own put_page_testzero destructor (the only current example is free_huge_page), that is noted in page[1].mapping of the compound page. But that's rather a poor place to keep it: functions which call set_page_dirty_lock after get_user_pages (e.g. Infiniband's __ib_umem_release) ought to be checking first, otherwise set_page_dirty is liable to crash on what's not the address of a struct address_space. And now I'm about to make that worse: it turns out that every compound page needs a destructor, so we can no longer rely on hugetlb pages going their own special way, to avoid further problems of page->mapping reuse. For example, not many people know that: on 50% of i386 -Os builds, the first tail page of a compound page purports to be PageAnon (when its destructor has an odd address), which surprises page_add_file_rmap. Keep the compound page destructor in page[1].lru.next instead. And to free up the common pairing of mapping and index, also move compound page order from index to lru.prev. Slab reuses page->lru too: but if we ever need slab to use compound pages, it can easily stack its use above this. (akpm: decoded version of the above: the tail pages of a compound page now have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]() caller to check that they're not compund pages before doing the dirty). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-14 16:09:33 -08:00
Christoph Lameter	2903fb1694	[PATCH] vmscan: skip reclaim_mapped determination if we do not swap This puts the variables and the way to get to reclaim_mapped in one block. And allows zone_reclaim or other things to skip the determination (maybe this whole block of code does not belong into refill_inactive_zone()?) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-11 21:41:11 -08:00
Christoph Lameter	072eaa5d9c	[PATCH] vmscan: remove duplicate increment of reclaim_in_progress shrink_zone() already increments reclaim_in_progress. No need to do it in balance_pgdat. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-11 21:41:11 -08:00
Christoph Lameter	80e4342601	[PATCH] zone reclaim: do not check references to a page during zone reclaim shrink_list() and refill_inactive() check all ptes pointing to a page for reference bits in order to decide if the page should be put on the active list. This is not necessary for zone_reclaim since we are only interested in removing unmapped pages. Skip the checks in both functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-11 21:41:11 -08:00
Christoph Lameter	418aade459	[PATCH] Updates for page migration This adds some additional comments in order to help others figure out how exactly the code works. And fix a variable name. Also swap_page does need to ignore all reference bits when unmapping a page. Otherwise we may have to repeatedly unmap a frequently touched page. So change the try_to_unmap parameter to 1. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-10 08:13:13 -08:00
Ravikiran G Thirumalai	f0188f4748	[PATCH] slab: Avoid deadlock at kmem_cache_create/kmem_cache_destroy Prevents deadlock situation between kmem_cache_create()/kmem_cache_destory(), and kmem_cache_create() /cpu hotplug. The locking order probably got moved over time. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-10 08:13:12 -08:00
Ingo Molnar	9934a7939e	[PATCH] SLOB=y && SMP=y fix fix CONFIG_SLOB=y (when CONFIG_SMP=y): get rid of the 'align' parameter from its __alloc_percpu() implementation. Boot-tested on x86. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-08 07:52:58 -08:00
Nick Piggin	8519fb30e4	[PATCH] mm: compound release fix Compound pages on SMP systems can now often be freed from pagetables via the release_pages path. This uses put_page_testzero which does not handle compound pages at all. Releasing constituent pages from process mappings decrements their count to a large negative number and leaks the reference at the head page - net result is a memory leak. The problem was hidden because the debug check in put_page_testzero itself actually did take compound pages into consideration. Fix the bug and the debug check. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-07 16:12:33 -08:00
Christoph Lameter	0df420d8b6	[PATCH] hugetlbpage: return VM_FAULT_OOM on oom Remove wrong and misleading comments. Return VM_FAULT_OOM if the hugetlbpage fault handler cannot allocate a page. do_no_page will end up doing do_exit(SIGKILL). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-07 16:12:31 -08:00
David Gibson	a2dfef6947	[PATCH] Hugepages need clear_user_highpage() not clear_highpage() When hugepages are newly allocated to a file in mm/hugetlb.c, we clear them with a call to clear_highpage() on each of the subpages. We should be using clear_user_highpage(): on powerpc, at least, clear_highpage() doesn't correctly mark the page as icache dirty so if the page is executed shortly after it's possible to get strange results. Signed-off-by: David Gibson <dwg@au1.ibm.com> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-07 16:12:31 -08:00
Linus Torvalds	7a21ef6fe9	mm/slab.c (non-NUMA): Fix compile warning and clean up code The non-NUMA case would do an unmatched "free_alien_cache()" on an alien pointer that had never been allocated. It might not matter from a code generation standpoint (since in the non-NUMA case, the code doesn't actually _do_ anything), but it not only results in a compiler warning, it's really really ugly too. Fix the compiler warning by just having a matching dummy allocation. That also avoids an unnecessary #ifdef in the code. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-05 11:26:38 -08:00
Ravikiran G Thirumalai	4484ebf12b	[PATCH] NUMA slab locking fixes: fix cpu down and up locking This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab allocator. Sonny Rao <sonny@burdell.org> reported problems sometime back on POWER5 boxes, when the last cpu on the nodes were being offlined. We could not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not being updated on cpu down. Since that issue is now fixed, we can reproduce Sonny's problems on x86_64 NUMA, and here is the fix. The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go down, the array_caches (shared, alien) and the kmem_list3 of the node were being freed (kfree) with the kmem_list3 lock held. If the l3 or the array_caches were to come from the same cache being cleared, we hit on badness. This patch cleans up the locking in cpu_up and cpu_down path. We cannot really free l3 on cpu down because, there is no node offlining yet and even though a cpu is not yet up, node local memory can be allocated for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy. Hence, we don't need cachep->spinlock protection to get to the cachep->nodelist[nodeid] either. Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4 dbench process running all the time. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-05 11:06:53 -08:00
Ravikiran G Thirumalai	ca3b9b9173	[PATCH] NUMA slab locking fixes: irq disabling from cahep->spinlock to l3 lock Earlier, we had to disable on chip interrupts while taking the cachep->spinlock because, at cache_grow, on every addition of a slab to a slab cache, we incremented colour_next which was protected by the cachep->spinlock, and cache_grow could occur at interrupt context. Since, now we protect the per-node colour_next with the node's list_lock, we do not need to disable on chip interrupts while taking the per-cache spinlock, but we just need to disable interrupts when taking the per-node kmem_list3 list_lock. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-05 11:06:53 -08:00
Ravikiran G Thirumalai	2e1217cf96	[PATCH] NUMA slab locking fixes: move color_next to l3 colour_next is used as an index to add a colouring offset to a new slab in the cache (colour_off * colour_next). Now with the NUMA aware slab allocator, it makes sense to colour slabs added on the same node sequentially with colour_next. This patch moves the colouring index "colour_next" per-node by placing it on kmem_list3 rather than kmem_cache. This also helps simplify locking for CPU up and down paths. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-05 11:06:53 -08:00
Christoph Lameter	64b4a954b0	[PATCH] hugetlb: add comment explaining reasons for Bus Errors I just spent some time researching a Bus Error. Turns out that the huge page fault handler can return VM_FAULT_SIGBUS for various conditions where no huge page is available. Add a note explaining the reasoning in the source. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-05 11:06:53 -08:00
Eric Dumazet	88a2a4ac6b	[PATCH] percpu data: only iterate over possible CPUs percpu_data blindly allocates bootmem memory to store NR_CPUS instances of cpudata, instead of allocating memory only for possible cpus. As a preparation for changing that, we need to convert various 0 -> NR_CPUS loops to use for_each_cpu(). (The above only applies to users of asm-generic/percpu.h. powerpc has gone it alone and is presently only allocating memory for present CPUs, so it's currently corrupting memory). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@suse.de> Cc: Anton Blanchard <anton@samba.org> Acked-by: William Irwin <wli@holomorphy.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-05 11:06:51 -08:00
Chen, Kenneth W	00ac59adfc	[PATCH] x86_64: Fix memory policy build without CONFIG_HUGETLBFS > mm/mempolicy.c: In function `huge_zonelist': > mm/mempolicy.c:1045: error: `HPAGE_SHIFT' undeclared (first use in this function) > mm/mempolicy.c:1045: error: (Each undeclared identifier is reported only once > mm/mempolicy.c:1045: error: for each function it appears in.) > make[1]: *** [mm/mempolicy.o] Error 1 Need to wrap huge_zonelist function with CONFIG_HUGETLBFS. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-04 16:43:14 -08:00
Randy Dunlap	ee13d785ea	[PATCH] slab: fix sparse warning mm/slab.c:1522:13: error: incompatible types for operation (&) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Randy.Dunlap	a70773ddb9	[PATCH] mm/slab: add kernel-doc for one function Fix kernel-doc for calculate_slab_order(). Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Pekka Enberg	7fd6b14130	[PATCH] slab: fix kzalloc and kstrdup caller report for CONFIG_DEBUG_SLAB Fix kzalloc() and kstrdup() caller report for CONFIG_DEBUG_SLAB. We must pass the caller to __cache_alloc() instead of directly doing __builtin_return_address(0) there; otherwise kzalloc() and kstrdup() are reported as the allocation site instead of the real one. Thanks to Valdis Kletnieks for reporting the problem and Steven Rostedt for the original idea. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Andrew Morton	b958f7d9f3	[PATCH] dump_stack() in oom handler Sometimes it's nice to know who's calling. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Pekka Enberg	343e0d7a93	[PATCH] slab: replace kmem_cache_t with struct kmem_cache Replace uses of kmem_cache_t with proper struct kmem_cache in mm/slab.c. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Pekka Enberg	9a2dba4b49	[PATCH] slab: rename ac_data to cpu_cache_get Rename the ac_data() function to more descriptive cpu_cache_get(). Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Pekka Enberg	6ed5eb2211	[PATCH] slab: extract virt_to_{cache\|slab} Introduce virt_to_cache() and virt_to_slab() functions to reduce duplicate code and introduce a proper abstraction should we want to support other kind of mapping for address to slab and cache (eg. for vmalloc() or I/O memory). Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:18 -08:00
Pekka Enberg	5295a74cc0	[PATCH] slab: reduce inlining From: Manfred Spraul <manfred@colorfullife.com> Reduce the amount of inline functions in slab to the functions that are used in the hot path: - no inline for debug functions - no __always_inline, inline is already __always_inline - remove inline from a few numa support functions. Before: text data bss dec hex filename 13588 752 48 14388 3834 mm/slab.o (defconfig) 16671 2492 48 19211 4b0b mm/slab.o (numa) After: text data bss dec hex filename 13366 752 48 14166 3756 mm/slab.o (defconfig) 16230 2492 48 18770 4952 mm/slab.o (numa) Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Matthew Dobson	78d382d77c	[PATCH] slab: extract slab_{put\|get}_obj Create two helper functions slab_get_obj() and slab_put_obj() to replace duplicated code in mm/slab.c Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Matthew Dobson	12dd36faec	[PATCH] slab: extract slab_destroy_objs() Create a helper function, slab_destroy_objs() which called from slab_destroy(). This makes slab_destroy() smaller and more readable, and moves ifdefs outside the function body. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Steven Rostedt	fbaccacff1	[PATCH] slab: cache_estimate cleanup Clean up cache_estimate() in mm/slab.c and improves the algorithm from O(n) to O(1). We first calculate the maximum number of objects a slab can hold after struct slab and kmem_bufctl_t for each object has been given enough space. After that, to respect alignment rules, we decrease the number of objects if necessary. As required padding is at most align-1 and memory of obj_size is at least align, it is always enough to decrease number of objects by one. The optimization was originally made by Balbir Singh with more improvements from Steven Rostedt. Manfred Spraul provider further modifications: no loop at all for the off-slab case and added comments to explain the background. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Steven Rostedt	5ec8a847bb	[PATCH] slab: have index_of bug at compile time I noticed the code for index_of is a creative way of finding the cache index using the compiler to optimize to a single hard coded number. But I couldn't help noticing that it uses two methods to let you know that someone used it wrong. One is at compile time (the correct way), and the other is at run time (not good). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Christoph Lameter	18f820f655	[PATCH] slab: minor cleanup to kmem_cache_alloc_node Clean up kmem_cache_alloc_node a bit. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Manfred Spraul	3dafccf227	[PATCH] slab: distinguish between object and buffer size An object cache has two different object lengths: - the amount of memory available for the user (object size) - the amount of memory allocated internally (buffer size) This patch does some renames to make the code reflect that better. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Christoph Lameter	e965f9630c	[PATCH] Direct Migration V9: Avoid writeback / page_migrate() method Migrate a page with buffers without requiring writeback This introduces a new address space operation migratepage() that may be used by a filesystem to implement its own version of page migration. A version is provided that migrates buffers attached to pages. Some filesystems (ext2, ext3, xfs) are modified to utilize this feature. The swapper address space operation are modified so that a regular migrate_page() will occur for anonymous pages without writeback (migrate_pages forces every anonymous page to have a swap entry). Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:17 -08:00
Christoph Lameter	7e2ab150d1	[PATCH] Direct Migration V9: upgrade MPOL_MF_MOVE and sys_migrate_pages() Modify policy layer to support direct page migration - Add migrate_pages_to() allowing the migration of a list of pages to a a specified node or to vma with a specific allocation policy in sets of MIGRATE_CHUNK_SIZE pages - Modify do_migrate_pages() to do a staged move of pages from the source nodes to the target nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	a3351e525e	[PATCH] Direct Migration V9: remove_from_swap() to remove swap ptes Add remove_from_swap remove_from_swap() allows the restoration of the pte entries that existed before page migration occurred for anonymous pages by walking the reverse maps. This reduces swap use and establishes regular pte's without the need for page faults. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	a48d07afdf	[PATCH] Direct Migration V9: migrate_pages() extension Add direct migration support with fall back to swap. Direct migration support on top of the swap based page migration facility. This allows the direct migration of anonymous pages and the migration of file backed pages by dropping the associated buffers (requires writeout). Fall back to swap out if necessary. The patch is based on lots of patches from the hotplug project but the code was restructured, documented and simplified as much as possible. Note that an additional patch that defines the migrate_page() method for filesystems is necessary in order to avoid writeback for anonymous and file backed pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	b16664e44c	[PATCH] Direct Migration V9: PageSwapCache checks Check for PageSwapCache after looking up and locking a swap page. The page migration code may change a swap pte to point to a different page under lock_page(). If that happens then the vm must retry the lookup operation in the swap space to find the correct page number. There are a couple of locations in the VM where a lock_page() is done on a swap page. In these locations we need to check afterwards if the page was migrated. If the page was migrated then the old page that was looked up before was freed and no longer has the PageSwapCache bit set. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	2a16e3f4b0	[PATCH] Reclaim slab during zone reclaim If large amounts of zone memory are used by empty slabs then zone_reclaim becomes uneffective. This patch shakes the slab a bit. The problem with this patch is that the slab reclaim is not containable to a zone. Thus slab reclaim may affect the whole system and be extremely slow. This also means that we cannot determine how many pages were freed in this zone. Thus we need to go off node for at least one allocation. The functionality is disabled by default. We could modify the shrinkers to take a zone parameter but that would be quite invasive. Better ideas are welcome. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	1b2ffb7896	[PATCH] Zone reclaim: Allow modification of zone reclaim behavior In some situations one may want zone_reclaim to behave differently. For example a process writing large amounts of memory will spew unto other nodes to cache the writes if many pages in a zone become dirty. This may impact the performance of processes running on other nodes. Allowing writes during reclaim puts a stop to that behavior and throttles the process by restricting the pages to the local zone. Similarly one may want to contain processes to local memory by enabling regular swap behavior during zone_reclaim. Off node memory allocation can then be controlled through memory policies and cpusets. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	2a11ff06d7	[PATCH] zone_reclaim: configurable off node allocation period. Currently the zone_reclaim code has a fixed window of 30 seconds of off node allocations should a local zone have no unused pagecache pages left. Reclaim will be attempted again after this timeout period to avoid repeated useless scans for memory. This is also useful to established sufficiently large off node allocation chunks to relieve the local node. It may be beneficial to adjust that time period for some special situations. For example if memory use was exceeding node capacity one may want to give up for longer periods of time. If memory spikes intermittendly then one may want to shorten the time period to reduce the number of off node allocations. This patch allows just that.... Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	a92f71263a	[PATCH] zone_reclaim: partial scans instead of full scan Instead of scanning all the pages in a zone, imitate real swap and scan only a portion of the pages and gradually scan more if we do not free up enough pages. This avoids a zone suddenly loosing all unused pagecache pages (we may after all access some of these again so they deserve another chance) but it still frees up large chunks of memory if a zone only contains unused pagecache pages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:16 -08:00
Christoph Lameter	aa3f18b339	[PATCH] zone_reclaim: do not unmap file backed pages zone_reclaim should leave that to the real swapper. We are only interested in evicting unmapped pages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:15 -08:00
Benjamin LaHaise	9884fd8df1	[PATCH] Use 32 bit division in slab_put_obj() Improve the performance of slab_put_obj(). Without the cast, gcc considers ptrdiff_t a 64 bit signed integer and ends up emitting code to use a full signed 128 bit divide on EM64T, which is substantially slower than a 32 bit unsigned divide. I noticed this when looking at the profile of a case where the slab balance is just on edge and thrashes back and forth freeing a block. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:15 -08:00
Christoph Lameter	c84db23c6e	[PATCH] zone_reclaim: minor fixes - If we only reclaim nr_pages then its okay to stay on node. Switch from > to >= for the comparison. - vm_table[] entry for zone_reclaim_mode is a bit screwed up. - Add empty lines around shrink_zone to show that this is the central function to be called. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:15 -08:00
Christoph Lameter	52a8363eae	[PATCH] mm: improve function of sc->may_writepage Make sc->may_writepage control the writeout behavior of shrink_list. Remove the laptop_mode trick from shrink_list and instead set may_writepage in try_to_free_pages properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:15 -08:00
Christoph Lameter	42c722d4cb	[PATCH] zone_reclaim: reclaim on memory only node support Zone reclaim is usually only run on the local node. Headless nodes do not have any local processors. This patch checks for headless nodes and performs zone reclaim on them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:14 -08:00
Christoph Lameter	8928862398	[PATCH] Optimize off-node performance of zone reclaim Ensure that the performance of off node pages stays the same as before. Off node pagefault tests showed an 18% drop in performance without this patch. - Increase the timeout to 30 seconds to reduce the overhead. - Move all code possible out of the off node hot path for zone reclaim (Sorry Andrew, the struct initialization had to be sacrificed). The read_page_state() bit us there. - Check first for the timeout before any other checks. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:14 -08:00
Ashok Raj	6292d9aaf3	[PATCH] __cpuinit functions wrongly marked __meminit __meminit has overzelously been modified and crept its way into marking cpuup callbacks as __meminit. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-02-01 08:53:09 -08:00
Christoph Lameter	86c562a9d6	[PATCH] mm: optimize numa policy handling in slab allocator Move the interrupt check from slab_node into ___cache_alloc and adds an "unlikely()" to avoid pipeline stalls on some architectures. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:18 -08:00
Christoph Lameter	dc85da15d4	[PATCH] NUMA policies in the slab allocator V2 This patch fixes a regression in 2.6.14 against 2.6.13 that causes an imbalance in memory allocation during bootup. The slab allocator in 2.6.13 is not numa aware and simply calls alloc_pages(). This means that memory policies may control the behavior of alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE resulting in the spreading out of allocations during bootup over all available nodes. The slab allocator in 2.6.13 has only a single list of slab pages. As a result the per cpu slab cache and the spinlock controlled page lists may contain slab entries from off node memory. The slab allocator in 2.6.13 makes no effort to discern the locality of an entry on its lists. The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages explicitly by calling alloc_pages_node(). The NUMA slab allocator manages slab entries by having lists of available slab pages for each node. The per cpu slab cache can only contain slab entries associated with the node local to the processor. This guarantees that the default allocation mode of the slab allocator always assigns local memory if available. Setting MPOL_INTERLEAVE as a default policy during bootup has no effect anymore. In 2.6.14 all node unspecific slab allocations are performed on the boot processor. This means that most of key data structures are allocated on one node. Most processors will have to refer to these structures making the boot node a potential bottleneck. This may reduce performance and cause unnecessary memory pressure on the boot node. This patch implements NUMA policies in the slab layer. There is the need of explicit application of NUMA memory policies by the slab allcator itself since the NUMA slab allocator does no longer let the page_allocator control locality. The check for policies is made directly at the beginning of __cache_alloc using current->mempolicy. The memory policy is already frequently checked by the page allocator (alloc_page_vma() and alloc_page_current()). So it is highly likely that the cacheline is present. For MPOL_INTERLEAVE kmalloc() will spread out each request to one node after another so that an equal distribution of allocations can be obtained during bootup. It is not possible to push the policy check to lower layers of the NUMA slab allocator since the per cpu caches are now only containing slab entries from the current node. If the policy says that the local node is not to be preferred or forbidden then there is no point in checking the slab cache or local list of slab pages. The allocation better be directed immediately to the lists containing slab entries for the allowed set of nodes. This way of applying policy also fixes another strange behavior in 2.6.13. alloc_pages() is controlled by the memory allocation policy of the current process. It could therefore be that one process is running with MPOL_INTERLEAVE and would f.e. obtain a new page following that policy since no slab entries are in the lists anymore. A page can typically be used for multiple slab entries but lets say that the current process is only using one. The other entries are then added to the slab lists. These are now non local entries in the slab lists despite of the possible availability of local pages that would provide faster access and increase the performance of the application. Another process without MPOL_INTERLEAVE may now run and expect a local slab entry from kmalloc(). However, there are still these free slab entries from the off node page obtained from the other process via MPOL_INTERLEAVE in the cache. The process will then get an off node slab entry although other slab entries may be available that are local to that process. This means that the policy if one process may contaminate the locality of the slab caches for other processes. This patch in effect insures that a per process policy is followed for the allocation of slab entries and that there cannot be a memory policy influence from one process to another. A process with default policy will always get a local slab entry if one is available. And the process using memory policies will get its memory arranged as requested. Off-node slab allocation will require the use of spinlocks and will make the use of per cpu caches not possible. A process using memory policies to redirect allocations offnode will have to cope with additional lock overhead in addition to the latency added by the need to access a remote slab entry. Changes V1->V2 - Remove #ifdef CONFIG_NUMA by moving forward declaration into prior #ifdef CONFIG_NUMA section. - Give the function determining the node number to use a saner name. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:18 -08:00
Ingo Molnar	fc0abb1451	[PATCH] sem2mutex: mm/slab.c Convert mm/swapfile.c's swapon_sem to swapon_mutex. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:18 -08:00
Christoph Lameter	9eeff2395e	[PATCH] Zone reclaim: Reclaim logic Some bits for zone reclaim exists in 2.6.15 but they are not usable. This patch fixes them up, removes unused code and makes zone reclaim usable. Zone reclaim allows the reclaiming of pages from a zone if the number of free pages falls below the watermarks even if other zones still have enough pages available. Zone reclaim is of particular importance for NUMA machines. It can be more beneficial to reclaim a page than taking the performance penalties that come with allocating a page on a remote zone. Zone reclaim is enabled if the maximum distance to another node is higher than RECLAIM_DISTANCE, which may be defined by an arch. By default RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same component (enclosure or motherboard) on IA64. The meaning of the NUMA distance information seems to vary by arch. If zone reclaim is not successful then no further reclaim attempts will occur for a certain time period (ZONE_RECLAIM_INTERVAL). This patch was discussed before. See http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2 http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:17 -08:00
Christoph Lameter	f1fd1067ec	[PATCH] Zone reclaim: resurrect may_swap Zone reclaim has a huge impact on NUMA performance (f.e. our maximum throughput with XFS is raised from 4GB to 6GB/sec / page cache contamination of numa nodes destroys locality if one just does a large copy operation which results in performance dropping for good until reboot). This patch: Resurrect may_swap in struct scan_control Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:17 -08:00
Christoph Lameter	fc30128963	[PATCH] Simplify migrate_page_add Simplify migrate_page_add after feedback from Hugh. This also allows us to drop one parameter from migrate_page_add. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:17 -08:00
Nick Piggin	053837fce7	[PATCH] mm: migration page refcounting fix Migration code currently does not take a reference to target page properly, so between unlocking the pte and trying to take a new reference to the page with isolate_lru_page, anything could happen to it. Fix this by holding the pte lock until we get a chance to elevate the refcount. Other small cleanups while we're here. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:17 -08:00
Andrew Morton	e236a166b2	[PATCH] mm: dirty_exceeded speedup Ravikiran reports that this variable is bouncing all around nodes on NUMA machines, causing measurable performance problems. Fix that up by only writing to it when it actually changed. And put it in a new cacheline to prevent it sharing with other things (this happened). Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-18 19:20:17 -08:00
Matt Tolentino	c09b42404d	[PATCH] x86_64: add __meminit for memory hotplug Add __meminit to the __init lineup to ensure functions default to __init when memory hotplug is not enabled. Replace __devinit with __meminit on functions that were changed when the memory hotplug code was introduced. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-16 23:18:35 -08:00
Paul Jackson	505970b96e	[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-14 18:27:10 -08:00
Robin Holt	7339ff8302	[PATCH] Add tmpfs options for memory placement policies Anything that writes into a tmpfs filesystem is liable to disproportionately decrease the available memory on a particular node. Since there's no telling what sort of application (e.g. dd/cp/cat) might be dropping large files there, this lets the admin choose the appropriate default behavior for their site's situation. Introduce a tmpfs mount option which allows specifying a memory policy and a second option to specify the nodelist for that policy. With the default policy, tmpfs will behave as it does today. This patch adds support for preferred, bind, and interleave policies. The default policy will cause pages to be added to tmpfs files on the node which is doing the writing. Some jobs expect a single process to create and manage the tmpfs files. This results in a node which has a significantly reduced number of free pages. With this patch, the administrator can specify the policy and nodes for that policy where they would prefer allocations. This patch was originally written by Brent Casavant and Hugh Dickins. I added support for the bind and preferred policies and the mpol_nodelist mount option. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-14 18:27:07 -08:00
Linus Torvalds	9f5974c873	Merge git://oss.sgi.com:8090/oss/git/xfs-2.6	2006-01-12 09:10:34 -08:00
Greg Ungerer	cbe8dd4af2	[PATCH] memmap_init_zone(): remove uneccesary page++ Remove unecessary page++ from memmap_init_zone loop. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-12 09:08:49 -08:00
Catalin Marinas	2a7e2f7dcb	[PATCH] do_truncate() call fix in tiny-shmem.c Adapt tiny-shmem.c to the new do_truncate() prototype. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-12 09:08:49 -08:00
Christoph Lameter	f4598c8b36	[PATCH] migration: make sure there is no attempt to migrate reserved pages. This ensures that reserved pages are not migrated. Reserved pages currently cause the WARN_ON to trigger in migrate_page_add() Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-12 09:08:48 -08:00
Randy.Dunlap	c59ede7b78	[PATCH] move capable() to capability.h - Move capable() from sched.h to capability.h; - Use <linux/capability.h> where capable() is used (in include/, block/, ipc/, kernel/, a few drivers/, mm/, security/, & sound/; many more drivers/ to go) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-11 18:42:13 -08:00
Paul Jackson	4eac915d02	[PATCH] mm: gfp_atomic comments Clarify in comments that GFP_ATOMIC means both "don't sleep" and "use emergency pools", hence both ALLOC_HARDER and ALLOC_HIGH. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-11 18:42:09 -08:00
Hugh Dickins	7365f3d169	[PATCH] Restore KERN_EMERG to each line printed by bad_page Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-11 18:42:08 -08:00
Nathan Scott	ddae9c2ea7	Merge HEAD from oss.sgi.com:/oss/git/linux-2.6.git	2006-01-12 13:34:47 +11:00
David Woodhouse	a4fc7ab1d0	[PATCH] fix/simplify mutex debugging code Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as arguments instead, since all its callers were just calculating the 'to' address for themselves anyway... (and sometimes doing so badly). Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-11 08:14:16 -08:00
Christoph Hellwig	78539fdfa4	[XFS] Export pagevec_lookup for use on the XFS page writeout path, for dealing with delayed allocate and unwritten extents (as well). Signed-off-by: Christoph Hellwig <hch@sgi.com> Signed-off-by: Nathan Scott <nathans@sgi.com>	2006-01-11 20:47:41 +11:00
Jesper Juhl	e97a31117c	add missing printk loglevel in mm/swapfile.c in mm/swapfile.c a printk() is missing a loglevel. I believe the proper loglevel for this situation is KERN_ERR, so that's what the patch below sets -if you agree, please apply. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-01-11 01:50:28 +01:00
Christoph Hellwig	870f481793	[PATCH] replace inode_update_time with file_update_time To allow various options to work per-mount instead of per-sb we need a struct vfsmount when updating ctime and mtime. This preparation patch replaces the inode_update_time routine with a file_update_atime routine so we can easily get at the vfsmount. (and the file makes more sense in this context anyway). Also get rid of the unused second argument - we always want to update the ctime when calling this routine. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-10 08:01:30 -08:00
Jes Sorensen	1b1dcc1b57	[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2006-01-09 15:59:24 -08:00
Ingo Molnar	de5097c2e7	[PATCH] mutex subsystem, more debugging code more mutex debugging: check for held locks during memory freeing, task exit, enable sysrq printouts, etc. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org>	2006-01-09 15:59:21 -08:00
Linus Torvalds	6150c32589	Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc-merge	2006-01-09 10:03:44 -08:00
Valentine Barshak	87ba81dba4	[PATCH] fadvise: return ESPIPE on FIFO/pipe The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be fully POSIX-compliant. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:14:00 -08:00
OGAWA Hirofumi	28fd129827	[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait) This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it. See mm/filemap.c: And changes the filemap_write_and_wait() and filemap_write_and_wait_range(). Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite() returns error. However, even if filemap_fdatawrite() returned an error, it may have submitted the partially data pages to the device. (e.g. in the case of -ENOSPC) <quotation> Andrew Morton writes, If filemap_fdatawrite() returns an error, this might be due to some I/O problem: dead disk, unplugged cable, etc. Given the generally crappy quality of the kernel's handling of such exceptions, there's a good chance that the filemap_fdatawait() will get stuck in D state forever. </quotation> So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO. Trond, could you please review the nfs part? Especially I'm not sure, nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not. Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:47 -08:00
OGAWA Hirofumi	268fc16e34	[PATCH] export/change sync_page_range/_nolock() This exports/changes the sync_page_range/_nolock(). The fatfs needs sync_page_range/_nolock() for expanding truncate, and changes "size_t count" to "loff_t count". Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:47 -08:00
Paul Jackson	4225399a66	[PATCH] cpuset: rebind vma mempolicies fix Fix more of longstanding bug in cpuset/mempolicy interaction. NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains internal state for each mempolicy, tracking what nodes are used for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies. When a tasks cpuset memory placement changes, whether because the cpuset changed, or because the task was attached to a different cpuset, then the tasks mempolicies have to be rebound to the new cpuset placement, so as to preserve the cpuset-relative numbering of the nodes in that policy. An earlier fix handled such mempolicy rebinding for mempolicies attached to a task. This fix rebinds mempolicies attached to vma's (address ranges in a tasks address space.) Due to the need to hold the task->mm->mmap_sem semaphore while updating vma's, the rebinding of vma mempolicies has to be done when the cpuset memory placement is changed, at which time mmap_sem can be safely acquired. The tasks mempolicy is rebound later, when the task next attempts to allocate memory and notices that its task->cpuset_mems_generation is out-of-date with its cpusets mems_generation. Because walking the tasklist to find all tasks attached to a changing cpuset requires holding tasklist_lock, a spinlock, one cannot update the vma's of the affected tasks while doing the tasklist scan. In general, one cannot acquire a semaphore (which can sleep) while already holding a spinlock (such as tasklist_lock). So a list of mm references has to be built up during the tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem acquired, and the vma's in that mm rebound. Once the tasklist lock is dropped, affected tasks may fork new tasks, before their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to point to the cpuset being rebound (there can only be one; cpuset modifications are done under a global 'manage_sem' semaphore), and the mpol_copy code that is used to copy a tasks mempolicies during fork catches such forking tasks, and ensures their children are also rebound. When a task is moved to a different cpuset, it is easier, as there is only one task involved. It's mm->vma's are scanned, using the same mpol_rebind_policy() as used above. It may happen that both the mpol_copy hook and the update done via the tasklist scan update the same mm twice. This is ok, as the mempolicies of each vma in an mm keep track of what mems_allowed they are relative to, and safely no-op a second request to rebind to the same nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:44 -08:00
Paul Jackson	74cb21553f	[PATCH] cpuset: numa_policy_rebind cleanup Cleanup, reorganize and make more robust the mempolicy.c code to rebind mempolicies relative to the containing cpuset after a tasks memory placement changes. The real motivator for this cleanup patch is to lay more groundwork for the upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's after the containing cpuset memory placement changes. NUMA mempolicies are constrained by the cpuset their task is a member of. When either (1) a task is moved to a different cpuset, or (2) the 'mems' mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to be recalculated, relative to their new cpuset placement. The old code used an unreliable method of determining what was the old mems_allowed constraining the mempolicy. It just looked at the tasks mems_allowed value. This sort of worked with the present code, that just rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken, referring to the old nodes. But in an upcoming patch, the vma mempolicies will be rebound as well. Then the order in which the various task and vma mempolicies are updated will no longer be deterministic, and one can no longer count on the task->mems_allowed holding the old value for as long as needed. It's not even clear if the current code was guaranteed to work reliably for task mempolicies. So I added a mems_allowed field to each mempolicy, stating exactly what mems_allowed the policy is relative to, and updated synchronously and reliably anytime that the mempolicy is rebound. Also removed a useless wrapper routine, numa_policy_rebind(), and had its caller, cpuset_update_task_memory_state(), call directly to the rewritten policy_rebind() routine, and made that rebind routine extern instead of static, and added a "mpol_" prefix to its name, making it mpol_rebind_policy(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:44 -08:00
Paul Jackson	909d75a3b7	[PATCH] cpuset: implement cpuset_mems_allowed Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code needed, to obtain the mems_allowed vector of a cpuset, and replaced the workaround in sys_migrate_pages() to call this new method. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:44 -08:00
Paul Jackson	cf2a473c40	[PATCH] cpuset: combine refresh_mems and update_mems The important code paths through alloc_pages_current() and alloc_page_vma(), by which most kernel page allocations go, both called cpuset_update_current_mems_allowed(), which in turn called refresh_mems(). -Both- of these latter two routines did a tasklock, got the tasks cpuset pointer, and checked for out of date cpuset->mems_generation. That was a silly duplication of code and waste of CPU cycles on an important code path. Consolidated those two routines into a single routine, called cpuset_update_task_memory_state(), since it updates more than just mems_allowed. Changed all callers of either routine to call the new consolidated routine. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:43 -08:00
Paul Jackson	3e0d98b9f1	[PATCH] cpuset: memory pressure meter Provide a simple per-cpuset metric of memory pressure, tracking the -rate- that the tasks in a cpuset call try_to_free_pages(), the synchronous (direct) memory reclaim code. This enables batch managers monitoring jobs running in dedicated cpusets to efficiently detect what level of memory pressure that job is causing. This is useful both on tightly managed systems running a wide mix of submitted jobs, which may choose to terminate or reprioritize jobs that are trying to use more memory than allowed on the nodes assigned them, and with tightly coupled, long running, massively parallel scientific computing jobs that will dramatically fail to meet required performance goals if they start to use more memory than allowed to them. This patch just provides a very economical way for the batch manager to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. ==> Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only systems that enable this feature will compute the metric. Why a per-cpuset, running average: Because this meter is per-cpuset, rather than per-task or mm, the system load imposed by a batch scheduler monitoring this metric is sharply reduced on large systems, because a scan of the tasklist can be avoided on each set of queries. Because this meter is a running average, instead of an accumulating counter, a batch scheduler can detect memory pressure with a single read, instead of having to read and accumulate results for a period of time. Because this meter is per-cpuset rather than per-task or mm, the batch scheduler can obtain the key information, memory pressure in a cpuset, with a single read, rather than having to query and accumulate results over all the (dynamically changing) set of tasks in the cpuset. A per-cpuset simple digital filter (requires a spinlock and 3 words of data per-cpuset) is kept, and updated by any task attached to that cpuset, if it enters the synchronous (direct) page reclaim code. A per-cpuset file provides an integer number representing the recent (half-life of 10 seconds) rate of direct page reclaims caused by the tasks in the cpuset, in units of reclaims attempted per second, times 1000. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:42 -08:00
Paul Jackson	5966514db6	[PATCH] cpuset: mempolicy one more nodemask conversion Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous conversion had left one routine using bitmaps, since it involved a corresponding change to kernel/cpuset.c Fix that interface by replacing with a simple macro that calls nodes_subset(), or if !CONFIG_CPUSET, returns (1). Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:42 -08:00
Matt Mackall	10cef60295	[PATCH] slob: introduce the SLOB allocator configurable replacement for slab allocator This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is disabled, the kernel falls back to using the 'SLOB' allocator. SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer, similar to the original Linux kmalloc allocator that SLAB replaced. It's signicantly smaller code and is more memory efficient. But like all similar allocators, it scales poorly and suffers from fragmentation more than SLAB, so it's only appropriate for small systems. It's been tested extensively in the Linux-tiny tree. I've also stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not recommended). Here's a comparison for otherwise identical builds, showing SLOB saving nearly half a megabyte of RAM: $ size vmlinux* text data bss dec hex filename 3336372 529360 190812 4056544 3de5e0 vmlinux-slab 3323208 527948 190684 4041840 3dac70 vmlinux-slob $ size mm/{slab,slob}.o text data bss dec hex filename 13221 752 48 14021 36c5 mm/slab.o 1896 52 8 1956 7a4 mm/slob.o /proc/meminfo: SLAB SLOB delta MemTotal: 27964 kB 27980 kB +16 kB MemFree: 24596 kB 25092 kB +496 kB Buffers: 36 kB 36 kB 0 kB Cached: 1188 kB 1188 kB 0 kB SwapCached: 0 kB 0 kB 0 kB Active: 608 kB 600 kB -8 kB Inactive: 808 kB 812 kB +4 kB HighTotal: 0 kB 0 kB 0 kB HighFree: 0 kB 0 kB 0 kB LowTotal: 27964 kB 27980 kB +16 kB LowFree: 24596 kB 25092 kB +496 kB SwapTotal: 0 kB 0 kB 0 kB SwapFree: 0 kB 0 kB 0 kB Dirty: 4 kB 12 kB +8 kB Writeback: 0 kB 0 kB 0 kB Mapped: 560 kB 556 kB -4 kB Slab: 1756 kB 0 kB -1756 kB CommitLimit: 13980 kB 13988 kB +8 kB Committed_AS: 4208 kB 4208 kB 0 kB PageTables: 28 kB 28 kB 0 kB VmallocTotal: 1007312 kB 1007312 kB 0 kB VmallocUsed: 48 kB 48 kB 0 kB VmallocChunk: 1007264 kB 1007264 kB 0 kB (this work has been sponsored in part by CELF) From: Ingo Molnar <mingo@elte.hu> Fix 32-bitness bugs in mm/slob.c. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:41 -08:00
Matt Mackall	30992c97ae	[PATCH] slob: introduce mm/util.c for shared functions Add mm/util.c for functions common between SLAB and SLOB. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:41 -08:00
Ravikiran G Thirumalai	22fc6eccbf	[PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros ____cacheline_maxaligned_in_smp is currently used to align critical structures and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find L1_CACHE_SHIFT_MAX useless. However, we have been using ____cacheline_maxaligned_in_smp to align structures on the internode cacheline size. As per Andi's suggestion, following patch kills ____cacheline_maxaligned_in_smp and introduces INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches. Arches needing L3/Internode cacheline alignment can define INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp With this patch, L1_CACHE_SHIFT_MAX can be killed Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:38 -08:00
Kirill Korotaev	2f659f462d	[PATCH] Optimise oom kill of current task When oom_killer kills current there's no need to call schedule_timeout_interruptible() since task must die ASAP. Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:45 -08:00
Christoph Lameter	6ce3c4c0ff	[PATCH] Move page migration related functions near do_migrate_pages() Group page migration functions in mempolicy.c Add a forward declaration for migrate_page_add (like gather_stats()) and use our new found mobility to group all page migration related function around do_migrate_pages(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	48fce3429d	[PATCH] mempolicies: unexport get_vma_policy() Since the numa_maps functionality is now in mempolicy.c we no longer need to export get_vma_policy(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	132beacf97	[PATCH] Drop page table lock before calling migrate_page_add() migrate_page_add cannot be called with a spinlock held (calls isolate_lru_page which calles schedule_on_each_cpu). Drop ptl lock in check_pte_range before calling migrate_page_add(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	1a75a6c825	[PATCH] Fold numa_maps into mempolicies.c First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2 - Use the check_range() in mempolicy.c to gather statistics. - Improve the numa_maps code in general and fix some comments. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	38e35860db	[PATCH] mempolicies: private pointer in check_range and MPOL_MF_INVERT This was was first posted at http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2 (Part of this functionality is also contained in the direct migration pathset. The functionality here is more generic and independent of that patchset.) - Add internal flags MPOL_MF_INVERT to control check_range() behavior. - Replace the pagelist passed through by check_range by a general private pointer that may be used for other purposes. (The following patches will use that to merge numa_maps into mempolicy.c and to better group the page migration code in the policy layer) - Improve some comments. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Dave Jones	ef2bf0dc86	[PATCH] rmap: additional diagnostics in page_remove_rmap() We seem to be hitting this assertion failure too often for it to be hardware bugs. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Tobias Klauser	cd105df459	[PATCH] mm: clean up local variables Clean up a local variable with the same name as a variable in a larger block. Also move a variable into the block where it's actually used. Spotted by http://linuxicc.sourceforge.net/ Signed-off-by: Tobias Klauser <tklauser@nuerscht.ch> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:43 -08:00
Christoph Lameter	aea47ff363	[PATCH] mm: make hugepages obey cpusets. See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2 http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2 Make hugepages obey cpusets. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:43 -08:00
Christoph Lameter	d0d963281c	[PATCH] SwapMig: Switch error handling in migrate_pages to use -Exx Use -Exxx instead of numeric return codes and cleanup the code in migrate_pages() using -Exx error codes. Consolidate successful migration handling Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	d498471133	[PATCH] SwapMig: Extend parameters for migrate_pages() Extend the parameters of migrate_pages() to allow the caller control over the fate of successfully migrated or impossible to migrate pages. Swap migration and direct migration will have the same interface after this patch so that patches can be independently applied to the policy layer and the core migration code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	ee27497df3	[PATCH] SwapMig: Drop unused pages immediately Drop unused pages immediately If a page is encountered that is only referenced by the migration code then there is no reason to swap or migrate the page. Release the page by calling move_to_lru(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	1480a540c9	[PATCH] SwapMig: add_to_swap() avoid atomic allocations Add gfp_mask to add_to_swap add_to_swap does allocations with GFP_ATOMIC in order not to interfere with swapping. During migration we may have use add_to_swap extensively which may lead to out of memory errors. This patch makes add_to_swap take a parameter that specifies the gfp mask. The page migration code can then make add_to_swap use GFP_KERNEL. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	8419c31810	[PATCH] SwapMig: CONFIG_MIGRATION fixes Move move_to_lru, putback_lru_pages and isolate_lru in section surrounded by CONFIG_MIGRATION saving some codesize for single processor kernels. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	39743889aa	[PATCH] Swap Migration V5: sys_migrate_pages interface sys_migrate_pages implementation using swap based page migration This is the original API proposed by Ray Bryant in his posts during the first half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org. The intent of sys_migrate is to migrate memory of a process. A process may have migrated to another node. Memory was allocated optimally for the prior context. sys_migrate_pages allows to shift the memory to the new node. sys_migrate_pages is also useful if the processes available memory nodes have changed through cpuset operations to manually move the processes memory. Paul Jackson is working on an automated mechanism that will allow an automatic migration if the cpuset of a process is changed. However, a user may decide to manually control the migration. This implementation is put into the policy layer since it uses concepts and functions that are also needed for mbind and friends. The patch also provides a do_migrate_pages function that may be useful for cpusets to automatically move memory. sys_migrate_pages does not modify policies in contrast to Ray's implementation. The current code here is based on the swap based page migration capability and thus is not able to preserve the physical layout relative to it containing nodeset (which may be a cpuset). When direct page migration becomes available then the implementation needs to be changed to do a isomorphic move of pages between different nodesets. The current implementation simply evicts all pages in source nodeset that are not in the target nodeset. Patch supports ia64, i386 and x86_64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	dc9aa5b9d6	[PATCH] Swap Migration V5: MPOL_MF_MOVE interface Add page migration support via swap to the NUMA policy layer This patch adds page migration support to the NUMA policy layer. An additional flag MPOL_MF_MOVE is introduced for mbind. If MPOL_MF_MOVE is specified then pages that do not conform to the memory policy will be evicted from memory. When they get pages back in new pages will be allocated following the numa policy. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	7cbe34cf86	[PATCH] Swap Migration V5: Add CONFIG_MIGRATION for page migration support Include page migration if the system is NUMA or having a memory model that allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM). And: - Only include lru_add_drain_per_cpu if building for an SMP system. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	49d2e9cc45	[PATCH] Swap Migration V5: migrate_pages() function This adds the basic page migration function with a minimal implementation that only allows the eviction of pages to swap space. Page eviction and migration may be useful to migrate pages, to suspend programs or for remapping single pages (useful for faulty pages or pages with soft ECC failures) The process is as follows: The function wanting to migrate pages must first build a list of pages to be migrated or evicted and take them off the lru lists via isolate_lru_page(). isolate_lru_page determines that a page is freeable based on the LRU bit set. Then the actual migration or swapout can happen by calling migrate_pages(). migrate_pages does its best to migrate or swapout the pages and does multiple passes over the list. Some pages may only be swappable if they are not dirty. migrate_pages may start writing out dirty pages in the initial passes over the pages. However, migrate_pages may not be able to migrate or evict all pages for a variety of reasons. The remaining pages may be returned to the LRU lists using putback_lru_pages(). Changelog V4->V5: - Use the lru caches to return pages to the LRU Changelog V3->V4: - Restructure code so that applying patches to support full migration does require minimal changes. Rename swapout_pages() to migrate_pages(). Changelog V2->V3: - Extract common code from shrink_list() and swapout_pages() Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Michael Kerrisk" <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	930d915252	[PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap Add PF_SWAPWRITE to control a processes permission to write to swap. - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd and pdflush - Set PF_SWAPWRITE flag for kswapd and pdflush Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	21eac81f25	[PATCH] Swap Migration V5: LRU operations This is the start of the `swap migration' patch series. Swap migration allows the moving of the physical location of pages between nodes in a numa system while the process is running. This means that the virtual addresses that the process sees do not change. However, the system rearranges the physical location of those pages. The main intent of page migration patches here is to reduce the latency of memory access by moving pages near to the processor where the process accessing that memory is running. The patchset allows a process to manually relocate the node on which its pages are located through the MF_MOVE and MF_MOVE_ALL options while setting a new memory policy. The pages of process can also be relocated from another process using the sys_migrate_pages() function call. Requires CAP_SYS_ADMIN. The migrate_pages function call takes two sets of nodes and moves pages of a process that are located on the from nodes to the destination nodes. Manual migration is very useful if for example the scheduler has relocated a process to a processor on a distant node. A batch scheduler or an administrator can detect the situation and move the pages of the process nearer to the new processor. sys_migrate_pages() could be used on non-numa machines as well, to force all of a particualr process's pages out to swap, if someone thinks that's useful. Larger installations usually partition the system using cpusets into sections of nodes. Paul has equipped cpusets with the ability to move pages when a task is moved to another cpuset. This allows automatic control over locality of a process. If a task is moved to a new cpuset then also all its pages are moved with it so that the performance of the process does not sink dramatically (as is the case today). Swap migration works by simply evicting the page. The pages must be faulted back in. The pages are then typically reallocated by the system near the node where the process is executing. For swap migration the destination of the move is controlled by the allocation policy. Cpusets set the allocation policy before calling sys_migrate_pages() in order to move the pages as intended. No allocation policy changes are performed for sys_migrate_pages(). This means that the pages may not faulted in to the specified nodes if no allocation policy was set by other means. The pages will just end up near the node where the fault occurred. There's another patch series in the pipeline which implements "direct migration". The direct migration patchset extends the migration functionality to avoid going through swap. The destination node of the relation is controllable during the actual moving of pages. The crutch of using the allocation policy to relocate is not necessary and the pages are moved directly to the target. Its also faster since swap is not used. And sys_migrate_pages() can then move pages directly to the specified node. Implement functions to isolate pages from the LRU and put them back later. This patch: An earlier implementation was provided by Hirokazu Takahashi <taka@valinux.co.jp> and IWAMOTO Toshihiro <iwamoto@valinux.co.jp> for the memory hotplug project. From: Magnus This breaks out isolate_lru_page() and putpack_lru_page(). Needed for swap migration. Signed-off-by: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Nick Piggin	48db57f8ff	[PATCH] mm: free_pages opt Try to streamline free_pages_bulk by ensuring callers don't pass in a 'count' that exceeds the list size. Some cleanups: Rename __free_pages_bulk to __free_one_page. Put the page list manipulation from __free_pages_ok into free_one_page. Make __free_pages_ok static. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Nick Piggin	23316bc86f	[PATCH] mm: cleanup zone_pcp Use zone_pcp everywhere even though NUMA code "knows" the internal details of the zone. Stop other people trying to copy, and it looks nicer. Also, only print the pagesets of online cpus in zoneinfo. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Seth, Rohit" <rohit.seth@intel.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Rohit Seth	8ad4b1fb82	[PATCH] Make high and batch sizes of per_cpu_pagelists configurable As recently there has been lot of traffic on the right values for batch and high water marks for per_cpu_pagelists. This patch makes these two variables configurable through /proc interface. A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry controls the fraction of pages at most in each zone that are allocated for each per cpu page list. The min value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist. The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Andrew Morton	9d0243bca3	[PATCH] drop-pagecache Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to discard as much pagecache and/or reclaimable slab objects as it can. THis operation requires root permissions. It won't drop dirty data, so the user should run `sync' first. Caveats: a) Holds inode_lock for exorbitant amounts of time. b) Needs to be taught about NUMA nodes: propagate these all the way through so the discarding can be controlled on a per-node basis. This is a debugging feature: useful for getting consistent results between filesystem benchmarks. We could possibly put it under a config option, but it's less than 300 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Christoph Lameter	bec6b0c89b	[PATCH] slab: remove nested #ifdef CONFIG_NUMA For some reason there is an #ifdef CONFIG_NUMA within another #ifdef CONFIG_NUMA in the page allocator. Remove innermost #ifdef CONFIG_NUMA Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Pekka Enberg	b28a02de8c	[PATCH] slab: fix code formatting The slab allocator code is inconsistent in coding style and messy. For this patch, I ran Lindent for mm/slab.c and fixed up goofs by hand. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Pekka Enberg	4d268eba11	[PATCH] slab: extract slab order calculation to separate function This patch moves the ugly loop that determines the 'optimal' size (page order) of cache slabs from kmem_cache_create() to a separate function and cleans it up a bit. Thanks to Matthew Wilcox for the help with this patch. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Pekka Enberg	85289f98dd	[PATCH] slab: extract slabinfo header printing to separate function This patch extracts slabinfo header printing to a separate function print_slabinfo_header() to make s_start() more readable. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Pekka Enberg	f9f7500521	[PATCH] slab: remove unused align parameter from alloc_percpu __alloc_percpu and alloc_percpu both take an 'align' argument which is completely ignored. snmp6_mib_init() in net/ipv6/af_inet6.c attempts to use it, but it will be ignored. Therefore, remove the 'align' argument and fixup the lone caller. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Andrew Morton	84c2008af0	[PATCH] revert "mm: page_state fixes" Hugh says: page_alloc_cpu_notify() specifically contains code to /* Add dead cpu's page_states to our own. */ which handles this more efficiently. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:38 -08:00
Arnd Bergmann	67207b9664	[PATCH] spufs: The SPU file system, base This is the current version of the spu file system, used for driving SPEs on the Cell Broadband Engine. This release is almost identical to the version for the 2.6.14 kernel posted earlier, which is available as part of the Cell BE Linux distribution from http://www.bsc.es/projects/deepcomputing/linuxoncell/. The first patch provides all the interfaces for running spu application, but does not have any support for debugging SPU tasks or for scheduling. Both these functionalities are added in the subsequent patches. See Documentation/filesystems/spufs.txt on how to use spufs. Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2006-01-09 14:49:12 +11:00
Andrew Morton	22905f775d	identify multipage ->writepages() calls NFS needs to be able to distinguish between single-page ->writepage() calls and multipage ->writepages() calls. For the single-page writepage calls NFS can kick off the I/O within the context of ->writepage(). For multipage ->writepages calls, nfs_writepage() will leave the I/O pending and nfs_writepages() will kick off the I/O when it all has been queued up within NFS. Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2006-01-06 14:58:38 -05:00
Rafael J. Wysocki	3a291a20bd	[PATCH] mm: add a new function (needed for swap suspend) This adds the function get_swap_page_of_type() allowing us to specify an index in swap_info[] and select a swap_info_struct structure to be used for allocating a swap page. This function (or another one of similar functionality) will be necessary for implementing the image-writing part of swsusp in the user space. It can also be used for simplifying the current in-kernel implementation of the image-writing part of swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:43 -08:00
Anton Blanchard	c898ec16e8	[PATCH] allow flatmem to be disabled when only sparsemem is implemented On architectures that implement sparsemem but not discontigmem we want to be able to hide the flatmem option in some cases. On ppc64 for example, when we select NUMA we must not select flatmem. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:37 -08:00
David Howells	b0e15190ea	[PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMU The attached patch makes the SYSV IPC shared memory facilities use the new ramfs facilities on a no-MMU kernel. The following changes are made: (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to allow the IPC SHM facilities to commune with the tiny-shmem and shmem code. (2) ramfs files now need resizing using do_truncate() rather than by modifying the inode size directly (see shmem_file_setup()). This causes ramfs to attempt to bind a block of pages of sufficient size to the inode. (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:32 -08:00
Nick Piggin	a74609fafa	[PATCH] mm: page_state opt Optimise page_state manipulations by introducing interrupt unsafe accessors to page_state fields. Callers must provide their own locking (either disable interrupts or not update from interrupt context). Switch over the hot callsites that can easily be moved under interrupts off sections. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:29 -08:00
Christoph Lameter	070f80326a	[PATCH] build_zonelists_node(): rename args Give j and r meaningful names. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Christoph Lameter	02a68a5ebc	[PATCH] Fix zone policy determination The use k in the inner loop means that the highest zone nr is always used if any zone of a node is populated. This means that the policy zone is not correctly determined on arches that do no use HIGHMEM like ia64. Change the loop to decrement k which also simplifies the BUG_ON. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Christoph Lameter	4be38e351c	[PATCH] mm: move determination of policy_zone into page allocator Currently the function to build a zonelist for a BIND policy has the side effect to set the policy_zone. This seems to be a bit strange. policy zone seems to not be initialized elsewhere and therefore 0. Do we police ZONE_DMA if no bind policy has been used yet? This patch moves the determination of the zone to apply policies to into the page allocator. We determine the zone while building the zonelist for nodes. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Christoph Lameter	1a93205bdf	[PATCH] mm: simplify build_zonelists_node by removing the case statement. Simplify build_zonelists_node by removing the case statement. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Con Kolivas	f3fe65122d	[PATCH] mm: add populated_zone() helper There are numerous places we check whether a zone is populated or not. Provide a helper function to check for populated zones and convert all checks for zone->present_pages. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Andrew Morton	80bfed904c	[PATCH] consolidate lru_add_drain() and lru_drain_cache() Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Rajesh Shah <rajesh.shah@intel.com> Cc: Li Shaohua <shaohua.li@intel.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Andrew Morton	210fe53030	[PATCH] vmscan: balancing fix Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: The shrink_zone() logic can, under some circumstances, cause far too many pages to be reclaimed. Say, we're scanning at high priority and suddenly hit a large number of reclaimable pages on the LRU. Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. Problem is, this change caused significant imbalance in inter-zone scan balancing by truncating scans of larger zones. Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone balancing algorithm would require that if we're scanning 100 pages of ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are reclaimed. Thus effectively causing smaller zones to be scanned relatively harder than large ones. Now I need to remember what the workload was which caused me to write this patch originally, then fix it up in a different way... Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:27 -08:00
Nick Piggin	41e9b63b35	[PATCH] mm: pfault optimisation This atomic operation is superfluous: the pte will be added with the referenced bit set, and the page will be referenced through this mapping after the page fault handler returns anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:27 -08:00
Nick Piggin	9617d95e6e	[PATCH] mm: rmap optimisation Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:27 -08:00
Nick Piggin	224abf92b2	[PATCH] mm: bad_page optimisation Cut down size slightly by not passing bad_page the function name (it should be able to be determined by dump_stack()). And cut down the number of printks in bad_page. Also, cut down some branching in the destroy_compound_page path. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Nick Piggin	9328b8faae	[PATCH] mm: dma32 zone statistics Add dma32 to zone statistics. Also attempt to arrange struct page_state a bit better (visually). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Andrew Morton	7756b9e4e3	[PATCH] kill last zone_reclaim() bits Remove the last bits of Martin's ill-fated sys_set_zone_reclaim(). Cc: Martin Hicks <mort@wildopensource.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Nikita Danilov	bbfbb7cec9	[PATCH] find_lock_page(): call __lock_page() directly. As find_lock_page() already checks with TestSetPageLocked() that page is locked, there is no need to call lock_page() that will try-lock page again (chances of page being unlocked in between are small). Call __lock_page() directly, this saves one atomic operation. Also, mark truncate-while-slept path as unlikely while we are here. (akpm: ug. But this is actually a common path for normal old read()s against a page which is under readahead I/O so ho-hum.) Signed-off-by: Nikita Danilov <danilov@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
David Howells	a226f6c899	[PATCH] FRV: Clean up bootmem allocator's page freeing algorithm The attached patch cleans up the way the bootmem allocator frees pages. A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c that is called from mm/bootmem.c to turn pages over to the main allocator. All the bits of code to initialise pages (clearing PG_reserved and setting the page count) are moved to here. The checks on page validity are removed, on the assumption that the struct page arrays will have been prepared correctly. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Ravikiran G Thirumalai	008857c1a4	[PATCH] Cleanup bootmem allocator and fix alloc_bootmem_low Patch cleans up the alloc_bootmem fix for swiotlb. Patch removes alloc_bootmem__limit api and fixes alloc_boot_low api to do the right thing -- allocate from low32 memory. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Nick Piggin	085cc7d5de	[PATCH] mm: page_alloc cleanups Small cleanups that does not change generated code with the gcc's I've tested with. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	a86b1f5316	[PATCH] mm: page_state fixes read_page_state and __get_page_state only traverse online CPUs, which will cause results to fluctuate when CPUs are plugged in or out. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	2d92c5c915	[PATCH] mm: remove pcp low struct per_cpu_pages.low is useless. Remove it. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	13e7444b0e	[PATCH] mm: remove bad_range bad_range is supposed to be a temporary check. It would be a pity to throw it out. Make it depend on CONFIG_DEBUG_VM instead. CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid in the page allocator. Add that to page_is_buddy instead. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	92be2e33b1	[PATCH] mm: microopt conditions Micro optimise some conditionals where we don't need lazy evaluation. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	77a8a78834	[PATCH] mm: set_page_refs opt Inline set_page_refs. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	c54ad30c78	[PATCH] mm: pagealloc opt Slightly optimise some page allocation and freeing functions by taking advantage of knowing whether or not interrupts are disabled. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Hugh Dickins	c484d41042	[PATCH] mm: free_pages_and_swap_cache opt Minor optimization (though it doesn't help in the PREEMPT case, severely constrained by small ZAP_BLOCK_SIZE). free_pages_and_swap_cache works in chunks of 16, calling release_pages which works in chunks of PAGEVEC_SIZE. But PAGEVEC_SIZE was dropped from 16 to 14 in 2.6.10, so we're now doing more spin_lock_irq'ing than necessary: use PAGEVEC_SIZE throughout. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:24 -08:00
Mike Kravetz	a94b3ab7ea	[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM could handle pSeries numa layouts. However, support for DISCONTIGMEM has been replaced by SPARSEMEM on powerpc. As a result, this config option and supporting code is no longer needed. I have already sent a patch to Paul that removes the option from powerpc specific code. This removes the arch independent piece. Doesn't really matter which is applied first. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:24 -08:00
Christoph Lameter	6bda666a03	[PATCH] hugepages: fold find_or_alloc_pages into huge_no_page() The number of parameters for find_or_alloc_page increases significantly after policy support is added to huge pages. Simplify the code by folding find_or_alloc_huge_page() into hugetlb_no_page(). Adam Litke objected to this piece in an earlier patch but I think this is a good simplification. Diffstat shows that we can get rid of almost half of the lines of find_or_alloc_page(). If we can find no consensus then lets simply drop this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Christoph Lameter	21abb1478a	[PATCH] Remove old node based policy interface from mempolicy.c mempolicy.c contains provisional interface for huge page allocation based on node numbers. This is in use in SLES9 but was never used (AFAIK) in upstream versions of Linux. Huge page allocations now use zonelists to figure out where to allocate pages. The use of zonelists allows us to find the closest hugepage which was the consideration of the NUMA distance for huge page allocations. Remove the obsolete functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Christoph Lameter	5da7ca8607	[PATCH] Add NUMA policy support for huge pages. The huge_zonelist() function in the memory policy layer provides an list of zones ordered by NUMA distance. The hugetlb layer will walk that list looking for a zone that has available huge pages but is also in the nodeset of the current cpuset. This patch does not contain the folding of find_or_alloc_huge_page() that was controversial in the earlier discussion. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Christoph Lameter	96df9333c9	[PATCH] mm: dequeue a huge page near to this node This was discussed at http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2 This patch changes the dequeueing to select a huge page near the node executing instead of always beginning to check for free nodes from node 0. This will result in a placement of the huge pages near the executing processor improving performance. The existing implementation can place the huge pages far away from the executing processor causing significant degradation of performance. The search starting from zero also means that the lower zones quickly run out of memory. Selecting a huge page near the process distributed the huge pages better. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
David Gibson	1e8f889b10	[PATCH] Hugetlb: Copy on Write support Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be supported. This helps us to safely use hugetlb pages in many more applications. The patch makes the following changes. If needed, I also have it broken out according to the following paragraphs. 1. Add a pair of functions to set/clear write access on huge ptes. The writable check in make_huge_pte is moved out to the caller for use by COW later. 2. Hugetlb copy-on-write requires special case handling in the following situations: - copy_hugetlb_page_range() - Copied pages must be write protected so a COW fault will be triggered (if necessary) if those pages are written to. - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page cache. MAP_PRIVATE pages still need to be locked however. 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page() which handles the COW fault by making the actual copy. 4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps will be allowed. Make MAP_HUGETLB exempt from the depricated VM_RESERVED mapping check. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Adam Litke	86e5216f8d	[PATCH] Hugetlb: Reorganize hugetlb_fault to prepare for COW This patch splits the "no_page()" type activity into its own function, hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults and delegates to the appropriate handler depending on the type of fault. Right now we still have only hugetlb_no_page() but a later patch introduces a COW fault. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Adam Litke	85ef47f74a	[PATCH] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page find_lock_huge_page() isn't a great name, since it does extra things not analagous to find_lock_page(). Rename it find_or_alloc_huge_page() which is closer to the mark. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Adam Litke	f0916794f0	[PATCH] Hugetlb: Remove duplicate i_size check cleanup Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Badari Pulavarty	f6b3ec238d	[PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Hans Reiser	d7339071f6	[PATCH] reiser4: vfs: add truncate_inode_pages_range() This patch makes truncate_inode_pages_range from truncate_inode_pages. truncate_inode_pages became a one-liner call to truncate_inode_pages_range. Reiser4 needs truncate_inode_pages_ranges because it tries to keep correspondence between existences of metadata pointing to data pages and pages to which those metadata point to. So, when metadata of certain part of file is removed from filesystem tree, only pages of corresponding range are to be truncated. (Needed by the madvise(MADV_REMOVE) patch) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Andy Whitcroft	5ac24eefd1	[PATCH] memhotplug: __add_section remove unused pgdat definition __add_section defines an unused pointer to the zones pgdat. Remove this definition. This fixes a compile warning. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:21 -08:00
Paul Jackson	47f3a867f6	[PATCH] mm: fix __alloc_pages cpuset ALLOC_* flags Two changes to the setting of the ALLOC_CPUSET flag in mm/page_alloc.c:__alloc_pages() - A bug fix - the "ignoring mins" case should not be honoring ALLOC_CPUSET. This case of all cases, since it is handling a request that will free up more memory than is asked for (exiting tasks, e.g.) should be allowed to escape cpuset constraints when memory is tight. - A logic change to make it simpler. Honor cpusets even on GFP_ATOMIC (!wait) requests. With this, cpuset confinement applies to all requests except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch, I can remove the ALLOC_CPUSET flag entirely. Since I don't know any real reason this logic has to be either way, I am choosing the path of the simplest code. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:21 -08:00
Zach Brown	994fc28c7b	[PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE readpage(), prepare_write(), and commit_write() callers are updated to understand the special return code AOP_TRUNCATED_PAGE in the style of writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that the callee has unlocked the page and that the operation should be tried again with a new page. OCFS2 uses this to detect and work around a lock inversion in its aop methods. There should be no change in behaviour for methods that don't return AOP_TRUNCATED_PAGE. WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are made enums so that kerneldoc can be used to document their semantics. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2006-01-03 11:45:42 -08:00
Andi Kleen	8f493d797b	[PATCH] Make sure interleave masks have at least one node set Otherwise a bad mem policy system call can confuse the interleaving code into referencing undefined nodes. Originally reported by Doug Chapman I was told it's CVE-2005-3358 (one has to love these security people - they make everything sound important) Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-02 17:01:42 -08:00
Linus Torvalds	4d7672b462	Make sure we copy pages inserted with "vm_insert_page()" on fork The logic that decides that a fork() might be able to avoid copying a VM area when it can be re-created by page faults didn't know about the new vm_insert_page() case. Also make some things a bit more anal wrt VM_PFNMAP. Pointed out by Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-16 10:21:23 -08:00
Al Viro	78d9955bb0	[PATCH] missing prototype (mm/page_alloc.c) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-15 10:04:30 -08:00
Yasunori Goto	118c71bcac	[PATCH] Fix calculation of grow_pgdat_span() in mm/memory_hotplug.c The calculation for node_spanned_pages at grow_pgdat_span() is clearly wrong. This is patch for it. (Please see grow_zone_span() to compare. It is correct.) Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-13 21:18:16 -08:00
Linus Torvalds	1ff8038988	get_user_pages: don't try to follow PFNMAP pages Nick Piggin points out that a few drivers play games with VM_IO (why? who knows..) and thus a pfn-remapped area may not have that bit set even if remap_pfn_range() set it originally. So make it explicit in get_user_pages() that we don't follow VM_PFNMAP pages, since pretty much by definition they do not have a "struct page" associated with them. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-12 16:24:33 -08:00
Haren Myneni	66d43e98ea	[PATCH] fix in __alloc_bootmem_core() when there is no free page in first node's memory Hitting BUG_ON() in __alloc_bootmem_core() when there is no free page available in the first node's memory. For the case of kdump on PPC64 (Power 4 machine), the captured kernel is used two memory regions - memory for TCE tables (tce-base and tce-size at top of RAM and reserved) and captured kernel memory region (crashk_base and crashk_size). Since we reserve the memory for the first node, we should be returning from __alloc_bootmem_core() to search for the next node (pg_dat). Currently, find_next_zero_bit() is returning the n^th bit (eidx) when there is no free page. Then, test_bit() is failed since we set 0xff only for the actual size initially (init_bootmem_core()) even though rounded up to one page for bdata->node_bootmem_map. We are hitting the BUG_ON after failing to enter second "for" loop. Signed-off-by: Haren Myneni <haren@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-12 08:57:45 -08:00
Linus Torvalds	67121172f9	Allow arbitrary read-only shared pfn-remapping too The VM layer (for historical reasons) turns a read-only shared mmap into a private-like mapping with the VM_MAYWRITE bit clear. Thus checking just VM_SHARED isn't actually sufficient. So use a trivial helper function for the cases where we wanted to inquire if a mapping was COW-like or not. Moo! Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-11 20:38:17 -08:00
Linus Torvalds	7fc7e2eeec	Remove (at least temporarily) the "incomplete PFN mapping" support With the previous commit, we can handle arbitrary shared re-mappings even without this complexity, and since the only known private mappings are for strange users of /dev/mem (which never create an incomplete one), there seems to be no reason to support it. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-11 19:57:52 -08:00
Linus Torvalds	fb155c1619	Allow arbitrary shared PFNMAP's A shared mapping doesn't cause COW-pages, so we don't need to worry about the whole vm_pgoff logic to decide if a PFN-remapped page has gone through COW or not. This makes it possible to entirely avoid the special "partial remapping" logic for the common case. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-11 19:46:02 -08:00
Linus Torvalds	e3c3374fbf	Make vm_insert_page() available to NVidia module It used to use remap_pfn_range(), which wasn't GPL-only either, and the new interface is actually simpler and does more checking, so we shouldn't unnecessarily discourage people from switching over. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-03 20:48:11 -08:00
Nick Piggin	0ceaacc978	[PATCH] Fix up per-cpu page batch sizes The code to clamp batch sizes to 2^n - 1 went missing and an extra check got added, which must have been a hunk of the "higer order pcp batch refills" work sneaking in. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-03 20:46:40 -08:00
Linus Torvalds	a145dd411e	VM: add "vm_insert_page()" function This is what a lot of drivers will actually want to use to insert individual pages into a user VMA. It doesn't have the old PageReserved restrictions of remap_pfn_range(), and it doesn't complain about partial remappings. The page you insert needs to be a nice clean kernel allocation, so you can't insert arbitrary page mappings with this, but that's not what people want. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-30 09:35:19 -08:00
Trond Myklebust	49c91fb01f	[PATCH] VM: Fix typos in get_locked_pte Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 17:29:57 -08:00
Hugh Dickins	325f04dbca	[PATCH] pfnmap: do_no_page BUG_ON again Use copy_user_highpage directly instead of cow_user_page in do_no_page: in the immediately following page_cache_release, and elsewhere, it is assuming that new_page is normal. If any VM_PFNMAP driver can get to do_no_page, it's just a BUG (but not in the case of do_anonymous_page). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:09:17 -08:00
Hugh Dickins	e5bbe4dfc8	[PATCH] pfnmap: remove src_page from do_wp_page Clean away do_wp_page's "src_page": cow_user_page makes it unnecessary. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:09:16 -08:00
Linus Torvalds	5d2a2dbbc1	cow_user_page: fix page alignment High Dickins points out that the user virtual address passed to the page fault handler isn't necessarily page-aligned. Also, add a comment on why the copy could fail for the user address case. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:07:55 -08:00
Linus Torvalds	c9cfcddfd6	VM: add common helper function to create the page tables This logic was duplicated four times, for no good reason. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:03:14 -08:00
Linus Torvalds	238f58d898	Support strange discontiguous PFN remappings These get created by some drivers that don't generally even want a pfn remapping at all, but would really mostly prefer to just map pages they've allocated individually instead. For now, create a helper function that turns such an incomplete PFN remapping call into a loop that does that explicit mapping. In the long run we almost certainly want to export a totally different interface for that, though. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 13:01:56 -08:00
Ben Collins	eca351336a	[PATCH] Fix missing pfn variables caused by vm changes I image this showed up because of "unused var..." when the changes occured, because flush_cache_page() is a noop in most places. This showed up for me on parisc however, where flush_cache_page() is a real function. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 12:57:17 -08:00
Nick Piggin	fa2a455b02	[PATCH] Fix vma argument in get_usr_pages() for gate areas The system call gate area handling called vm_normal_page() with the wrong vma (which was always NULL, and caused an oops). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 07:53:32 -08:00
Andrea Arcangeli	ea164d73a7	[PATCH] shrinker->nr = LONG_MAX means deadlock for icache With Andrew Morton <akpm@osdl.org> The slab scanning code tries to balance the scanning rate of slabs versus the scanning rate of LRU pages. To do this, it retains state concerning how many slabs have been scanned - if a particular slab shrinker didn't scan enough objects, we remember that for next time, and scan more objects on the next pass. The problem with this is that with (say) a huge number of GFP_NOIO direct-reclaim attempts, the number of objects which are to be scanned when we finally get a GFP_KERNEL request can be huge. Because some shrinker handlers just bail out if !__GFP_FS. So the patch clamps the number of objects-to-be-scanned to 2* the total number of objects in the slab cache. Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:26 -08:00
Rik van Riel	f7b7fd8f3e	[PATCH] temporarily disable swap token on memory pressure Some users (hi Zwane) have seen a problem when running a workload that eats nearly all of physical memory - th system does an OOM kill, even when there is still a lot of swap free. The problem appears to be a very big task that is holding the swap token, and the VM has a very hard time finding any other page in the system that is swappable. Instead of ignoring the swap token when sc->priority reaches 0, we could simply take the swap token away from the memory hog and make sure we don't give it back to the memory hog for a few seconds. This patch resolves the problem Zwane ran into. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:25 -08:00
Nick Piggin	3148890bfa	[PATCH] mm: __alloc_pages cleanup fix I believe this patch is required to fix breakage in the asynch reclaim watermark logic introduced by this patch: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee Just some background of the watermark logic in case it isn't clear... Basically what we have is this: --- pages_high \| \| (a) \| --- pages_low \| \| (b) \| --- pages_min \| \| (c) \| --- 0 Now when pages_low is reached, we want to kick asynch reclaim, which gives us an interval of "b" before we must start synch reclaim, and gives kswapd an interval of "a" before it need go back to sleep. When pages_min is reached, normal allocators must enter synch reclaim, but PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie. atomic allocations, recursive allocations, etc.) get access to varying amounts of the reserve "c". Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:24 -08:00
Alan Stern	e0f39591cc	[PATCH] Workaround for gcc 2.96 (undefined references) LD .tmp_vmlinux1 mm/built-in.o(.text+0x100d6): In function `copy_page_range': : undefined reference to `__pud_alloc' mm/built-in.o(.text+0x1010b): In function `copy_page_range': : undefined reference to `__pmd_alloc' mm/built-in.o(.text+0x11ef4): In function `__handle_mm_fault': : undefined reference to `__pud_alloc' fs/built-in.o(.text+0xc930): In function `install_arg_page': : undefined reference to `__pud_alloc' make: *** [.tmp_vmlinux1] Error 1 Those missing references in mm/memory.c arise from this code in include/linux/mm.h, combined with the fact that __PGTABLE_PMD_FOLDED and __PGTABLE_PUD_FOLDED are both set and __ARCH_HAS_4LEVEL_HACK is not: /* * The following ifdef needed to get the 4level-fixup.h header to work. * Remove it when 4level-fixup.h has been removed. / #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK) static inline pud_t pud_alloc(struct mm_struct mm, pgd_t pgd, unsigned long address) { return (unlikely(pgd_none(pgd)) && __pud_alloc(mm, pgd, address))? NULL: pud_offset(pgd, address); } static inline pmd_t pmd_alloc(struct mm_struct mm, pud_t pud, unsigned long address) { return (unlikely(pud_none(pud)) && __pmd_alloc(mm, pud, address))? NULL: pmd_offset(pud, address); } #endif / CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */ With my configuration the pgd_none and pud_none routines are inlines returning a constant 0. Apparently the old compiler avoids generating calls to __pud_alloc and __pmd_alloc but still lists them as undefined references in the module's symbol table. I don't know which change caused this problem. I think it was added somewhere between 2.6.14 and 2.6.15-rc1, because I remember building several 2.6.14-rc kernels without difficulty. However I can't point to an individual culprit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:22 -08:00
Linus Torvalds	6aab341e0a	mm: re-architect the VM_UNPAGED logic This replaces the (in my opinion horrible) VM_UNMAPPED logic with very explicit support for a "remapped page range" aka VM_PFNMAP. It allows a VM area to contain an arbitrary range of page table entries that the VM never touches, and never considers to be normal pages. Any user of "remap_pfn_range()" automatically gets this new functionality, and doesn't even have to mark the pages reserved or indeed mark them any other way. It just works. As a side effect, doing mmap() on /dev/mem works for arbitrary ranges. Sparc update from David in the next commit. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:34:23 -08:00
Oleg Drokin	479ef592f3	[PATCH] 32bit integer overflow in invalidate_inode_pages2() Fix a 32 bit integer overflow in invalidate_inode_pages2_range. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-23 16:08:39 -08:00
Hugh Dickins	7b6ac9dffe	[PATCH] mm: update split ptlock Kconfig Closer attention to the arithmetic shows that neither ppc64 nor sparc really uses one page for multiple page tables: how on earth could they, while pte_alloc_one returns just a struct page pointer, with no offset? Well, arm26 manages it by returning a pte_t pointer cast to a struct page pointer, harumph, then compensating in its pmd_populate. But arm26 is never SMP, so it's not a problem for split ptlock either. And the PA-RISC situation has been recently improved: CONFIG_PA20 works without the 16-byte alignment which inflated its spinlock_t. But the current union of spinlock_t with private does make the 7xxx struct page significantly larger, even without debug, so disable its split ptlock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-23 16:08:38 -08:00
Eric Paris	0bd0f9fb19	[PATCH] hugetlb: fix race in set_max_huge_pages for multiple updaters of nr_huge_pages If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously it is possible for the nr_huge_pages variable to become incorrect. There is no locking in the set_max_huge_pages function around alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers to alloc_fresh_huge_page could race against each other as could a call to alloc_fresh_huge_page and a call to update_and_free_page. This patch just expands the area covered by the hugetlb_lock to cover the call into alloc_fresh_huge_page. I'm not sure how we could say that a sysctl section is performance critical where more specific locking would be needed. My reproducer was to run a couple copies of the following script simultaneously while [ true ]; do echo 1000 > /proc/sys/vm/nr_hugepages echo 500 > /proc/sys/vm/nr_hugepages echo 750 > /proc/sys/vm/nr_hugepages echo 100 > /proc/sys/vm/nr_hugepages echo 0 > /proc/sys/vm/nr_hugepages done and then watch /proc/meminfo and eventually you will see things like HugePages_Total: 100 HugePages_Free: 109 After applying the patch all seemed well. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:43 -08:00
Hugh Dickins	689bcebfda	[PATCH] unpaged: PG_reserved bad_page It used to be the case that PG_reserved pages were silently never freed, but in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should work through such cases as they appear, fixing the code; but for now it's safer to issue the message without freeing the page, leaving PG_reserved set. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	f57e88a8d8	[PATCH] unpaged: ZERO_PAGE in VM_UNPAGED It's strange enough to be looking out for anonymous pages in VM_UNPAGED areas, let's not insert the ZERO_PAGE there - though whether it would matter will depend on what we decide about ZERO_PAGE refcounting. But whereas do_anonymous_page may (exceptionally) be called on a VM_UNPAGED area, do_no_page should never be: just BUG_ON. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	ee498ed730	[PATCH] unpaged: anon in VM_UNPAGED copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area, zap_pte_range needs to free them, do_wp_page needs to COW them: just like ordinary pages, not like the unpaged. But recognizing them is a little subtle: because PageReserved is no longer a condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the distro permits, and whether it's advisable on this or that architecture, is another matter). So if we can see a PageAnon, it may not be ours to mess with (or may be ours from elsewhere in the address space). I suspect there's an entertaining insoluble self-referential problem here, but the page_is_anon function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will always be an odd choice. In updating the comment on page_address_in_vma, noticed a potential NULL dereference, in a path we don't actually take, but fixed it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	920fc356f5	[PATCH] unpaged: COW on VM_UNPAGED Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let it do Copy-On-Write without touching the VM_UNPAGED's page counts - but this is incomplete, because the anonymous page it inserts will itself need to be handled, here and in other functions - next patch. We still don't copy the page if the pfn is invalid, because the copy_user_highpage interface does not allow it. But that's not been a problem in the past: can be added in later if the need arises. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	101d2be764	[PATCH] unpaged: VM_NONLINEAR VM_RESERVED There's one peculiar use of VM_RESERVED which the previous patch left behind: because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout cursor, but should never meet VM_RESERVED vmas, it was a way of extending VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose. But that's an empty set - they don't have the populate function required. So just throw away those VM_RESERVED tests. But one more interesting in rmap.c has to go too: try_to_unmap_one will want to swap out an anonymous page from VM_RESERVED or VM_UNPAGED area. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	0b14c179a4	[PATCH] unpaged: VM_UNPAGED Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few drivers set VM_RESERVED on areas which are then populated by nopage. The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in zap_pte_range, without changing those drivers not to set it: so their pages just leak away. Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core, to flag the special areas where the ptes may have no struct page, or if they have then it's not to be touched. Replace most instances of VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and sparc64 io_remap_pfn_range. Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it needed anywhere? It still governs the mm->reserved_vm statistic, and special vmas not to be merged, and areas not to be core dumped; but could probably be eliminated later (the drivers are probably specifying it because in 2.4 it kept swapout off the vma, but in 2.6 we work from the LRU, which these pages don't get on). Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no purpose whatsoever, and should be removed from drivers when we clean up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	664beed019	[PATCH] unpaged: unifdefed PageCompound It looks like snd_xxx is not the only nopage to be using PageReserved as a way of holding a high-order page together: which no longer works, but is masked by our failure to free from VM_RESERVED areas. We cannot fix that bug without first substituting another way to hold the high-order page together, while farming out the 0-order pages from within it. That's just what PageCompound is designed for, but it's been kept under CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out- of-line put_page), doesn't slow down what most needs to be fast (already using hugetlb), and unifies the way we handle high-order pages. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	83e9b7e929	[PATCH] unpaged: private write VM_RESERVED The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED, and failed with -EACCES: because do_wp_page lacks the refinement to COW pages in those areas, nor do we expect to find anonymous pages in them; and it seemed just bloat to add code for handling such a peculiar case. But immediately it caused vbetool and ddcprobe (using lrmi) to fail. So revert the "deprecated" messages, letting mmap and mprotect succeed. But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've added the code to do it right: so this particular patch is only good if the app doesn't really need to write to that private area. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00

... 3 4 5 6 7 ...

758 Commits