Commit Graph

1790 Commits

Author SHA1 Message Date
Nick Piggin 95c354fe9f spinlock: lockbreak cleanup
The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock but it changes locking to
a potentially less optimal trylock.

Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.

Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:20 +01:00
Jiri Kosina c1d171a002 x86: randomize brk
Randomize the location of the heap (brk) for i386 and x86_64.  The range is
randomized in the range starting at current brk location up to 0x02000000
offset for both architectures.  This, together with
pie-executable-randomization.patch and
pie-executable-randomization-fix.patch, should make the address space
randomization on i386 and x86_64 complete.

Arjan says:

This is known to break older versions of some emacs variants, whose dumper
code assumed that the last variable declared in the program is equal to the
start of the dynamically allocated memory region.

(The dumper is the code where emacs effectively dumps core at the end of it's
compilation stage; this coredump is then loaded as the main program during
normal use)

iirc this was 5 years or so; we found this way back when I was at RH and we
first did the security stuff there (including this brk randomization).  It
wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
that emacs already had code to deal with it for other archs/oses, just
ifdeffed wrongly).

It's a rare and wrong assumption as a general thing, just on x86 it mostly
happened to be true (but to be honest, it'll break too if gcc does
something fancy or if the linker does a non-standard order).  Still its
something we should at least document.

Note 2: afaik it only broke the emacs *build*.  I'm not 100% sure about that
(it IS 5 years ago) though.

[ akpm@linux-foundation.org: deuglification ]

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:40 +01:00
Paul Mundt d5f68c6dbd sh: Bump number of quicklists for SH-5.
Sync up with the SH definitions.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2008-01-28 13:18:55 +09:00
Peter Zijlstra fa717060f1 sched: sched_rt_entity
Move the task_struct members specific to rt scheduling together.
A future optimization could be to put sched_entity and sched_rt_entity
into a union.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:27 +01:00
Gautham R Shenoy 95402b3829 cpu-hotplug: replace per-subsystem mutexes with get_online_cpus()
This patch converts the known per-subsystem mutexes to get_online_cpus
put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and
CPU_LOCK_RELEASE hotplug notification events.

Signed-off-by: Gautham  R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:02 +01:00
Linus Torvalds b47711bfbc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
  selinux: make mls_compute_sid always polyinstantiate
  security/selinux: constify function pointer tables and fields
  security: add a secctx_to_secid() hook
  security: call security_file_permission from rw_verify_area
  security: remove security_sb_post_mountroot hook
  Security: remove security.h include from mm.h
  Security: remove security_file_mmap hook sparse-warnings (NULL as 0).
  Security: add get, set, and cloning of superblock security information
  security/selinux: Add missing "space"
2008-01-25 08:44:29 -08:00
Linus Torvalds df8dc74e8a Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6
This can be broken down into these major areas:
 - Documentation updates (language translations and fixes, as
   well as kobject and kset documenatation updates.)
 - major kset/kobject/ktype rework and fixes.  This cleans up the
   kset and kobject and ktype relationship and architecture,
   making sense of things now, and good documenation and samples
   are provided for others to use.  Also the attributes for
   kobjects are much easier to handle now.  This cleaned up a LOT
   of code all through the kernel, making kobjects easier to use
   if you want to.
 - struct bus_type has been reworked to now handle the lifetime
   rules properly, as the kobject is properly dynamic.
 - struct driver has also been reworked, and now the lifetime
   issues are resolved.
 - the block subsystem has been converted to use struct device
   now, and not "raw" kobjects.  This patch has been in the -mm
   tree for over a year now, and finally all the issues are
   worked out with it.  Older distros now properly work with new
   kernels, and no userspace updates are needed at all.
 - nozomi driver is added.  This has also been in -mm for a long
   time, and many people have asked for it to go in.  It is now
   in good enough shape to do so.
 - lots of class_device conversions to use struct device instead.
   The tree is almost all cleaned up now, only SCSI and IB is the
   remaining code to fix up...

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (196 commits)
  Driver core: coding style fixes
  Kobject: fix coding style issues in kobject c files
  Kobject: fix coding style issues in kobject.h
  Driver core: fix coding style issues in device.h
  spi: use class iteration api
  scsi: use class iteration api
  rtc: use class iteration api
  power supply : use class iteration api
  ieee1394: use class iteration api
  Driver Core: add class iteration api
  Driver core: Cleanup get_device_parent() in device_add() and device_move()
  UIO: constify function pointer tables
  Driver Core: constify the name passed to platform_device_register_simple
  driver core: fix build with SYSFS=n
  sysfs: make SYSFS_DEPRECATED depend on SYSFS
  Driver core: use LIST_HEAD instead of call to INIT_LIST_HEAD in __init
  kobject: add sample code for how to use ksets/ktypes/kobjects
  kobject: add sample code for how to use kobjects in a simple manner.
  kobject: update the kobject/kset documentation
  kobject: remove old, outdated documentation.
  ...
2008-01-25 08:35:13 -08:00
Pekka Enberg 556a169dab slab: fix bootstrap on memoryless node
If the node we're booting on doesn't have memory, bootstrapping kmalloc()
caches resorts to fallback_alloc() which requires ->nodelists set for all
nodes.  Fix that by calling set_up_list3s() for CACHE_CACHE in
kmem_cache_init().

As kmem_getpages() is called with GFP_THISNODE set, this used to work before
because of breakage in 2.6.22 and before with GFP_THISNODE returning pages from
the wrong node if a node had no memory. So it may have worked accidentally and
in an unsafe manner because the pages would have been associated with the wrong
node which could trigger bug ons and locking troubles.

Tested-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
[ With additional one-liner by Olaf Hering  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-25 08:30:36 -08:00
Greg Kroah-Hartman 1eada11c88 Kobject: convert mm/slub.c to use kobject_init/add_ng()
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:31 -08:00
Greg Kroah-Hartman 0ff21e4663 kobject: convert kernel_kset to be a kobject
kernel_kset does not need to be a kset, but a much simpler kobject now
that we have kobj_attributes.

We also rename kernel_kset to kernel_kobj to catch all users of this
symbol with a build error instead of an easy-to-ignore build warning.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:24 -08:00
Greg Kroah-Hartman 081248de0a kset: move /sys/slab to /sys/kernel/slab
/sys/kernel is where these things should go.
Also updated the documentation and tool that used this directory.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:16 -08:00
Greg Kroah-Hartman 27c3a314d5 kset: convert slub to use kset_create
Dynamically create the kset instead of declaring it statically.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:15 -08:00
Greg Kroah-Hartman 3514faca19 kobject: remove struct kobj_type from struct kset
We don't need a "default" ktype for a kset.  We should set this
explicitly every time for each kset.  This change is needed so that we
can make ksets dynamic, and cleans up one of the odd, undocumented
assumption that the kset/kobject/ktype model has.

This patch is based on a lot of help from Kay Sievers.

Nasty bug in the block code was found by Dave Young
<hidave.darkstar@gmail.com>

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:10 -08:00
Richard Knutsson 88c3f7a8f2 Security: remove security_file_mmap hook sparse-warnings (NULL as 0).
Fixing:
  CHECK   mm/mmap.c
mm/mmap.c:1623:29: warning: Using plain integer as NULL pointer
mm/mmap.c:1623:29: warning: Using plain integer as NULL pointer
mm/mmap.c:1944:29: warning: Using plain integer as NULL pointer

Signed-off-by: Richard Knutsson <ricknu-0@student.ltu.se>
Signed-off-by: James Morris <jmorris@namei.org>
2008-01-25 11:29:48 +11:00
Mel Gorman 9c09a95cf4 slab: partially revert list3 changes
Partial revert the changes made by 04231b3002
to the kmem_list3 management. On a machine with a memoryless node, this
BUG_ON was triggering

	static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
	{
		struct list_head *entry;
		struct slab *slabp;
		struct kmem_list3 *l3;
		void *obj;
		int x;

		l3 = cachep->nodelists[nodeid];
		BUG_ON(!l3);

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-24 08:07:27 -08:00
Larry Woodman c5c99429fa fix hugepages leak due to pagetable page sharing
The shared page table code for hugetlb memory on x86 and x86_64
is causing a leak.  When a user of hugepages exits using this code
the system leaks some of the hugepages.

-------------------------------------------------------
Part of /proc/meminfo just before database startup:
HugePages_Total:  5500
HugePages_Free:   5500
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

Just before shutdown:
HugePages_Total:  5500
HugePages_Free:   4475
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

After shutdown:
HugePages_Total:  5500
HugePages_Free:   4988
HugePages_Rsvd:
0 Hugepagesize:     2048 kB
----------------------------------------------------------

The problem occurs durring a fork, in copy_hugetlb_page_range().  It
locates the dst_pte using huge_pte_alloc().  Since huge_pte_alloc() calls
huge_pmd_share() it will share the pmd page if can, yet the main loop in
copy_hugetlb_page_range() does a get_page() on every hugepage.  This is a
violation of the shared hugepmd pagetable protocol and creates additional
referenced to the hugepages causing a leak when the unmap of the VMA
occurs.  We can skip the entire replication of the ptes when the hugepage
pagetables are shared.  The attached patch skips copying the ptes and the
get_page() calls if the hugetlbpage pagetable is shared.

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-24 08:07:27 -08:00
Anton Salikhmetov 8f7b3d156d Update ctime and mtime for memory-mapped files
Update ctime and mtime for memory-mapped files at a write access on
a present, read-only PTE, as well as at a write on a non-present PTE.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-23 09:58:55 -08:00
Carsten Otte 9723198c21 #ifdef very expensive debug check in page fault path
This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
that verifies that a pfn is valid.  This patch increases performance of the
page fault microbenchmark in lmbench by 13% and overall dbench performance
by 7% on s390x.  pfn_valid() is an expensive operation on s390 that needs a
high double digit amount of CPU cycles.  Nick Piggin suggested that
pfn_valid() involves an array lookup on systems with sparsemem, and
therefore is an expensive operation there too.

The check looks like a clear debug thing to me, it should never trigger on
regular kernels.  And if a pte is created for an invalid pfn, we'll find
out once the memory gets accessed later on anyway.  Please consider
inclusion of this patch into mm.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-17 15:38:59 -08:00
Sam Ravnborg 1d6f4e60e7 mm: fix section mismatch warning in page_alloc.c
With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw
following warning:
WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback')

The culprit was zone_batchsize() which were annotated __devinit but used
from process_zones() which is annotated __cpuinit.  zone_batchsize() are
used from another function annotated __meminit so the only valid option is
to drop the annotation of zone_batchsize() so we know it is always valid to
use it.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-17 15:38:58 -08:00
Linus Torvalds c23f72cae9 Revert "writeback: introduce writeback_control.more_io to indicate more io"
This reverts commit 2e6883bdf4, as
requested by Fengguang Wu.  It's not quite fully baked yet, and while
there are patches around to fix the problems it caused, they should get
more testing.  Says Fengguang: "I'll resend them both for -mm later on,
in a more complete patchset".

See

	http://bugzilla.kernel.org/show_bug.cgi?id=9738

for some of this discussion.

Requested-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-14 21:21:29 -08:00
Ken Chen 68842c9b94 hugetlbfs: fix quota leak
In the error path of both shared and private hugetlb page allocation,
the file system quota is never undone, leading to fs quota leak.  Fix
them up.

[akpm@linux-foundation.org: cleanup, micro-optimise]
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-14 08:52:23 -08:00
Christoph Lameter 96990a4ae9 quicklists: Only consider memory that can be used with GFP_KERNEL
Quicklists calculates the size of the quicklists based on the number of
free pages.  This must be the number of free pages that can be allocated
with GFP_KERNEL.  node_page_state() includes the pages in ZONE_HIGHMEM and
ZONE_MOVABLE which may lead the quicklists to become too large causing OOM.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-14 08:52:22 -08:00
Thomas Bogendoerfer 467bc461d2 Fix crash with FLAT_MEMORY and ARCH_PFN_OFFSET != 0
When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in
memmap_init_zone().  This bug got introduced by commit
c713216dee

Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-08 16:10:36 -08:00
Akinobu Mita c51b1a160b xip: fix get_zeroed_page with __GFP_HIGHMEM
The use of get_zeroed_page() with __GFP_HIGHMEM is invalid.  Use
alloc_page() with __GFP_ZERO instead of invalid get_zeroed_page().

(This patch is only compile tested)

Cc: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-08 16:10:36 -08:00
Linus Torvalds 158a962422 Unify /proc/slabinfo configuration
Both SLUB and SLAB really did almost exactly the same thing for
/proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.

This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
and SLAB, and shares all the setup code.  Maybe SLOB will want this some
day too.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-02 13:04:48 -08:00
Pekka J Enberg 57ed3eda97 slub: provide /proc/slabinfo
This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

[ mingo@elte.hu: build fix. ]

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-01 11:32:02 -08:00
Christoph Lameter 76be895001 SLUB: Improve hackbench speed
Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-21 15:51:07 -08:00
Linus Torvalds 3a6927906f Do dirty page accounting when removing a page from the page cache
Krzysztof Oledzki noticed a dirty page accounting leak on some of his
machines, causing the machine to eventually lock up when the kernel
decided that there was too much dirty data, but nobody could actually
write anything out to fix it.

The culprit turns out to be filesystems (cough ext3 with data=journal
cough) that re-dirty the page when the "->invalidatepage()" callback is
called.

Fix it up by doing a final dirty page accounting check when we actually
remove the page from the page cache.

This fixes bugzilla entry 9182:

	http://bugzilla.kernel.org/show_bug.cgi?id=9182

Tested-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Krzysztof Oledzki <olel@ans.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-19 14:05:13 -08:00
Christoph Lameter 3811dbf671 SLUB: remove useless masking of GFP_ZERO
Remove a recently added useless masking of GFP_ZERO.  GFP_ZERO is already
masked out in new_slab() (See how it calls allocate_slab).  No need to do
it twice.

This reverts the SLUB parts of 7fd272550b.

Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:17 -08:00
Nishanth Aravamudan 368d2c6358 Revert "hugetlb: Add hugetlb_dynamic_pool sysctl"
This reverts commit 54f9f80d65 ("hugetlb:
Add hugetlb_dynamic_pool sysctl")

Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
sysctl is not needed, as its semantics can be expressed by 0 in the
overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
(pool enabled).

(Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:17 -08:00
Nishanth Aravamudan d1c3fb1f8f hugetlb: introduce nr_overcommit_hugepages sysctl
hugetlb: introduce nr_overcommit_hugepages sysctl

While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:

1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.

2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.

To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition

	nr_overcommit_hugepages > 0

indicates the same administrative setting as

	hugetlb_dynamic_pool == 1

Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.

A few caveats:

1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has allocated. Another process could then
try grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.

2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls are increased
sufficiently, or the surplus huge pages go out of use and are freed.

Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:17 -08:00
Mel Gorman 81eabcbe0b mm: fix page allocation for larger I/O segments
In some cases the IO subsystem is able to merge requests if the pages are
adjacent in physical memory.  This was achieved in the allocator by having
expand() return pages in physically contiguous order in situations were a
large buddy was split.  However, list-based anti-fragmentation changed the
order pages were returned in to avoid searching in buffered_rmqueue() for a
page of the appropriate migrate type.

This patch restores behaviour of rmqueue_bulk() preserving the physical
order of pages returned by the allocator without incurring increased search
costs for anti-fragmentation.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Mark Lord <mlord@pobox.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
WANG Cong bbd0682596 mm/sparse.c: improve the error handling for sparse_add_one_section()
Improve the error handling for mm/sparse.c::sparse_add_one_section().  And I
see no reason to check 'usemap' until holding the 'pgdat_resize_lock'.

[geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST]
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
WANG Cong af0cd5a7c3 mm/sparse.c: check the return value of sparse_index_alloc()
Since sparse_index_alloc() can return NULL on memory allocation failure,
we must deal with the failure condition when calling it.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
Geoff Levand a5ee6daa52 sparsemem: make SPARSEMEM_VMEMMAP selectable
SPARSEMEM_VMEMMAP needs to be a selectable config option to support
building the kernel both with and without sparsemem vmemmap support.  This
selection is desirable for platforms which could be configured one way for
platform specific builds and the other for multi-platform builds.

Signed-off-by: Miguel Botón <mboton@gmail.com>
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
Adam Litke 72fad7139b hugetlb: handle write-protection faults in follow_hugetlb_page
The follow_hugetlb_page() fix I posted (merged as git commit
5b23dbe817) missed one case.  If the pte is
present, but not writable and write access is requested by the caller to
get_user_pages(), the code will do the wrong thing.  Rather than calling
hugetlb_fault to make the pte writable, it notes the presence of the pte
and continues.

This simple one-liner makes sure we also fault on the pte for this case.
Please apply.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-10 19:43:55 -08:00
Linus Torvalds 7fd272550b Avoid double memclear() in SLOB/SLUB
Both slob and slub react to __GFP_ZERO by clearing the allocation, which
means that passing the GFP_ZERO bit down to the page allocator is just
wasteful and pointless.

Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-09 10:17:52 -08:00
Linus Torvalds ad658cec23 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
  VM/Security: add security hook to do_brk
  Security: round mmap hint address above mmap_min_addr
  security: protect from stack expantion into low vm addresses
  Security: allow capable check to permit mmap or low vm space
  SELinux: detect dead booleans
  SELinux: do not clear f_op when removing entries
2007-12-05 09:26:52 -08:00
Eric Paris ecaf18c15a VM/Security: add security hook to do_brk
Given a specifically crafted binary do_brk() can be used to get low pages
available in userspace virtual memory and can thus be used to circumvent
the mmap_min_addr low memory protection.  Add security checks in do_brk().

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 09:21:21 -08:00
Vegard Nossum 294a80a8ed SLUB's ksize() fails for size > 2048
I can't pass memory allocated by kmalloc() to ksize() if it is allocated by
SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2.

The error of ksize() seems to be that it does not check if the allocation
was made by SLUB or the page allocator.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 09:21:20 -08:00
Nick Piggin 369b8f5a70 mm: fix XIP file writes
Writing to XIP files at a non-page-aligned offset results in data corruption
because the writes were always sent to the start of the page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 09:21:20 -08:00
Tetsuo Handa f8fcc93319 Add EXPORT_SYMBOL(ksize);
mm/slub.c exports ksize(), but mm/slob.c and mm/slab.c don't.

It's used by binfmt_flat, which can be built as a module.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 09:21:18 -08:00
Denis Cheng 4b01a0b161 mm/backing-dev.c: fix percpu_counter_destroy call bug in bdi_init
this call should use the array index j, not i.  But with this approach, just
one int i is enough, int j is not needed.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 09:21:18 -08:00
Eric Paris 5a211a5dea VM/Security: add security hook to do_brk
Given a specifically crafted binary do_brk() can be used to get low
pages available in userspace virtually memory and can thus be used to
circumvent the mmap_min_addr low memory protection.  Add security checks
in do_brk().

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2007-12-06 00:25:30 +11:00
Eric Paris 7cd94146cd Security: round mmap hint address above mmap_min_addr
If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
non-null hint address less than mmap_min_addr the mapping will fail the
security checks.  Since this is just a hint address this patch will round
such a hint address above mmap_min_addr.

gcj was found to try to be very frugal with vm usage and give hint addresses
in the 8k-32k range.  Without this patch all such programs failed and with
the patch they happily get a higher address.

This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist
without it and there would be no security check possible no matter what.  So
we should not bother compiling in this rounding if it is just a waste of
time.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2007-12-06 00:25:10 +11:00
Eric Paris 8869477a49 security: protect from stack expantion into low vm addresses
Add security checks to make sure we are not attempting to expand the
stack into memory protected by mmap_min_addr

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2007-12-06 00:24:48 +11:00
Matthew Wilcox 80cbd911ca Fix kmem_cache_free performance regression in slab
The database performance group have found that half the cycles spent
in kmem_cache_free are spent in this one call to BUG_ON.  Moving it
into the CONFIG_SLAB_DEBUG-only function cache_free_debugcheck() is a
performance win of almost 0.5% on their particular benchmark.

The call was added as part of commit ddc2e812d5
with the comment that "overhead should be minimal".  It may have been
minimal at the time, but it isn't now.

[ Quoth Pekka Enberg: "I don't think the BUG_ON per se caused the
  performance regression but rather the virt_to_head_page() changes to
  virt_to_cache() that were added later." ]

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-30 08:08:05 -08:00
KAMEZAWA Hiroyuki e0dc3a53de memory hotplug fix: fix section mismatch in vmammap_allock_block()
Fixes section mismatch below.

WARNING: vmlinux.o(.text+0x946b5): Section mismatch: reference to .init.text:'
__alloc_bootmem_node (between 'vmemmap_alloc_block' and 'vmemmap_pgd_populate')

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:54 -08:00
Mel Gorman ba72cb8cb0 Fix boot problem with iSeries lacking hugepage support
Ordinarily the size of a pageblock is determined at compile-time based on the
hugepage size. On PPC64, the hugepage size is determined at runtime based on
what is supported by the machine. With legacy machines such as iSeries that
do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order
being set to -PAGE_SHIFT and a crash results shortly afterwards.

This patch adds a function to select a sensible value for pageblock order by
default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT
is a sensible value before using the hugepage size; if it is not MAX_ORDER-1
is used.

This is a fix for 2.6.24.

Credit goes to Stephen Rothwell for identifying the bug and testing candidate
patches.  Additional credit goes to Andy Whitcroft for spotting a problem
with respects to IA-64 before releasing. Additional credit to David Gibson
for testing with the libhugetlbfs test suite.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:51 -08:00
Hugh Dickins 09f345da75 prep_zero_page: remove bogus BUG_ON
2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait"
assertion.  Presumably added under the misconception that clear_highpage
uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-28 11:04:28 -08:00