linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	2c8296f8cf	Merge branch 'slub-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm * 'slub-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm: Explain kmem_cache_cpu fields SLUB: Do not upset lockdep SLUB: Fix coding style violations Add parameter to add_partial to avoid having two functions SLUB: rename defrag to remote_node_defrag_ratio Move count_partial before kmem_cache_shrink SLUB: Fix sysfs refcounting slub: fix shadowed variable sparse warnings	2008-02-04 12:14:55 -08:00
root	ba84c73c7a	SLUB: Do not upset lockdep inconsistent {softirq-on-W} -> {in-softirq-W} usage. swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes: (&n->list_lock){-+..}, at: [<ffffffff802935c1>] add_partial+0x31/0xa0 {softirq-on-W} state was registered at: [<ffffffff80259fb8>] __lock_acquire+0x3e8/0x1140 [<ffffffff80259838>] debug_check_no_locks_freed+0x188/0x1a0 [<ffffffff8025ad65>] lock_acquire+0x55/0x70 [<ffffffff802935c1>] add_partial+0x31/0xa0 [<ffffffff805c76de>] _spin_lock+0x1e/0x30 [<ffffffff802935c1>] add_partial+0x31/0xa0 [<ffffffff80296f9c>] kmem_cache_open+0x1cc/0x330 [<ffffffff805c7984>] _spin_unlock_irq+0x24/0x30 [<ffffffff802974f4>] create_kmalloc_cache+0x64/0xf0 [<ffffffff80295640>] init_alloc_cpu_cpu+0x70/0x90 [<ffffffff8080ada5>] kmem_cache_init+0x65/0x1d0 [<ffffffff807f1b4e>] start_kernel+0x23e/0x350 [<ffffffff807f112d>] _sinittext+0x12d/0x140 [<ffffffffffffffff>] 0xffffffffffffffff This change isn't really necessary for correctness, but it prevents lockdep from getting upset and then disabling itself. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-04 10:56:02 -08:00
Pekka Enberg	064287807c	SLUB: Fix coding style violations This fixes most of the obvious coding style violations in mm/slub.c as reported by checkpatch. Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-04 10:56:02 -08:00
Christoph Lameter	7c2e132c54	Add parameter to add_partial to avoid having two functions Add a parameter to add_partial instead of having separate functions. The parameter allows a more detailed control of where the slab pages is placed in the partial queues. If we put slabs back to the front then they are likely immediately used for allocations. If they are put at the end then we can maximize the time that the partial slabs spent without being subject to allocations. When deactivating slab we can put the slabs that had remote objects freed (we can see that because objects were put on the freelist that requires locks) to them at the end of the list so that the cachelines of remote processors can cool down. Slabs that had objects from the local cpu freed to them (objects exist in the lockless freelist) are put in the front of the list to be reused ASAP in order to exploit the cache hot state of the local cpu. Patch seems to slightly improve tbench speed (1-2%). Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-04 10:56:02 -08:00
Christoph Lameter	9824601ead	SLUB: rename defrag to remote_node_defrag_ratio The NUMA defrag works by allocating objects from partial slabs on remote nodes. Rename it to remote_node_defrag_ratio to be clear about this. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-04 10:56:02 -08:00
Christoph Lameter	f61396aed9	Move count_partial before kmem_cache_shrink Move the counting function for objects in partial slabs so that it is placed before kmem_cache_shrink. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-04 10:56:01 -08:00
Christoph Lameter	151c602f79	SLUB: Fix sysfs refcounting If CONFIG_SYSFS is set then free the kmem_cache structure when sysfs tells us its okay. Otherwise there is the danger (as pointed out by Al Viro) that sysfs thinks the kobject still exists after kmem_cache_destroy() removed it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka J Enberg <penberg@cs.helsinki.fi>	2008-02-04 10:56:01 -08:00
Harvey Harrison	e374d48356	slub: fix shadowed variable sparse warnings Introduce 'len' at outer level: mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one mm/slub.c:3393:6: originally declared here No need to declare new node: mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one mm/slub.c:3491:6: originally declared here No need to declare new x: mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one mm/slub.c:3492:6: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-04 10:56:01 -08:00
Linus Torvalds	f5bb3a5e9d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (79 commits) Jesper Juhl is the new trivial patches maintainer Documentation: mention email-clients.txt in SubmittingPatches fs/binfmt_elf.c: spello fix do_invalidatepage() comment typo fix Documentation/filesystems/porting fixes typo fixes in net/core/net_namespace.c typo fix in net/rfkill/rfkill.c typo fixes in net/sctp/sm_statefuns.c lib/: Spelling fixes kernel/: Spelling fixes include/scsi/: Spelling fixes include/linux/: Spelling fixes include/asm-m68knommu/: Spelling fixes include/asm-frv/: Spelling fixes fs/: Spelling fixes drivers/watchdog/: Spelling fixes drivers/video/: Spelling fixes drivers/ssb/: Spelling fixes drivers/serial/: Spelling fixes drivers/scsi/: Spelling fixes ...	2008-02-04 07:58:52 -08:00
Nick Piggin	2f98735c9c	vm audit: add VM_DONTEXPAND to mmap for drivers that need it Drivers that register a ->fault handler, but do not range-check the offset argument, must set VM_DONTEXPAND in the vm_flags in order to prevent an expanding mremap from overflowing the resource. I've audited the tree and attempted to fix these problems (usually by adding VM_DONTEXPAND where it is not obvious). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-04 07:55:38 -08:00
Fengguang Wu	28bc44d7d1	do_invalidatepage() comment typo fix Fix a typo in the comment for do_invalidatepage(). Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2008-02-03 18:04:10 +02:00
Nick Piggin	124d3b7041	fix writev regression: pan hanging unkillable and un-straceable Frederik Himpe reported an unkillable and un-straceable pan process. Zero length iovecs can go into an infinite loop in writev, because the iovec iterator does not always advance over them. The sequence required to trigger this is not trivial. I think it requires that a zero-length iovec be followed by a non-zero-length iovec which causes a pagefault in the atomic usercopy. This causes the writev code to drop back into single-segment copy mode, which then tries to copy the 0 bytes of the zero-length iovec; a zero length copy looks like a failure though, so it loops. Put a test into iov_iter_advance to catch zero-length iovecs. We could just put the test in the fallback path, but I feel it is more robust to skip over zero-length iovecs throughout the code (iovec iterator may be used in filesystems too, so it should be robust). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-03 07:55:39 +11:00
Linus Torvalds	75659ca0c1	Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits) Remove commented-out code copied from NFS NFS: Switch from intr mount option to TASK_KILLABLE Add wait_for_completion_killable Add wait_event_killable Add schedule_timeout_killable Use mutex_lock_killable in vfs_readdir Add mutex_lock_killable Use lock_page_killable Add lock_page_killable Add fatal_signal_pending Add TASK_WAKEKILL exit: Use task_is_* signal: Use task_is_* sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL ptrace: Use task_is_* power: Use task_is_* wait: Use TASK_NORMAL proc/base.c: Use task_is_* proc/array.c: Use TASK_REPORT perfmon: Use task_is_* ... Fixed up conflicts in NFS/sunrpc manually..	2008-02-01 11:45:47 +11:00
Andi Kleen	03252919b7	x86: print which shared library/executable faulted in segfault etc. messages v3 They now look like: hal-resmgr[13791]: segfault at 3c rip 2b9c8caec182 rsp 7fff1e825d30 error 4 in libacl.so.1.1.0[2b9c8caea000+6000] This makes it easier to pinpoint bugs to specific libraries. And printing the offset into a mapping also always allows to find the correct fault point in a library even with randomized mappings. Previously there was no way to actually find the correct code address inside the randomized mapping. Relies on earlier patch to shorten the printk formats. They are often now longer than 80 characters, but I think that's worth it. [includes fix from Eric Dumazet to check d_path error value] Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:33:18 +01:00
Nick Piggin	95c354fe9f	spinlock: lockbreak cleanup The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:31:20 +01:00
Jiri Kosina	c1d171a002	x86: randomize brk Randomize the location of the heap (brk) for i386 and x86_64. The range is randomized in the range starting at current brk location up to 0x02000000 offset for both architectures. This, together with pie-executable-randomization.patch and pie-executable-randomization-fix.patch, should make the address space randomization on i386 and x86_64 complete. Arjan says: This is known to break older versions of some emacs variants, whose dumper code assumed that the last variable declared in the program is equal to the start of the dynamically allocated memory region. (The dumper is the code where emacs effectively dumps core at the end of it's compilation stage; this coredump is then loaded as the main program during normal use) iirc this was 5 years or so; we found this way back when I was at RH and we first did the security stuff there (including this brk randomization). It wasn't all variants of emacs, and it got fixed as a result (I vaguely remember that emacs already had code to deal with it for other archs/oses, just ifdeffed wrongly). It's a rare and wrong assumption as a general thing, just on x86 it mostly happened to be true (but to be honest, it'll break too if gcc does something fancy or if the linker does a non-standard order). Still its something we should at least document. Note 2: afaik it only broke the emacs build. I'm not 100% sure about that (it IS 5 years ago) though. [ akpm@linux-foundation.org: deuglification ] Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Roland McGrath <roland@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:30:40 +01:00
Paul Mundt	d5f68c6dbd	sh: Bump number of quicklists for SH-5. Sync up with the SH definitions. Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2008-01-28 13:18:55 +09:00
Peter Zijlstra	fa717060f1	sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:27 +01:00
Gautham R Shenoy	95402b3829	cpu-hotplug: replace per-subsystem mutexes with get_online_cpus() This patch converts the known per-subsystem mutexes to get_online_cpus put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug notification events. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:02 +01:00
Linus Torvalds	b47711bfbc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6: selinux: make mls_compute_sid always polyinstantiate security/selinux: constify function pointer tables and fields security: add a secctx_to_secid() hook security: call security_file_permission from rw_verify_area security: remove security_sb_post_mountroot hook Security: remove security.h include from mm.h Security: remove security_file_mmap hook sparse-warnings (NULL as 0). Security: add get, set, and cloning of superblock security information security/selinux: Add missing "space"	2008-01-25 08:44:29 -08:00
Linus Torvalds	df8dc74e8a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6 This can be broken down into these major areas: - Documentation updates (language translations and fixes, as well as kobject and kset documenatation updates.) - major kset/kobject/ktype rework and fixes. This cleans up the kset and kobject and ktype relationship and architecture, making sense of things now, and good documenation and samples are provided for others to use. Also the attributes for kobjects are much easier to handle now. This cleaned up a LOT of code all through the kernel, making kobjects easier to use if you want to. - struct bus_type has been reworked to now handle the lifetime rules properly, as the kobject is properly dynamic. - struct driver has also been reworked, and now the lifetime issues are resolved. - the block subsystem has been converted to use struct device now, and not "raw" kobjects. This patch has been in the -mm tree for over a year now, and finally all the issues are worked out with it. Older distros now properly work with new kernels, and no userspace updates are needed at all. - nozomi driver is added. This has also been in -mm for a long time, and many people have asked for it to go in. It is now in good enough shape to do so. - lots of class_device conversions to use struct device instead. The tree is almost all cleaned up now, only SCSI and IB is the remaining code to fix up... * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (196 commits) Driver core: coding style fixes Kobject: fix coding style issues in kobject c files Kobject: fix coding style issues in kobject.h Driver core: fix coding style issues in device.h spi: use class iteration api scsi: use class iteration api rtc: use class iteration api power supply : use class iteration api ieee1394: use class iteration api Driver Core: add class iteration api Driver core: Cleanup get_device_parent() in device_add() and device_move() UIO: constify function pointer tables Driver Core: constify the name passed to platform_device_register_simple driver core: fix build with SYSFS=n sysfs: make SYSFS_DEPRECATED depend on SYSFS Driver core: use LIST_HEAD instead of call to INIT_LIST_HEAD in __init kobject: add sample code for how to use ksets/ktypes/kobjects kobject: add sample code for how to use kobjects in a simple manner. kobject: update the kobject/kset documentation kobject: remove old, outdated documentation. ...	2008-01-25 08:35:13 -08:00
Pekka Enberg	556a169dab	slab: fix bootstrap on memoryless node If the node we're booting on doesn't have memory, bootstrapping kmalloc() caches resorts to fallback_alloc() which requires ->nodelists set for all nodes. Fix that by calling set_up_list3s() for CACHE_CACHE in kmem_cache_init(). As kmem_getpages() is called with GFP_THISNODE set, this used to work before because of breakage in 2.6.22 and before with GFP_THISNODE returning pages from the wrong node if a node had no memory. So it may have worked accidentally and in an unsafe manner because the pages would have been associated with the wrong node which could trigger bug ons and locking troubles. Tested-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> [ With additional one-liner by Olaf Hering - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-25 08:30:36 -08:00
Greg Kroah-Hartman	1eada11c88	Kobject: convert mm/slub.c to use kobject_init/add_ng() This converts the code to use the new kobject functions, cleaning up the logic in doing so. Cc: Christoph Lameter <clameter@sgi.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:31 -08:00
Greg Kroah-Hartman	0ff21e4663	kobject: convert kernel_kset to be a kobject kernel_kset does not need to be a kset, but a much simpler kobject now that we have kobj_attributes. We also rename kernel_kset to kernel_kobj to catch all users of this symbol with a build error instead of an easy-to-ignore build warning. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:24 -08:00
Greg Kroah-Hartman	081248de0a	kset: move /sys/slab to /sys/kernel/slab /sys/kernel is where these things should go. Also updated the documentation and tool that used this directory. Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:16 -08:00
Greg Kroah-Hartman	27c3a314d5	kset: convert slub to use kset_create Dynamically create the kset instead of declaring it statically. Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:15 -08:00
Greg Kroah-Hartman	3514faca19	kobject: remove struct kobj_type from struct kset We don't need a "default" ktype for a kset. We should set this explicitly every time for each kset. This change is needed so that we can make ksets dynamic, and cleans up one of the odd, undocumented assumption that the kset/kobject/ktype model has. This patch is based on a lot of help from Kay Sievers. Nasty bug in the block code was found by Dave Young <hidave.darkstar@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:10 -08:00
Richard Knutsson	88c3f7a8f2	Security: remove security_file_mmap hook sparse-warnings (NULL as 0). Fixing: CHECK mm/mmap.c mm/mmap.c:1623:29: warning: Using plain integer as NULL pointer mm/mmap.c:1623:29: warning: Using plain integer as NULL pointer mm/mmap.c:1944:29: warning: Using plain integer as NULL pointer Signed-off-by: Richard Knutsson <ricknu-0@student.ltu.se> Signed-off-by: James Morris <jmorris@namei.org>	2008-01-25 11:29:48 +11:00
Mel Gorman	9c09a95cf4	slab: partially revert list3 changes Partial revert the changes made by `04231b3002` to the kmem_list3 management. On a machine with a memoryless node, this BUG_ON was triggering static void ____cache_alloc_node(struct kmem_cache cachep, gfp_t flags, int nodeid) { struct list_head entry; struct slab slabp; struct kmem_list3 l3; void obj; int x; l3 = cachep->nodelists[nodeid]; BUG_ON(!l3); Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-24 08:07:27 -08:00
Larry Woodman	c5c99429fa	fix hugepages leak due to pagetable page sharing The shared page table code for hugetlb memory on x86 and x86_64 is causing a leak. When a user of hugepages exits using this code the system leaks some of the hugepages. ------------------------------------------------------- Part of /proc/meminfo just before database startup: HugePages_Total: 5500 HugePages_Free: 5500 HugePages_Rsvd: 0 Hugepagesize: 2048 kB Just before shutdown: HugePages_Total: 5500 HugePages_Free: 4475 HugePages_Rsvd: 0 Hugepagesize: 2048 kB After shutdown: HugePages_Total: 5500 HugePages_Free: 4988 HugePages_Rsvd: 0 Hugepagesize: 2048 kB ---------------------------------------------------------- The problem occurs durring a fork, in copy_hugetlb_page_range(). It locates the dst_pte using huge_pte_alloc(). Since huge_pte_alloc() calls huge_pmd_share() it will share the pmd page if can, yet the main loop in copy_hugetlb_page_range() does a get_page() on every hugepage. This is a violation of the shared hugepmd pagetable protocol and creates additional referenced to the hugepages causing a leak when the unmap of the VMA occurs. We can skip the entire replication of the ptes when the hugepage pagetables are shared. The attached patch skips copying the ptes and the get_page() calls if the hugetlbpage pagetable is shared. [akpm@linux-foundation.org: coding-style cleanups] Signed-off-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-24 08:07:27 -08:00
Anton Salikhmetov	8f7b3d156d	Update ctime and mtime for memory-mapped files Update ctime and mtime for memory-mapped files at a write access on a present, read-only PTE, as well as at a write on a non-present PTE. Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-23 09:58:55 -08:00
Carsten Otte	9723198c21	#ifdef very expensive debug check in page fault path This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page that verifies that a pfn is valid. This patch increases performance of the page fault microbenchmark in lmbench by 13% and overall dbench performance by 7% on s390x. pfn_valid() is an expensive operation on s390 that needs a high double digit amount of CPU cycles. Nick Piggin suggested that pfn_valid() involves an array lookup on systems with sparsemem, and therefore is an expensive operation there too. The check looks like a clear debug thing to me, it should never trigger on regular kernels. And if a pte is created for an invalid pfn, we'll find out once the memory gets accessed later on anyway. Please consider inclusion of this patch into mm. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-17 15:38:59 -08:00
Sam Ravnborg	1d6f4e60e7	mm: fix section mismatch warning in page_alloc.c With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw following warning: WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback') The culprit was zone_batchsize() which were annotated __devinit but used from process_zones() which is annotated __cpuinit. zone_batchsize() are used from another function annotated __meminit so the only valid option is to drop the annotation of zone_batchsize() so we know it is always valid to use it. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-17 15:38:58 -08:00
Linus Torvalds	c23f72cae9	Revert "writeback: introduce writeback_control.more_io to indicate more io" This reverts commit `2e6883bdf4`, as requested by Fengguang Wu. It's not quite fully baked yet, and while there are patches around to fix the problems it caused, they should get more testing. Says Fengguang: "I'll resend them both for -mm later on, in a more complete patchset". See http://bugzilla.kernel.org/show_bug.cgi?id=9738 for some of this discussion. Requested-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-14 21:21:29 -08:00
Ken Chen	68842c9b94	hugetlbfs: fix quota leak In the error path of both shared and private hugetlb page allocation, the file system quota is never undone, leading to fs quota leak. Fix them up. [akpm@linux-foundation.org: cleanup, micro-optimise] Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-14 08:52:23 -08:00
Christoph Lameter	96990a4ae9	quicklists: Only consider memory that can be used with GFP_KERNEL Quicklists calculates the size of the quicklists based on the number of free pages. This must be the number of free pages that can be allocated with GFP_KERNEL. node_page_state() includes the pages in ZONE_HIGHMEM and ZONE_MOVABLE which may lead the quicklists to become too large causing OOM. Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-14 08:52:22 -08:00
Thomas Bogendoerfer	467bc461d2	Fix crash with FLAT_MEMORY and ARCH_PFN_OFFSET != 0 When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in memmap_init_zone(). This bug got introduced by commit `c713216dee` Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-08 16:10:36 -08:00
Akinobu Mita	c51b1a160b	xip: fix get_zeroed_page with __GFP_HIGHMEM The use of get_zeroed_page() with __GFP_HIGHMEM is invalid. Use alloc_page() with __GFP_ZERO instead of invalid get_zeroed_page(). (This patch is only compile tested) Cc: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-08 16:10:36 -08:00
Linus Torvalds	158a962422	Unify /proc/slabinfo configuration Both SLUB and SLAB really did almost exactly the same thing for /proc/slabinfo setup, using duplicate code and per-allocator #ifdef's. This just creates a common CONFIG_SLABINFO that is enabled by both SLUB and SLAB, and shares all the setup code. Maybe SLOB will want this some day too. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-02 13:04:48 -08:00
Pekka J Enberg	57ed3eda97	slub: provide /proc/slabinfo This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work. [ mingo@elte.hu: build fix. ] Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-01 11:32:02 -08:00
Christoph Lameter	76be895001	SLUB: Improve hackbench speed Increase the mininum number of partial slabs to keep around and put partial slabs to the end of the partial queue so that they can add more objects. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-21 15:51:07 -08:00
Linus Torvalds	3a6927906f	Do dirty page accounting when removing a page from the page cache Krzysztof Oledzki noticed a dirty page accounting leak on some of his machines, causing the machine to eventually lock up when the kernel decided that there was too much dirty data, but nobody could actually write anything out to fix it. The culprit turns out to be filesystems (cough ext3 with data=journal cough) that re-dirty the page when the "->invalidatepage()" callback is called. Fix it up by doing a final dirty page accounting check when we actually remove the page from the page cache. This fixes bugzilla entry 9182: http://bugzilla.kernel.org/show_bug.cgi?id=9182 Tested-by: Ingo Molnar <mingo@elte.hu> Tested-by: Krzysztof Oledzki <olel@ans.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-19 14:05:13 -08:00
Christoph Lameter	3811dbf671	SLUB: remove useless masking of GFP_ZERO Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already masked out in new_slab() (See how it calls allocate_slab). No need to do it twice. This reverts the SLUB parts of `7fd272550b`. Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:17 -08:00
Nishanth Aravamudan	368d2c6358	Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" This reverts commit `54f9f80d65` ("hugetlb: Add hugetlb_dynamic_pool sysctl") Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool sysctl is not needed, as its semantics can be expressed by 0 in the overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl (pool enabled). (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change) Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:17 -08:00
Nishanth Aravamudan	d1c3fb1f8f	hugetlb: introduce nr_overcommit_hugepages sysctl hugetlb: introduce nr_overcommit_hugepages sysctl While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I became convinced that having a boolean sysctl was insufficient: 1) To support per-node control of hugepages, I have previously submitted patches to add a sysfs attribute related to nr_hugepages. However, with a boolean global value and per-mount quota enforcement constraining the dynamic pool, adding corresponding control of the dynamic pool on a per-node basis seems inconsistent to me. 2) Administration of the hugetlb dynamic pool with multiple hugetlbfs mount points is, arguably, more arduous than it needs to be. Each quota would need to be set separately, and the sum would need to be monitored. To ease the administration, and to help make the way for per-node control of the static & dynamic hugepage pool, I added a separate sysctl, nr_overcommit_hugepages. This value serves as a high watermark for the overall hugepage pool, while nr_hugepages serves as a low watermark. The boolean sysctl can then be removed, as the condition nr_overcommit_hugepages > 0 indicates the same administrative setting as hugetlb_dynamic_pool == 1 Quotas still serve as local enforcement of the size of the pool on a per-mount basis. A few caveats: 1) There is a race whereby the global surplus huge page counter is incremented before a hugepage has allocated. Another process could then try grow the pool, and fail to convert a surplus huge page to a normal huge page and instead allocate a fresh huge page. I believe this is benign, as no memory is leaked (the actual pages are still tracked correctly) and the counters won't go out of sync. 2) Shrinking the static pool while a surplus is in effect will allow the number of surplus huge pages to exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. Successfully tested on x86_64 with the current libhugetlbfs snapshot, modified to use the new sysctl. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:17 -08:00
Mel Gorman	81eabcbe0b	mm: fix page allocation for larger I/O segments In some cases the IO subsystem is able to merge requests if the pages are adjacent in physical memory. This was achieved in the allocator by having expand() return pages in physically contiguous order in situations were a large buddy was split. However, list-based anti-fragmentation changed the order pages were returned in to avoid searching in buffered_rmqueue() for a page of the appropriate migrate type. This patch restores behaviour of rmqueue_bulk() preserving the physical order of pages returned by the allocator without incurring increased search costs for anti-fragmentation. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Mark Lord <mlord@pobox.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
WANG Cong	bbd0682596	mm/sparse.c: improve the error handling for sparse_add_one_section() Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I see no reason to check 'usemap' until holding the 'pgdat_resize_lock'. [geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST] Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
WANG Cong	af0cd5a7c3	mm/sparse.c: check the return value of sparse_index_alloc() Since sparse_index_alloc() can return NULL on memory allocation failure, we must deal with the failure condition when calling it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
Geoff Levand	a5ee6daa52	sparsemem: make SPARSEMEM_VMEMMAP selectable SPARSEMEM_VMEMMAP needs to be a selectable config option to support building the kernel both with and without sparsemem vmemmap support. This selection is desirable for platforms which could be configured one way for platform specific builds and the other for multi-platform builds. Signed-off-by: Miguel Botón <mboton@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
Adam Litke	72fad7139b	hugetlb: handle write-protection faults in follow_hugetlb_page The follow_hugetlb_page() fix I posted (merged as git commit `5b23dbe817`) missed one case. If the pte is present, but not writable and write access is requested by the caller to get_user_pages(), the code will do the wrong thing. Rather than calling hugetlb_fault to make the pte writable, it notes the presence of the pte and continues. This simple one-liner makes sure we also fault on the pte for this case. Please apply. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Dave Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-10 19:43:55 -08:00

1 2 3 4 5 ...

1806 Commits