Commit Graph

4322 Commits

Author SHA1 Message Date
KAMEZAWA Hiroyuki 5ee28a4476 vmstat: update zone stat threshold when onlining a cpu
refresh_zone_stat_thresholds() calculates parameter based on the number of
online cpus.  It's called at cpu offlining but needs to be called at
onlining, too.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins 3399446632 swap: discard while swapping only if SWAP_FLAG_DISCARD
Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
show a shift since I last tested in December: in part because of firmware
updates, in part because of the necessary move from barriers to awaiting
completion at the block layer.  While discard at swapon still shows as
slightly beneficial on both, discarding 1MB swap cluster when allocating
is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

Surrender: discard as presently implemented is more hindrance than help
for swap; but might prove useful on other devices, or with improvements.
So continue to do the discard at swapon, but make discard while swapping
conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
only the lower 16 bits of int flags).

We can add a --discard or -d to swapon(8), and a "discard" to swap in
/etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Christoph Hellwig 8f2ae0faa3 swap: do not send discards as barriers
The swap code already uses synchronous discards, no need to add I/O
barriers.

This fixes the worst of the terrible slowdown in swap allocation for
hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
eliminate that regression.

[tj@kernel.org: superflous newlines removed]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins b73d7fcecd swap: prevent reuse during hibernation
Move the hibernation check from scan_swap_map() into try_to_free_swap():
to catch not only the common case when hibernation's allocation itself
triggers swap reuse, but also the less likely case when concurrent page
reclaim (shrink_page_list) might happen to try_to_free_swap from a page.

Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
reclaim from going to swap: check that to prevent swap reuse too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins 910321ea81 swap: revert special hibernation allocation
Please revert 2.6.36-rc commit d2997b1042
"hibernation: freeze swap at hibernation".  It complicated matters by
adding a second swap allocation path, just for hibernation; without in any
way fixing the issue that it was intended to address - page reclaim after
fixing the hibernation image might free swap from a page already imaged as
swapcache, letting its swap be reallocated to store a different page of
the image: resulting in data corruption if the imaged page were freed as
clean then swapped back in.  Pages freed to si->swap_map were still in
danger of being reallocated by the alternative allocation path.

I guess it inadvertently fixed slow SSD swap allocation for hibernation,
as reported by Nigel Cunningham: by missing out the discards that occur on
the usual swap allocation path; but that was unintentional, and needs a
separate fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Gary King ac8456d6f9 bounce: call flush_dcache_page() after bounce_copy_vec()
I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM
enabled on 2.6.35 (plus some patches targetted at 2.6.36 to perform cache
maintenance lazily), and the root cause appears to be that the mm bouncing
code is calling flush_dcache_page before it copies the bounce buffer into
the bio.

The bounced page needs to be flushed after data is copied into it, to
ensure that architecture implementations can synchronize instruction and
data caches if necessary.

Signed-off-by: Gary King <gking@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
KAMEZAWA Hiroyuki 0dcc48c15f memory hotplug: fix next block calculation in is_removable
next_active_pageblock() is for finding next _used_ freeblock.  It skips
several blocks when it finds there are a chunk of free pages lager than
pageblock.  But it has 2 bugs.

  1. We have no lock. page_order(page) - pageblock_order can be minus.
  2. pageblocks_stride += is wrong. it should skip page_order(p) of pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Minchan Kim bc69304574 mm: compaction: handle active and inactive fairly in too_many_isolated
Iram reported that compaction's too_many_isolated() loops forever.
(http://www.spinics.net/lists/linux-mm/msg08123.html)

The meminfo when the situation happened was inactive anon is zero.  That's
because the system has no memory pressure until then.  While all anon
pages were in the active lru, compaction could select active lru as well
as inactive lru.  That's a different thing from vmscan's isolated.  So we
has been two too_many_isolated.

While compaction can isolate pages in both active and inactive, current
implementation of too_many_isolated only considers inactive.  It made
Iram's problem.

This patch handles active and inactive fairly.  That's because we can't
expect where from and how many compaction would isolated pages.

This patch changes (nr_isolated > nr_inactive) with
nr_isolated > (nr_active + nr_inactive) / 2.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Iram Shahzad <iram.shahzad@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Andrea Arcangeli 152e0659fc mm: avoid warning when COMPACTION is selected
COMPACTION enables MIGRATION, but MIGRATION spawns a warning if numa or
memhotplug aren't selected.  However MIGRATION doesn't depend on them.  I
guess it's just trying to be strict doing a double check on who's enabling
it, but it doesn't know that compaction also enables MIGRATION.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Andrea Arcangeli 4969c1192d mm: fix swapin race condition
The pte_same check is reliable only if the swap entry remains pinned (by
the page lock on swapcache).  We've also to ensure the swapcache isn't
removed before we take the lock as try_to_free_swap won't care about the
page pin.

One of the possible impacts of this patch is that a KSM-shared page can
point to the anon_vma of another process, which could exit before the page
is freed.

This can leave a page with a pointer to a recycled anon_vma object, or
worse, a pointer to something that is no longer an anon_vma.

[riel@redhat.com: changelog help]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Stefan Bader 39aa3cb3e8 mm: Move vma_stack_continue into mm.h
So it can be used by all that need to check for that.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 09:05:06 -07:00
Linus Torvalds ce7db282a3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix a mismatch between code and comment
  percpu: fix a memory leak in pcpu_extend_area_map()
  percpu: add __percpu notations to UP allocator
  percpu: handle __percpu notations in UP accessors
2010-09-07 14:08:37 -07:00
Linus Torvalds 997396a73a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix get_ticket_handler() error handling
  ceph: don't BUG on ENOMEM during mds reconnect
  ceph: ceph_mdsc_build_path() returns an ERR_PTR
  ceph: Fix warnings
  ceph: ceph_get_inode() returns an ERR_PTR
  ceph: initialize fields on new dentry_infos
  ceph: maintain i_head_snapc when any caps are dirty, not just for data
  ceph: fix osd request lru adjustment when sending request
  ceph: don't improperly set dir complete when holding EXCL cap
  mm: exporting account_page_dirty
  ceph: direct requests in snapped namespace based on nonsnap parent
  ceph: queue cap snap writeback for realm children on snap update
  ceph: include dirty xattrs state in snapped caps
  ceph: fix xattr cap writeback
  ceph: fix multiple mds session shutdown
2010-08-28 14:07:20 -07:00
Hugh Dickins f18194275c mm: fix hang on anon_vma->root->lock
After several hours, kbuild tests hang with anon_vma_prepare() spinning on
a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
(which makes this very much more likely, but it could happen without).

The ever-subtle page_lock_anon_vma() now needs a further twist: since
anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
of a reused anon_vma structure at any moment, page_lock_anon_vma()
needs to check page_mapped() again before succeeding, otherwise
page_unlock_anon_vma() might address a different root->lock.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-28 13:54:12 -07:00
Namhyung Kim 54157c4447 percpu: fix a mismatch between code and comment
When pcpu_build_alloc_info() searches best_upa value, it ignores current value
if the number of waste units exceeds 1/3 of the number of total cpus. But the
comment on the code says that it will ignore if wastage is over 25%.
Modify the comment.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-08-27 11:36:19 +02:00
Huang Shijie a002d14842 percpu: fix a memory leak in pcpu_extend_area_map()
The original code did not free the old map.  This patch fixes it.

tj: use @old as memcpy source instead of @chunk->map, and indentation
    and description update

Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
2010-08-27 11:36:08 +02:00
Linus Torvalds 871eae4891 Merge branch '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev
* '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
  xfs: do not discard page cache data on EAGAIN
  xfs: don't do memory allocation under the CIL context lock
  xfs: Reduce log force overhead for delayed logging
  xfs: dummy transactions should not dirty VFS state
  xfs: ensure f_ffree returned by statfs() is non-negative
  xfs: handle negative wbc->nr_to_write during sync writeback
  writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
  xfs: fix untrusted inode number lookup
  xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE
  xfs: unlock items before allowing the CIL to commit
2010-08-25 08:39:07 -07:00
Luck, Tony 8ca3eb0809 guard page for stacks that grow upwards
pa-risc and ia64 have stacks that grow upwards. Check that
they do not run into other mappings. By making VM_GROWSUP
0x0 on architectures that do not ever use it, we can avoid
some unpleasant #ifdefs in check_stack_guard_page().

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-24 12:13:20 -07:00
Dave Chinner 546a192422 writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
been. Enabling writeback tracing showed:

    flush-253:16-8516  [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516  [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0

Writeback is not terminating in background writeback if ->writepage is
returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
writeback on XFS.

Fix the write_cache_pages loop to terminate correctly when this situation
occurs and so prevent this sub-optimal background writeback pattern. This
improves sustained sequential buffered write performance from around
250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.

Cc:<stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-24 11:44:34 +10:00
Michael Rubin 679ceace84 mm: exporting account_page_dirty
This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.

Signed-off-by: Michael Rubin <mrubin@google.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:51 -07:00
Linus Torvalds bc584c5107 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slab: fix object alignment
  slub: add missing __percpu markup in mm/slub_def.h
2010-08-22 10:08:52 -07:00
Linus Torvalds 0e8e50e20c mm: make stack guard page logic use vm_prev pointer
Like the mlock() change previously, this makes the stack guard check
code use vma->vm_prev to see what the mapping below the current stack
is, rather than have to look it up with find_vma().

Also, accept an abutting stack segment, since that happens naturally if
you split the stack with mlock or mprotect.

Tested-by: Ian Campbell <ijc@hellion.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-21 08:50:00 -07:00
Linus Torvalds 7798330ac8 mm: make the mlock() stack guard page checks stricter
If we've split the stack vma, only the lowest one has the guard page.
Now that we have a doubly linked list of vma's, checking this is trivial.

Tested-by: Ian Campbell <ijc@hellion.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-21 08:49:50 -07:00
Linus Torvalds 297c5eee37 mm: make the vma list be doubly linked
It's a really simple list, and several of the users want to go backwards
in it to find the previous vma.  So rather than have to look up the
previous entry with 'find_vma_prev()' or something similar, just make it
doubly linked instead.

Tested-by: Ian Campbell <ijc@hellion.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-21 08:49:21 -07:00
KOSAKI Motohiro 8d6c83f0ba oom: __task_cred() need rcu_read_lock()
dump_tasks() needs to hold the RCU read lock around its access of the
target task's UID.  To this end it should use task_uid() as it only needs
that one thing from the creds.

The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the
target process replacing its credentials on another CPU.

Then, this patch change to call rcu_read_lock() explicitly.

	===================================================
	[ INFO: suspicious rcu_dereference_check() usage. ]
	---------------------------------------------------
	mm/oom_kill.c:410 invoked rcu_dereference_check() without protection!

	other info that might help us debug this:

	rcu_scheduler_active = 1, debug_locks = 1
	4 locks held by kworker/1:2/651:
	 #0:  (events){+.+.+.}, at: [<ffffffff8106aae7>]
	process_one_work+0x137/0x4a0
	 #1:  (moom_work){+.+...}, at: [<ffffffff8106aae7>]
	process_one_work+0x137/0x4a0
	 #2:  (tasklist_lock){.+.+..}, at: [<ffffffff810fafd4>]
	out_of_memory+0x164/0x3f0
	 #3:  (&(&p->alloc_lock)->rlock){+.+...}, at: [<ffffffff810fa48e>]
	find_lock_task_mm+0x2e/0x70

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-20 09:34:55 -07:00
KOSAKI Motohiro b52723c560 oom: fix tasklist_lock leak
Commit 0aad4b3124 ("oom: fold __out_of_memory into out_of_memory")
introduced a tasklist_lock leak.  Then it caused following obvious
danger warnings and panic.

    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    rsyslogd/1422 is leaving the kernel with locks still held!
    1 lock held by rsyslogd/1422:
     #0:  (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0
    BUG: scheduling while atomic: rsyslogd/1422/0x00000002
    INFO: lockdep is turned off.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-20 09:34:55 -07:00
KOSAKI Motohiro be71cf2202 oom: fix NULL pointer dereference
Commit b940fd7035 ("oom: remove unnecessary code and cleanup") added an
unnecessary NULL pointer dereference.  remove it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-20 09:34:55 -07:00
Jan Kara d5ed3a4af7 lib/radix-tree.c: fix overflow in radix_tree_range_tag_if_tagged()
When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows
index and tree traversal code goes astray reading memory until it hits
unreadable memory.  Check for overflow and exit in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-20 09:34:55 -07:00
Hugh Dickins 602586a83b shmem: put_super must percpu_counter_destroy
list_add() corruption messages reported from shmem_fill_super()'s recently
introduced percpu_counter_init(): shmem_put_super() needs to remember to
percpu_counter_destroy().  And also check error from percpu_counter_init().

Reported-bisected-and-tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-17 18:33:11 -07:00
Linus Torvalds d7824370e2 mm: fix up some user-visible effects of the stack guard page
This commit makes the stack guard page somewhat less visible to user
space. It does this by:

 - not showing the guard page in /proc/<pid>/maps

   It looks like lvm-tools will actually read /proc/self/maps to figure
   out where all its mappings are, and effectively do a specialized
   "mlockall()" in user space.  By not showing the guard page as part of
   the mapping (by just adding PAGE_SIZE to the start for grows-up
   pages), lvm-tools ends up not being aware of it.

 - by also teaching the _real_ mlock() functionality not to try to lock
   the guard page.

   That would just expand the mapping down to create a new guard page,
   so there really is no point in trying to lock it in place.

It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depends on the exact deails of the 'maps' file.

Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.

Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be
Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-15 11:35:52 -07:00
Randy Dunlap 03ab450f03 mm/page-writeback: fix non-kernel-doc function comments
Remove leading /** from non-kernel-doc function comments to prevent
kernel-doc warnings.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-14 16:20:59 -07:00
Linus Torvalds 11ac552477 mm: fix page table unmap for stack guard page properly
We do in fact need to unmap the page table _before_ doing the whole
stack guard page logic, because if it is needed (mainly 32-bit x86 with
PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it
will do a kmap_atomic/kunmap_atomic.

And those kmaps will create an atomic region that we cannot do
allocations in.  However, the whole stack expand code will need to do
anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an
atomic region.

Now, a better model might actually be to do the anon_vma_prepare() when
_creating_ a VM_GROWSDOWN segment, and not have to worry about any of
this at page fault time.  But in the meantime, this is the
straightforward fix for the issue.

See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details.

Reported-by: Wylda <wylda@volny.cz>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Reported-by: Mike Pagano <mpagano@gentoo.org>
Reported-by: François Valenduc <francois.valenduc@tvcablenet.be>
Tested-by: Ed Tomlinson <edt@aei.ca>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-14 11:44:56 -07:00
David Howells fe622e76fd NOMMU: Remove an extraneous no_printk()
Remove an extraneous no_printk() in mm/nommu.c that got missed when the
function got generalised from several things that used it in commit
12fdff3fc2 ("Add a dummy printk function for the maintenance of unused
printks").

Without this, the following error is observed:

  mm/nommu.c:41: error: conflicting types for 'no_printk'
  include/linux/kernel.h:314: error: previous definition of 'no_printk' was here

Reported-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-13 16:55:25 -07:00
Linus Torvalds 5528f9132c mm: fix missing page table unmap for stack guard page failure case
.. which didn't show up in my tests because it's a no-op on x86-64 and
most other architectures.  But we enter the function with the last-level
page table mapped, and should unmap it at exit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-13 09:24:04 -07:00
Linus Torvalds 320b2b8de1 mm: keep a guard page below a grow-down stack segment
This is a rather minimally invasive patch to solve the problem of the
user stack growing into a memory mapped area below it.  Whenever we fill
the first page of the stack segment, expand the segment down by one
page.

Now, admittedly some odd application might _want_ the stack to grow down
into the preceding memory mapping, and so we may at some point need to
make this a process tunable (some people might also want to have more
than a single page of guarding), but let's try the minimal approach
first.

Tested with trivial application that maps a single page just below the
stack, and then starts recursing.  Without this, we will get a SIGSEGV
_after_ the stack has smashed the mapping.  With this patch, we'll get a
nice SIGBUS just as the stack touches the page just above the mapping.

Requested-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 17:54:33 -07:00
Linus Torvalds 1021a64534 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
  hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
  hwpoison: rename CONFIG
  HWPOISON, hugetlb: support hwpoison injection for hugepage
  HWPOISON, hugetlb: detect hwpoison in hugetlb code
  HWPOISON, hugetlb: isolate corrupted hugepage
  HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
  HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
  HWPOISON, hugetlb: enable error handling path for hugepage
  hugetlb, rmap: add reverse mapping for hugepage
  hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

Fix up trivial conflicts in mm/memory-failure.c
2010-08-12 10:15:10 -07:00
Linus Torvalds 26f0cf9181 Merge branch 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  x86: Detect whether we should use Xen SWIOTLB.
  pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
  swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
  xen/mmu: inhibit vmap aliases rather than trying to clear them out
  vmap: add flag to allow lazy unmap to be disabled at runtime
  xen: Add xen_create_contiguous_region
  xen: Rename the balloon lock
  xen: Allow unprivileged Xen domains to create iomap pages
  xen: use _PAGE_IOMAP in ioremap to do machine mappings

Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
include/xen/xen-ops.h
2010-08-12 09:09:41 -07:00
Wu Fengguang 1babe18385 writeback: add comment to the dirty limit functions
Document global_dirty_limits() and bdi_dirty_limit().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:30 -07:00
Wu Fengguang 16c4042f08 writeback: avoid unnecessary calculation of bdi dirty thresholds
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
that the latter can be avoided when under global dirty background
threshold (which is the normal state for most systems).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:29 -07:00
Wu Fengguang e50e37201a writeback: balance_dirty_pages(): reduce calls to global_page_state
Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.

'perf stats' of simple fio write tests shows the reduction in cache
access.  Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2
with 3Gb memory (dirty_threshold approx 600 Mb) running each test 10
times, dropping the fasted & slowest values then taking the average &
standard deviation

		average (s.d.) in millions (10^6)
2.6.31-rc8	648.6 (14.6)
+patch		620.1 (16.5)

Achieving this reduction is by dropping clip_bdi_dirty_limit as it rereads
the counters to apply the dirty_threshold and moving this check up into
balance_dirty_pages where it has already read the counters.

Also by rearrange the for loop to only contain one copy of the limit tests
allows the pdflush test after the loop to use the local copies of the
counters rather than rereading them.

In the common case with no throttling it now calls global_page_state 5
fewer times and bdi_stat 2 fewer.

Fengguang:

This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
avoid exceeding the dirty limit.  Since the bdi dirty limit is mostly
accurate we don't need to do routinely clip.  A simple dirty limit check
would be enough.

The check is necessary because, in principle we should throttle everything
calling balance_dirty_pages() when we're over the total limit, as said by
Peter.

We now set and clear dirty_exceeded not only based on bdi dirty limits,
but also on the global dirty limit.  The global limit check is added in
place of clip_bdi_dirty_limit() for safety and not intended as a behavior
change.  The bdi limits should be tight enough to keep all dirty pages
under the global limit at most time; occasional small exceeding should be
OK though.  The change makes the logic more obvious: the global limit is
the ultimate goal and shall be always imposed.

We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and have to) double check
the states.  It reduces overall overheads because the test based on old
states still have good chance to be right.

[akpm@linux-foundation.org] fix uninitialized dirty_exceeded
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:29 -07:00
Randy Dunlap 3c111a071d mm: fix fatal kernel-doc error
Fix a fatal kernel-doc error due to a #define coming between a function's
kernel-doc notation and the function signature.  (kernel-doc cannot handle
this)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:29 -07:00
KOSAKI Motohiro 13d7e3a2db memcg: convert to use zone_to_nid() from bare zone->zone_pgdat->node_id
We have zone_to_nid().  this patch convert all existing users of
zone->zone_pgdat->node_id.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
KOSAKI Motohiro 00918b6ab8 memcg: remove nid and zid argument from mem_cgroup_soft_limit_reclaim()
mem_cgroup_soft_limit_reclaim() has zone, nid and zid argument.  but nid
and zid can be calculated from zone.  So remove it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
KOSAKI Motohiro 14fec79680 memcg: mem_cgroup_shrink_node_zone() doesn't need sc.nodemask
Currently mem_cgroup_shrink_node_zone() call shrink_zone() directly.  thus
it doesn't need to initialize sc.nodemask because shrink_zone() doesn't
use it at all.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
KOSAKI Motohiro da280d636b memcg: kill unnecessary initialization in mem_cgroup_shrink_node_zone()
sc.nr_reclaimed and sc.nr_scanned have already been initialized few lines
above "struct scan_control sc = {}" statement.

So, This patch remove this unnecessary code.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
KOSAKI Motohiro b8f5c5664d memcg: sc.nr_to_reclaim should be initialized
Currently, mem_cgroup_shrink_node_zone() initialize sc.nr_to_reclaim as 0.
 It mean shrink_zone() only scan 32 pages and immediately return even if
it doesn't reclaim any pages.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
KAMEZAWA Hiroyuki f75ca96203 memcg: avoid css_get()
Now, memory cgroup increments css(cgroup subsys state)'s reference count
per a charged page.  And the reference count is kept until the page is
uncharged.  But this has 2 bad effect.

 1. Because css_get/put calls atomic_inc()/dec, heavy call of them
    on large smp will not scale well.
 2. Because css's refcnt cannot be in a state as "ready-to-release",
    cgroup's notify_on_release handler can't work with memcg.
 3. css's refcnt is atomic_t, it means smaller than 32bit. Maybe too small.

This has been a problem since the 1st merge of memcg.

This is a trial to remove css's refcnt per a page. Even if we remove
refcnt, pre_destroy() does enough synchronization as
  - check res->usage == 0.
  - check no pages on LRU.

This patch removes css's refcnt per page.  Even after this patch, at the
1st look, it seems css_get() is still called in try_charge().

But the logic is.

  - If a memcg of mm->owner is cached one, consume_stock() will work.
    At success, return immediately.
  - If consume_stock returns false, css_get() is called and go to
    slow path which may be blocked. At the end of slow path,
    css_put() is called and restart from the start if necessary.

So, in the fast path, we don't call css_get() and can avoid access to
shared counter. This patch can make the most possible case fast.

Here is a result of multi-threaded page fault benchmark.

[Before]
    25.32%  multi-fault-all  [kernel.kallsyms]      [k] clear_page_c
     9.30%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
     8.02%  multi-fault-all  [kernel.kallsyms]      [k] try_get_mem_cgroup_from_mm <=====(*)
     7.83%  multi-fault-all  [kernel.kallsyms]      [k] down_read_trylock
     5.38%  multi-fault-all  [kernel.kallsyms]      [k] __css_put
     5.29%  multi-fault-all  [kernel.kallsyms]      [k] __alloc_pages_nodemask
     4.92%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irq
     4.24%  multi-fault-all  [kernel.kallsyms]      [k] up_read
     3.53%  multi-fault-all  [kernel.kallsyms]      [k] css_put
     2.11%  multi-fault-all  [kernel.kallsyms]      [k] handle_mm_fault
     1.76%  multi-fault-all  [kernel.kallsyms]      [k] __rmqueue
     1.64%  multi-fault-all  [kernel.kallsyms]      [k] __mem_cgroup_commit_charge

[After]
    28.41%  multi-fault-all  [kernel.kallsyms]      [k] clear_page_c
    10.08%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irq
     9.58%  multi-fault-all  [kernel.kallsyms]      [k] down_read_trylock
     9.38%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
     5.86%  multi-fault-all  [kernel.kallsyms]      [k] __alloc_pages_nodemask
     5.65%  multi-fault-all  [kernel.kallsyms]      [k] up_read
     2.82%  multi-fault-all  [kernel.kallsyms]      [k] handle_mm_fault
     2.64%  multi-fault-all  [kernel.kallsyms]      [k] mem_cgroup_add_lru_list
     2.48%  multi-fault-all  [kernel.kallsyms]      [k] __mem_cgroup_commit_charge

Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch
removes css_tryget() in it. (But yes, this is an extreme case.)

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
KAMEZAWA Hiroyuki 158e0a2d1b memcg: use find_lock_task_mm() in memory cgroups oom
When the OOM killer scans task, it check a task is under memcg or
not when it's called via memcg's context.

But, as Oleg pointed out, a thread group leader may have NULL ->mm
and task_in_mem_cgroup() may do wrong decision. We have to use
find_lock_task_mm() in memcg as generic OOM-Killer does.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
Daisuke Nishimura 73045c47b6 memcg: remove mem from arg of charge_common
mem_cgroup_charge_common() is always called with @mem = NULL, so it's
meaningless.  This patch removes it.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:19 -07:00
Daisuke Nishimura bd0d24bfe8 memcg: remove redundant code
- try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
  don't have to call them in task_in_mem_cgroup().
- *mz is not used in __mem_cgroup_uncharge_common().
- we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
  after we've cleared PCG_MIGRATION of @oldpage.
- remove empty comment.
- remove redundant empty line in mem_cgroup_cache_charge().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:18 -07:00