OpenCloudOS-Kernel


Hugh Dickins 9fab5619bd shmem: writepage directly to swap	2009-04-01 08:59:15 -07:00

Synopsis: if shmem_writepage calls swap_writepage directly, most shmem swap loads benefit, and a catastrophic interaction between SLUB and some flash storage is avoided.

shmem_writepage() has always been peculiar in making no attempt to write: it has just transferred a shmem page from file cache to swap cache, then let that page make its way around the LRU again before being written and freed.

The idea was that people use tmpfs because they want those pages to stay in RAM; so although we give it an overflow to swap, we should resist writing too soon, giving those pages a second chance before they can be reclaimed.

That was always questionable, and I've toyed with this patch for years; but never had a clear justification to depart from the original design.

It became more questionable in 2.6.28, when the split LRU patches classed shmem and tmpfs pages as SwapBacked rather than as file_cache: that in itself gives them more resistance to reclaim than normal file pages. I prepared this patch for 2.6.29, but the merge window arrived before I'd completed gathering statistics to justify sending it in.

Then while comparing SLQB against SLUB, running SLUB on a laptop I'd habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping tests five times slower than SLAB or SLQB - other machines slower too, but nowhere near so bad. Simpler "cp -a" swapping tests showed the same.

slub_max_order=0 brings sanity to all, but heavy swapping is too far from normal to justify such a tuning. The crucial factor on that laptop turns out to be that I'm using an SD card for swap. What happens is this:

By default, SLUB uses order-2 pages for shmem_inode_cache (and many other fs inodes), so creating tmpfs files under memory pressure brings lumpy reclaim into play. One subpage of the order is chosen from the bottom of the LRU as usual, then the other three picked out from their random positions on the LRUs.

In a tmpfs load, many of these pages will be ones which already passed through shmem_writepage, so already have swap allocated. And though their offsets on swap were probably allocated sequentially, now that the pages are picked off at random, their swap offsets are scattered.

But the flash storage on the SD card is very sensitive to having its writes merged: once swap is written at scattered offsets, performance falls apart. Rotating disk seeks increase too, but less disastrously.

So: stop giving shmem/tmpfs pages a second pass around the LRU, write them out to swap as soon as their swap has been allocated.

It's surely possible to devise an artificial load which runs faster the old way, one whose sizing is such that the tmpfs pages on their second pass are the ones that are wanted again, and other pages not.

But I've not yet found such a load: on all machines, under the loads I've tried, immediate swap_writepage speeds up shmem swapping: especially when using the SLUB allocator (and more effectively than slub_max_order=0), but also with the others; and it also reduces the variance between runs. How much faster varies widely: a factor of five is rare, 5% is common.

One load which might have suffered: imagine a swapping shmem load in a limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the swapcache was not charged, and such a load would have run quickest with the shmem swapcache never written to swap. But now swapcache is charged, so even this load benefits from shmem_writepage directly to swap.

Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h: it's silly because that will never get called; but refactoring shmem.c sensibly according to CONFIG_SWAP will be a separate task.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
File	Last commit	Date
Kconfig	nommu: make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n	2009-04-01 08:59:15 -07:00
Kconfig.debug	generic debug pagealloc	2009-04-01 08:59:13 -07:00
Makefile	generic debug pagealloc	2009-04-01 08:59:13 -07:00
allocpercpu.c	cpumask: use new cpumask_ functions in core code.	2009-03-30 22:05:16 +10:30
backing-dev.c	Move the default_backing_dev_info out of readahead.c and into backing-dev.c	2009-03-26 11:01:33 +01:00
bootmem.c	bootmem, x86: further fixes for arch-specific bootmem wrapping	2009-03-01 16:06:56 +09:00
bounce.c	bounce: don't rely on a zeroed bio_vec list	2008-12-29 08:29:52 +01:00
debug-pagealloc.c	generic debug pagealloc	2009-04-01 08:59:13 -07:00
dmapool.c	dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too	2008-04-28 08:58:20 -07:00
fadvise.c	[CVE-2009-0029] System call wrapper special cases	2009-01-14 14:15:18 +01:00
failslab.c	SLUB: failslab support	2008-12-29 11:27:46 +02:00
filemap.c	x86, mm: dont use non-temporal stores in pagecache accesses	2009-03-02 11:06:49 +01:00
filemap_xip.c	x86, mm: dont use non-temporal stores in pagecache accesses	2009-03-02 11:06:49 +01:00
fremap.c	Do not account for the address space used by hugetlbfs using VM_ACCOUNT	2009-02-10 10:48:42 -08:00
highmem.c	mm: introduce debug_kmap_atomic	2009-04-01 08:59:14 -07:00
hugetlb.c	hugetlb: chg cannot become less than 0	2009-04-01 08:59:13 -07:00
internal.h	nommu: there is no mlock() for NOMMU, so don't provide the bits	2009-04-01 08:59:14 -07:00
maccess.c	kgdb: fix optional arch functions and probe_kernel_*	2008-04-17 20:05:39 +02:00
madvise.c	[CVE-2009-0029] System call wrappers part 14	2009-01-14 14:15:24 +01:00
memcontrol.c	memcg: NULL pointer dereference at rmdir on some NUMA systems	2009-01-29 18:04:44 -08:00
memory.c	mm: page_mkwrite change prototype to match fault	2009-04-01 08:59:14 -07:00
memory_hotplug.c	mm: remove GFP_HIGHUSER_PAGECACHE	2009-01-06 15:59:01 -08:00
mempolicy.c	[CVE-2009-0029] System call wrappers part 28	2009-01-14 14:15:30 +01:00
mempool.c	spelling fixes: mm/	2007-10-20 01:27:18 +02:00
migrate.c	migration: migrate_vmas should check "vma"	2009-02-11 14:25:34 -08:00
mincore.c	[CVE-2009-0029] System call wrappers part 14	2009-01-14 14:15:24 +01:00
mlock.c	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2009-02-17 14:27:39 -08:00
mm_init.c	mm: mminit_loglevel cannot be __meminitdata anymore	2008-08-20 15:40:30 -07:00
mmap.c	Merge branch 'master' into next	2009-03-24 10:52:46 +11:00
mmu_notifier.c	mmu-notifiers: core	2008-07-28 16:30:21 -07:00
mmzone.c	mm: mark the correct zone as full when scanning zonelists	2008-09-13 14:41:52 -07:00
mprotect.c	Do not account for the address space used by hugetlbfs using VM_ACCOUNT	2009-02-10 10:48:42 -08:00
mremap.c	[CVE-2009-0029] System call wrappers part 13	2009-01-14 14:15:23 +01:00
msync.c	[CVE-2009-0029] System call wrappers part 13	2009-01-14 14:15:23 +01:00
nommu.c	uclinux: add process name to allocation error message	2009-01-27 16:42:03 +10:00
oom_kill.c	oom_kill: don't call for int_sqrt(0)	2009-04-01 08:59:11 -07:00
page-writeback.c	mm: fix proc_dointvec_userhz_jiffies "breakage"	2009-04-01 08:59:13 -07:00
page_alloc.c	vmscan: fix it to take care of nodemask	2009-04-01 08:59:15 -07:00
page_cgroup.c	memcg: use __GFP_NOWARN in page cgroup allocation	2009-02-11 14:25:35 -08:00
page_io.c	block: fix bad definition of BIO_RW_SYNC	2009-02-18 10:32:00 +01:00
page_isolation.c	memory hotplug: fix page_zone() calculation in test_pages_isolated()	2008-11-06 15:41:19 -08:00
pagewalk.c	pagemap: pass mm into pagewalkers	2008-06-12 18:05:41 -07:00
pdflush.c	cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL	2009-03-30 22:05:11 +10:30
percpu.c	percpu: generalize embedding first chunk setup helper	2009-03-10 16:27:48 +09:00
prio_tree.c	spelling fixes: mm/	2007-10-20 01:27:18 +02:00
quicklist.c	mm: size of quicklists shouldn't be proportional to the number of CPUs	2008-09-02 19:21:38 -07:00
readahead.c	Move the default_backing_dev_info out of readahead.c and into backing-dev.c	2009-03-26 11:01:33 +01:00
rmap.c	mm: fix mlocked page counter mismatch	2009-02-11 14:25:35 -08:00
shmem.c	shmem: writepage directly to swap	2009-04-01 08:59:15 -07:00
shmem_acl.c	[PATCH] sanitize ->permission() prototype	2008-07-26 20:53:14 -04:00
slab.c	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2009-03-30 17:17:35 -07:00
slob.c	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2009-03-30 17:17:35 -07:00
slub.c	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2009-03-30 17:17:35 -07:00
sparse-vmemmap.c	vmemmap: warn about page_structs with remote distance	2008-11-06 15:41:19 -08:00
sparse.c	mm: mminit_validate_memmodel_limits(): remove redundant test	2009-04-01 08:59:11 -07:00
swap.c	mm: remove pagevec_swap_free()	2009-04-01 08:59:13 -07:00
swap_state.c	memcg: mem+swap controller core	2009-01-08 08:31:05 -08:00
swapfile.c	PM/hibernate: fix "swap breaks after hibernation failures"	2009-02-21 14:17:17 -08:00
thrash.c	Bug in mm/thrash.c function grab_swap_token()	2007-05-11 08:29:32 -07:00
truncate.c	mmap: handle mlocked pages during map, remap, unmap	2008-10-20 08:52:31 -07:00
util.c	memdup_user(): introduce	2009-04-01 08:59:13 -07:00
vmalloc.c	vmap: remove needless lock and list in vmap	2009-04-01 08:59:11 -07:00
vmscan.c	vmscan: fix it to take care of nodemask	2009-04-01 08:59:15 -07:00
vmstat.c	mm: introduce for_each_populated_zone() macro	2009-04-01 08:59:11 -07:00