OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	14ffe009ca	Merge branch 'akpm' (Fixups from Andrew) Merge misc fixes from Andrew Morton: "Followups, fixes and some random stuff I found on the internet." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (11 patches) perf: fix duplicate header inclusion memcg, kmem: fix build error when CONFIG_INET is disabled rtc: kconfig: fix RTC_INTF defaults connected to RTC_CLASS rapidio: fix comment lib/kasprintf.c: use kmalloc_track_caller() to get accurate traces for kvasprintf rapidio: update for destination ID allocation rapidio: update asynchronous discovery initialization rapidio: use msleep in discovery wait mm: compaction: fix bit ranges in {get,clear,set}_pageblock_skip() arch/powerpc/platforms/pseries/hotplug-memory.c: section removal cleanups arch/powerpc/platforms/pseries/hotplug-memory.c: fix section handling code	2012-10-11 10:14:16 +09:00
Linus Torvalds	ce40be7a82	Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block Pull block IO update from Jens Axboe: "Core block IO bits for 3.7. Not a huge round this time, it contains: - First series from Kent cleaning up and generalizing bio allocation and freeing. - WRITE_SAME support from Martin. - Mikulas patches to prevent O_DIRECT crashes when someone changes the block size of a device. - Make bio_split() work on data-less bio's (like trim/discards). - A few other minor fixups." Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew Morton. It is due to the VM no longer using a prio-tree (see commit 6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree"). So make set_blocksize() use mapping_mapped() instead of open-coding the internal VM knowledge that has changed. * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits) block: makes bio_split support bio without data scatterlist: refactor the sg_nents scatterlist: add sg_nents fs: fix include/percpu-rwsem.h export error percpu-rw-semaphore: fix documentation typos fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blockdev: turn a rw semaphore into a percpu rw semaphore Fix a crash when block device is read and block size is changed at the same time block: fix request_queue->flags initialization block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() block: ioctl to zero block ranges block: Make blkdev_issue_zeroout use WRITE SAME block: Implement support for WRITE SAME block: Consolidate command flag and queue limit checks for merges block: Clean up special command handling logic block/blk-tag.c: Remove useless kfree block: remove the duplicated setting for congestion_threshold block: reject invalid queue attribute values block: Add bio_clone_bioset(), bio_clone_kmalloc() block: Consolidate bio_alloc_bioset(), bio_kmalloc() ...	2012-10-11 09:04:23 +09:00
David Rientjes	cd59085a9b	memcg, kmem: fix build error when CONFIG_INET is disabled Commit `e1aab161e0` ("socket: initial cgroup code.") causes a build error when CONFIG_INET is disabled in Linus' tree: net/built-in.o: In function `sk_update_clone': net/core/sock.c:1336: undefined reference to `sock_update_memcg' sock_update_memcg() is only defined when CONFIG_INET is enabled, so fix it by defining the dummy function without this option. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-11 08:50:16 +09:00
Chad Reese	fdb8d561e6	rapidio: fix comment The resource index for the mailboxes was incorrect. Signed-off-by: Chad Reese <kreese@caviumnetworks.com> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-11 08:50:16 +09:00
Alexandre Bounine	4ed134beee	rapidio: update for destination ID allocation Address comments provided by Andrew Morton: https://lkml.org/lkml/2012/10/3/550 - Keeps consistent kerneldoc compatible comments style for new static functions. - Removes unnecessary complexity from destination ID allocation routine. - Uses kcalloc() for code clarity. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-11 08:50:15 +09:00
Bartlomiej Zolnierkiewicz	627260595c	mm: compaction: fix bit ranges in {get,clear,set}_pageblock_skip() {get,clear,set}_pageblock_skip() use incorrect bit ranges (please compare to bit ranges used by {get,set}_pageblock_flags() used for migration types) and can overwrite pageblock migratetype of the next pageblock in the bitmap. Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Thierry Reding <thierry.reding@avionic-design.de> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-11 08:50:14 +09:00
David Howells	922cd657c9	UAPI: (Scripted) Disintegrate include/linux/can Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2012-10-10 22:24:21 +02:00
Linus Torvalds	df632d3ce7	NFS client updates for Linux 3.7 Features include: - Remove CONFIG_EXPERIMENTAL dependency from NFSv4.1 Aside from the issues discussed at the LKS, distros are shipping NFSv4.1 with all the trimmings. - Fix fdatasync()/fsync() for the corner case of a server reboot. - NFSv4 OPEN access fix: finally distinguish correctly between open-for-read and open-for-execute permissions in all situations. - Ensure that the TCP socket is closed when we're in CLOSE_WAIT - More idmapper bugfixes - Lots of pNFS bugfixes and cleanups to remove unnecessary state and make the code easier to read. - In cases where a pNFS read or write fails, allow the client to resume trying layoutgets after two minutes of read/write-through-mds. - More net namespace fixes to the NFSv4 callback code. - More net namespace fixes to the NFSv3 locking code. - More NFSv4 migration preparatory patches. Including patches to detect network trunking in both NFSv4 and NFSv4.1 - pNFS block updates to optimise LAYOUTGET calls. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQdMvBAAoJEGcL54qWCgDyV84P/0XvcEXj6kdMv9EiWfRczo7r iAwAIhiEmG1agtZa6v+Gso2MYRQbkGyJi0LKIwzGqNUi0BLQGQCoV93kB0ITVpiN g7poDTnPyoItW1oJCtC48/Mx0G5C1yrHSwFAJrXmtzDF1mwd/BIQReafYp6x+/TU Mvwm7au3Y2ySRBEDmY4zyBERHXGt//JmsZ9Ays6jewQg5ZOyjDQKoeHVYaaeJoF0 A0tQGcBSNdySagI5dt4SlkuO7AClhzVHlilep2dsBu/TLS0F2pEdHXvM2W0koZmM uazaIpzd2F7TfokTYExgsyKsqpkzpDf1kebN4Y1+Ioi7Yy30dQrX6lNaUNcOmOJQ xx694HDHV90KdRBVSFhOIHMTBRcls68hBcWib3MXWHTKX6HVgnFMwhwxGH0MRezf 3rmXoqn+CO1j5WeQmA3BqdVbHSZHi913TKEwE/qoW4pmOFhv5I2flXWQS/Rwvdng 2xDCe6TlvhMS92IpyvNEIicXLRSm+DUAmoAfSqqlifZIAEM5R29e/wCAsmVprO3B LPHyUoIMO6SZ1PL6Rk20+6qQfvCK7U/ChULsUL/zb7R88Pc3sFE2BeAvZVATsvH3 +FJWTz43fwUBoMhPsn8xSBLn/fq6az5C19syz6Fpu3DZ4X0EwyVWifiFk6HgcxZD J8ajEl+dNZeFE8rkwykX =uBk7 -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Features include: - Remove CONFIG_EXPERIMENTAL dependency from NFSv4.1 Aside from the issues discussed at the LKS, distros are shipping NFSv4.1 with all the trimmings. - Fix fdatasync()/fsync() for the corner case of a server reboot. - NFSv4 OPEN access fix: finally distinguish correctly between open-for-read and open-for-execute permissions in all situations. - Ensure that the TCP socket is closed when we're in CLOSE_WAIT - More idmapper bugfixes - Lots of pNFS bugfixes and cleanups to remove unnecessary state and make the code easier to read. - In cases where a pNFS read or write fails, allow the client to resume trying layoutgets after two minutes of read/write- through-mds. - More net namespace fixes to the NFSv4 callback code. - More net namespace fixes to the NFSv3 locking code. - More NFSv4 migration preparatory patches. Including patches to detect network trunking in both NFSv4 and NFSv4.1 - pNFS block updates to optimise LAYOUTGET calls." * tag 'nfs-for-3.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (113 commits) pnfsblock: cleanup nfs4_blkdev_get NFS41: send real read size in layoutget NFS41: send real write size in layoutget NFS: track direct IO left bytes NFSv4.1: Cleanup ugliness in pnfs_layoutgets_blocked() NFSv4.1: Ensure that the layout sequence id stays 'close' to the current NFSv4.1: Deal with seqid wraparound in the pNFS return-on-close code NFSv4 set open access operation call flag in nfs4_init_opendata_res NFSv4.1: Remove the dependency on CONFIG_EXPERIMENTAL NFSv4 reduce attribute requests for open reclaim NFSv4: nfs4_open_done first must check that GETATTR decoded a file type NFSv4.1: Deal with wraparound when updating the layout "barrier" seqid NFSv4.1: Deal with wraparound issues when updating the layout stateid NFSv4.1: Always set the layout stateid if this is the first layoutget NFSv4.1: Fix another refcount issue in pnfs_find_alloc_layout NFSv4: don't put ACCESS in OPEN compound if O_EXCL NFSv4: don't check MAY_WRITE access bit in OPEN NFS: Set key construction data for the legacy upcall NFSv4.1: don't do two EXCHANGE_IDs on mount NFS: nfs41_walk_client_list(): re-lock before iterating ...	2012-10-10 23:52:35 +09:00
Linus Torvalds	2474542f64	pwm: Changes for v3.7-rc1 All legacy PWM providers have now been moved to the PWM subsystem. The plan for 3.8 is to adapt all board files to provide a lookup table for PWM devices in order to get rid of the global namespace. Subsequently, users of the legacy pwm_request() and pwm_free() functions can be migrated to the new pwm_get() and pwm_put() functions. Once this has been completed, the legacy API and the compatibility code in the core can be removed. In addition to the above, these changes also add support for configuring the polarity of a PWM signal (currently only supported on ECAP and EHRPWM) and include a much needed rework of the i.MX driver. Managed functions to obtain and release a PWM device (devm_pwm_get() and devm_pwm_put()) have been added and the pwm-backlight driver has been updated to use them. If the PWM subsystem hasn't been enabled, dummy functions are provided that allow the subsystem to safely compile out. Some common checks on input parameters have been moved to the core and removed from the drivers. Finally, a small fix corrects the description of the PWM specifier's second cell in the device tree representation. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIcBAABAgAGBQJQcV6pAAoJEN0jrNd/PrOh594P/3NHPjnTiyAZpUICX4mJeoMd 1rX685HofCyX5jlHloS1pLRXYtIPZP97We5crNR/k/RujKcHAqAFzLtG2Jj93gxm nRUwLZ4xbrAgeaZeJV8YG1naWx9R1OjrK2sv+FyHRWsn9UZus2vDMV7tB0oNWo9k DMhtF1q7gQxx4nRu8z9FFybyItQinVREODNs+Gdv6AOJ36t+mOsGPL3LKZeFsIG3 LdaB1p3t/07Job4KwGYYRjkOR7bIQ+ADnnO8Pkk0a9lxkMI8rA0uFbLwgyx7Hohh xZtm8R5il2IlHwqh52RzpShceWICxCZ3htuIXTRbVa1PSEsBzZ66W2CT7sy1jPm5 lJlHgc7abaiwijtoG76I+Lgslv5UTViBuIYHSX4dH2kF3Jgm2SgTBNzZYePI5MMT BI3xc9WnwBpxgMybJsPHfU0Ff6O/9h9ab7q5lBOHkQJittahYxzWD+FgYlalcAa+ I919iFREKNZytewLnXu+4TG0RfacQoj6D5G+SquOGmXyKYRhhJ1O6WodnRZHi/0n M6J0f6fLxpomI1RMn44VYP43OPJCDYxgXYoDLwCZQL9tSWYpdhXoDrjcIPO547w1 ZS5HTXUE4DrxeYjoeqL555vtQIooM9CuH4JwKlz5zzea8NPC8eVtmNFTTWE6wKWX ZIOOLVvw/FI0/DSjBb8H =MQ5V -----END PGP SIGNATURE----- Merge tag 'for-3.7-rc1' of git://gitorious.org/linux-pwm/linux-pwm Pull pwm changes from Thierry Reding: "All legacy PWM providers have now been moved to the PWM subsystem. The plan for 3.8 is to adapt all board files to provide a lookup table for PWM devices in order to get rid of the global namespace. Subsequently, users of the legacy pwm_request() and pwm_free() functions can be migrated to the new pwm_get() and pwm_put() functions. Once this has been completed, the legacy API and the compatibility code in the core can be removed. In addition to the above, these changes also add support for configuring the polarity of a PWM signal (currently only supported on ECAP and EHRPWM) and include a much needed rework of the i.MX driver. Managed functions to obtain and release a PWM device (devm_pwm_get() and devm_pwm_put()) have been added and the pwm-backlight driver has been updated to use them. If the PWM subsystem hasn't been enabled, dummy functions are provided that allow the subsystem to safely compile out. Some common checks on input parameters have been moved to the core and removed from the drivers. Finally, a small fix corrects the description of the PWM specifier's second cell in the device tree representation." * tag 'for-3.7-rc1' of git://gitorious.org/linux-pwm/linux-pwm: (23 commits) pwm: dt: Fix description of second PWM cell pwm: Check for negative duty-cycle and period pwm: Add Ingenic JZ4740 support MIPS: JZ4740: Export timer API pwm: Move PUV3 PWM driver to PWM framework unicore32: pwm: Use managed resource allocations unicore32: pwm: Remove unnecessary indirection unicore32: pwm: Use module_platform_driver() unicore32: pwm: Properly remap memory-mapped registers pwm-backlight: Use devm_pwm_get() instead of pwm_get() pwm: Move AB8500 PWM driver to PWM framework pwm: Fix compilation error when CONFIG_PWM is not defined pwm: i.MX: fix clock lookup pwm: i.MX: use per clock unconditionally pwm: i.MX: add devicetree support pwm: i.MX: Use module_platform_driver pwm: i.MX: add functions to enable/disable pwm. pwm: i.MX: remove unnecessary if in pwm_[en\|dis]able pwm: i.MX: factor out SoC specific functions pwm: pwm-tiehrpwm: Add support for configuring polarity of PWM ...	2012-10-10 20:15:24 +09:00
Linus Torvalds	c7a6ced9d8	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds Pull LED subsystem update from Bryan Wu. * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: (24 commits) leds: add output driver configuration for pca9633 led driver leds: lm3642: Use regmap_update_bits() in lm3642_chip_init() leds: Add new LED driver for lm3642 chips leds-lp5523: Fix riskiness of the page fault leds-lp5523: turn off the LED engines on unloading the driver leds-lm3530: Fix smatch warnings leds-lm3530: Use devm_regulator_get function leds: leds-gpio: adopt pinctrl support leds: Add new LED driver for lm355x chips leds-lp5523: use the i2c device id rather than fixed name leds-lp5523: add new device id for LP55231 leds-lp5523: support new LP55231 device leds: triggers: send uevent when changing triggers leds-lp5523: minor code style fixes leds-lp5523: change the return type of lp5523_set_mode() leds-lp5523: set the brightness to 0 forcely on removing the driver leds-lp5523: add channel name in the platform data leds: leds-gpio: Use of_get_child_count() helper leds: leds-gpio: Use platform_{get,set}_drvdata leds: leds-gpio: use of_match_ptr() ...	2012-10-10 20:14:07 +09:00
Linus Torvalds	a188e7e93a	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull scsi target updates from Nicholas Bellinger: "Things have been calm for the most part with no new fabric drivers in flight for v3.7 (we're up to eight now !), so this update is primarily focused on addressing a few long-standing items within target-core and iscsi-target fabric code. The highlights include: - target: Simplify fabric sense data length handling (roland) - qla2xxx: Fix endianness of task management response code (roland) - target: fix truncation of mode data, support zero allocation length (paolo) - target: Properly support zero-length commands in normal processing path (paolo) - iscsi-target: Correctly set 0xffffffff field within ISCSI_OP_REJECT PDU (ronnie + nab) - iscsi-target: Add explicit set of cache_dynamic_acls=1 for TPG demo-mode (ronnie + nab) - target/file: Re-enable optional fd_buffered_io=1 operation (nab + hch) - iscsi-target: Add MaxXmitDataSegmenthLength forr target -> initiator MDRSL declaration (nab) - target: Add target_submit_cmd_map_sgls for SGL fabric memory passthrough (nab + hch) - tcm_loop: Convert I/O path to use target_submit_cmd_map_sgls (hch + nab) - tcm_vhost: Convert I/O path to use target_submit_cmd_map_sgls (nab + hch) The last series for adding a new target_submit_cmd_map_sgls() fabric caller (as requested by hch) that accepts pre-allocated SGL memory (using existing logic), along with converting tcm_loop + tcm_vhost has only been in -next for the last days, but has gotten enough review +testing and is clear enough a mechanical change that I think it's reasonable to merge for -rc1 code. Thanks again to everyone who contributed this round! Extra special thanks to Roland (PureStorage) for tracking down the qla2xxx target TMR response code endian issue, and to Paolo (Redhat) for resolving the long standing zero-length CDB issues within target-core between virtual and pSCSI backends." * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (44 commits) iscsi-target: Bump defaults for nopin_timeout + nopin_response_timeout values iscsit: proper endianess conversions iscsit: use the itt_t abstract type iscsit: add missing endianess conversion in iscsit_check_inaddr_any iscsit: remove incorrect unlock in iscsit_build_sendtargets_resp iscsit: mark various functions static target/iscsi: precedence bug in iscsit_set_dataout_sequence_values() target/usb-gadget: strlen() doesn't count the terminator target/usb-gadget: remove duplicate initialization tcm_vhost: Convert I/O path to use target_submit_cmd_map_sgls target: Add control CDB READ payload zero work-around tcm_loop: Convert I/O path to use target_submit_cmd_map_sgls target: Add target_submit_cmd_map_sgls for SGL fabric memory passthrough iscsi-target: Add explicit set of cache_dynamic_acls=1 for TPG demo-mode iscsi-target: Change iscsi_target_seq_pdu_list.c to honor MaxXmitDataSegmentLength iscsi-target: Add MaxXmitDataSegmentLength connection recovery check iscsi-target: Convert incoming PDU payload checks to MaxXmitDataSegmentLength iscsi-target: Enable MaxXmitDataSegmentLength operation in login path iscsi-target: Add base MaxXmitDataSegmentLength code target/file: Re-enable optional fd_buffered_io=1 operation ...	2012-10-10 19:52:19 +09:00
Rusty Russell	106a4ee258	module: signature checking hook We do a very simple search for a particular string appended to the module (which is cache-hot and about to be SHA'd anyway). There's both a config option and a boot parameter which control whether we accept or fail with unsigned modules and modules that are signed with an unknown key. If module signing is enabled, the kernel will be tainted if a module is loaded that is unsigned or has a signature for which we don't have the key. (Useful feedback and tweaks by David Howells <dhowells@redhat.com>) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2012-10-10 20:00:55 +10:30
Lai Jiangshan	4b2c551f77	lglock: add DEFINE_STATIC_LGLOCK() When the lglock doesn't need to be exported we can use DEFINE_STATIC_LGLOCK(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-10 01:15:44 -04:00
Lai Jiangshan	466cab878e	lglock: make the per_cpu locks static The per_cpu locks are not used outside the file which contains the DEFINE_LGLOCK(), so we can make these symbols static. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-10 01:15:44 -04:00
Lai Jiangshan	462e1e1bc8	lglock: remove unused DEFINE_LGLOCK_LOCKDEP() struct lglocks use their own lock_key/lock_dep_map which are defined in struct lglock. DEFINE_LGLOCK_LOCKDEP() is unused, so remove it and save a small piece of memory. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-10 01:15:44 -04:00
Al Viro	614c321f4b	MAX_LFS_FILESIZE definition for 64bit needs LL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-10 01:06:20 -04:00
Marco Stornelli	8e22cc88d6	vfs: drop lock/unlock super Removed s_lock from super_block and removed lock/unlock super. Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-09 23:33:39 -04:00
Linus Torvalds	42859eea96	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull generic execve() changes from Al Viro: "This introduces the generic kernel_thread() and kernel_execve() functions, and switches x86, arm, alpha, um and s390 over to them." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits) s390: convert to generic kernel_execve() s390: switch to generic kernel_thread() s390: fold kernel_thread_helper() into ret_from_fork() s390: fold execve_tail() into start_thread(), convert to generic sys_execve() um: switch to generic kernel_thread() x86, um/x86: switch to generic sys_execve and kernel_execve x86: split ret_from_fork alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve() alpha: switch to generic kernel_thread() alpha: switch to generic sys_execve() arm: get rid of execve wrapper, switch to generic execve() implementation arm: optimized current_pt_regs() arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve() arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk] generic sys_execve() generic kernel_execve() new helper: current_pt_regs() preparation for generic kernel_thread() um: kill thread->forking um: let signal_delivered() do SIGTRAP on singlestepping into handler ...	2012-10-10 12:02:25 +09:00
Florian Tobias Schandinat	0febd3bccf	Omapdss driver changes for the 3.7 merge window. Notable changes: * Basic writeback support for DISPC level. Writeback is not yet usable, though, as we need higher level code to actually expose the writeback feature to userspace. * Rewriting the omapdss output drivers. We're trying to remove the hard links between the omapdss and the panels, and this rewrite work moves us closer to that goal. * Cleanup and restructuring patches that have been made while working on device tree support for omapdss. Device tree support is still some way ahead, but these patches are good cleanups in themselves. * Basic OMAP5 DSS support for DPI and DSI outputs. * Workaround for the problem that GFX overlay's fifo is too small for high resolution scenarios, causing underflows. * Cleanups that remove dependencies to omap platform code. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQae8PAAoJEPo9qoy8lh71l6EP/1ot2bJ1+8zkTO/N73SiMn/3 +6yGfX411Z9GNsEiDqgGKMboDpU7+Fqi0nO5mgPbbE936Lu92TA6DF/S6rr79iaQ 2faxqRiWYXkVfXo806fo0BMeprB6ieQNBaEXtRmMK+tlviZtgFfCy5+9oq4AxP64 zWuFl/lPuSkhzjU2hHygETHpccsxtNI7EiJf7UoBLBCKi/eU/E5XJKUJUBwhX68R SA7SrpLfTNErAgdY4aMeLKprreN8VhJ7gp9+FCdnD7u4sawGpDIcF577EEc50Fjy aZbDkQkL8p7zw8g/ejBsY9zlsZZk7LFOSkXhfZCstnhTvF9ugm9S6Ax2rqu3tVfd M5B8n/I4njSFyQOxXyILfrGJdP/Fnpu6wlYKv+pNDQWspxo1saCu4Uju041A2lHy HaanqxCvpiHAIjqzfOvtLDfGxqZs37QEER0mDlg7CR2iErw5aAyjfy/CHpjTJ6Ib 2X0ho0y5+ndxkuVhFp3+fZaDQ7Y2HVc0EduzaY6OvzqouTWq4pBgeLIoXXYdHtsp wXfEPKghSZ6lA4wHYBRJppTLk3e56V6lVmC2/N+7PQkPm+Sfde1OAd8iqI0Ytm12 oL780L4XoWIbQml1ZZ6ZFwEO/2QnAqufzD7djMGkwm4BJb/04m4M2W2xWOEZ/R7T 0Xh/yWRDzQyrnzmyqg7K =lcyJ -----END PGP SIGNATURE----- Merge tag 'omapdss-for-3.7' of git://gitorious.org/linux-omap-dss2/linux into fbdev-next Omapdss driver changes for the 3.7 merge window. Notable changes: * Basic writeback support for DISPC level. Writeback is not yet usable, though, as we need higher level code to actually expose the writeback feature to userspace. * Rewriting the omapdss output drivers. We're trying to remove the hard links between the omapdss and the panels, and this rewrite work moves us closer to that goal. * Cleanup and restructuring patches that have been made while working on device tree support for omapdss. Device tree support is still some way ahead, but these patches are good cleanups in themselves. * Basic OMAP5 DSS support for DPI and DSI outputs. * Workaround for the problem that GFX overlay's fifo is too small for high resolution scenarios, causing underflows. * Cleanups that remove dependencies to omap platform code.	2012-10-10 02:16:30 +00:00
Linus Torvalds	aac2b1f574	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking updates from David Miller: 1) UAPI changes for networking from David Howells 2) A netlink dump is an operation we can sleep within, and therefore we need to make sure the dump provider module doesn't disappear on us meanwhile. Fix from Gao Feng. 3) Now that tunnels support GRO, we have to be more careful in skb_gro_reset_offset() otherwise we OOPS, from Eric Dumazet. 4) We can end up processing packets for VLANs we aren't actually configured to be on, fix from Florian Zumbiehl. 5) Fix routing cache removal regression in redirects and IPVS. The core issue on the IPVS side is that it wants to rewrite who the nexthop is and we have to explicitly accomodate that case. From Julian Anastasov. 6) Error code return fixes all over the networking drivers from Peter Senna Tschudin. 7) Fix routing cache removal regressions in IPSEC, from Steffen Klassert. 8) Fix deadlock in RDS during pings, from Jeff Liu. 9) Neighbour packet queue can trigger skb_under_panic() because we do not reset the network header of the SKB in the right spot. From Ramesh Nagappa. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) RDS: fix rds-ping spinlock recursion netdev/phy: Prototype of_mdio_find_bus() farsync: fix support for over 30 cards be2net: Remove code that stops further access to BE NIC based on UE bits pch_gbe: Fix build error by selecting all the possible dependencies. e1000e: add device IDs for i218 ixgbe/ixgbevf: Limit maximum jumbo frame size to 9.5K to avoid Tx hangs ixgbevf: Set the netdev number of Tx queues UAPI: (Scripted) Disintegrate include/linux/tc_ematch UAPI: (Scripted) Disintegrate include/linux/tc_act UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv6 UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv4 UAPI: (Scripted) Disintegrate include/linux/netfilter_bridge UAPI: (Scripted) Disintegrate include/linux/netfilter_arp UAPI: (Scripted) Disintegrate include/linux/netfilter/ipset UAPI: (Scripted) Disintegrate include/linux/netfilter UAPI: (Scripted) Disintegrate include/linux/isdn UAPI: (Scripted) Disintegrate include/linux/caif net: fix typo in freescale/ucc_geth.c vxlan: fix more sparse warnings ...	2012-10-10 11:12:54 +09:00
Linus Torvalds	b7e97d2211	Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dma Pull slave-dmaengine updates from Vinod Koul: "This time we have Andy updates on dw_dmac which is attempting to make this IP block available as PCI and platform device though not fully complete this time. We also have TI EDMA moving the dma driver to use dmaengine APIs, also have a new driver for mmp-tdma, along with bunch of small updates. Now for your excitement the merge is little unusual here, while merging the auto merge on linux-next picks wrong choice for pl330 (drivers/dma/pl330.c) and this causes build failure. The correct resolution is in linux-next. (DMA: PL330: Fix build error) I didn't back merge your tree this time as you are better than me so no point in doing that for me :)" Fixed the pl330 conflict as in linux-next, along with trivial header file conflicts due to changed includes. * 'next' of git://git.infradead.org/users/vkoul/slave-dma: (29 commits) dma: tegra: fix interrupt name issue with apb dma. dw_dmac: fix a regression in dwc_prep_dma_memcpy dw_dmac: introduce software emulation of LLP transfers dw_dmac: autoconfigure data_width or get it via platform data dw_dmac: autoconfigure block_size or use platform data dw_dmac: get number of channels from hardware if possible dw_dmac: fill optional encoded parameters in register structure dw_dmac: mark dwc_dump_chan_regs as inline DMA: PL330: return ENOMEM instead of 0 from pl330_alloc_chan_resources DMA: PL330: Remove redundant runtime_suspend/resume functions DMA: PL330: Remove controller clock enable/disable dmaengine: use kmem_cache_zalloc instead of kmem_cache_alloc/memset DMA: PL330: Set the capability of pdm0 and pdm1 as DMA_PRIVATE ARM: EXYNOS: Set the capability of pdm0 and pdm1 as DMA_PRIVATE dma: tegra: use list_move_tail instead of list_del/list_add_tail mxs/dma: Enlarge the CCW descriptor area to 4 pages dw_dmac: utilize slave_id to pass request line dmaengine: mmp_tdma: add dt support dmaengine: mmp-pdma support spi: davici - make davinci select edma ...	2012-10-10 11:10:41 +09:00
Linus Torvalds	943c2acea5	MMC highlights for 3.7: Core: - Add DT properties for card detection (broken-cd, cd-gpios, non-removable) - Don't poll non-removable devices - Fixup/rework eMMC sleep mode/"power off notify" feature - Support eMMC background operations (BKOPS). To set the one-time programmable fuse that enables bkops on an eMMC that doesn't already have it set, you can use the "mmc bkops enable" command in: git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc-utils.git Drivers: - atmel-mci, dw_mmc, pxa-mci, dove, s3c, spear: Add device tree support - bfin_sdh: Add support for the controller in bf60x - dw_mmc: Support Samsung Exynos SoCs - eSDHC: Add ADMA support - sdhci: Support testing a cd-gpio (from slot-gpio) instead of presence bit - sdhci-pltfm: Support broken-cd DT property - tegra: Convert to only supporting DT (mach-tegra has gone DT-only) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQcfc/AAoJEHNBYZ7TNxYMB3wQALPObZUjKBsR38N2llPUOz5M nTMNYa99Pg3/Du5EgXKwYDkoYG1M9yjNTdxBmz3Sz9cIkLueZHoDmqvpgZJv9vRn 5l0TncExC+T2Tn7qjE5axgM7fus5r9SLKCOtbE+V8jATTWeG0d/X0DdzvKPpJLmb uLPmqNG50LdQQUoCkcDU3hiDONqQnOx4tDq4C7rTlf+Cr8pJXRoBPEF0C1PTvs64 0AP8oXDtirz+RIR5xTELy08o4SVS4Wcn63PH1H56kmAIjdT5FeVnAOeyF9Aer/+R Sz3qMrN/sNSEEkbgCGQLJVVYACNdgB1WXdxhqk2d996iwtEQgkVB8S2ziyhpZTZ2 SxgCMvfYf5ySOIuzvyEScGdKjw6DSV01HDr9eyFJqIYaDOBp+kUJkbM2O8ISf+Kb rudrc58mdfPPhX5rqjEYBKGtyC6q+LvRGOwO8QJNvZ0wAFAg4nCzcD9btAl65QR1 7aM0qp+55pyc2xyUO9q5AvOwiaBU2sYYuBCUm1zzK3HQ8x5ZKpueQwa2KBmEa2f+ Qp6oflWNeG/X53WHCurl/ECZY5Y4w4esHPMWXVXJP8Ao+3D2Wofkp4CSGcQClZSd /OBGSw9g70BIKwOTUvU9tD3ALQsG+A9UHmG7RQBhmcQFaKY709bfhzSG3/jHymSg AKr0VSezE/DTj6URvxaq =qyY5 -----END PGP SIGNATURE----- Merge tag 'mmc-merge-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc Pull MMC updates from Chris Ball: "Core: - Add DT properties for card detection (broken-cd, cd-gpios, non-removable) - Don't poll non-removable devices - Fixup/rework eMMC sleep mode/"power off notify" feature - Support eMMC background operations (BKOPS). To set the one-time programmable fuse that enables bkops on an eMMC that doesn't already have it set, you can use the "mmc bkops enable" command in: git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc-utils.git Drivers: - atmel-mci, dw_mmc, pxa-mci, dove, s3c, spear: Add device tree support - bfin_sdh: Add support for the controller in bf60x - dw_mmc: Support Samsung Exynos SoCs - eSDHC: Add ADMA support - sdhci: Support testing a cd-gpio (from slot-gpio) instead of presence bit - sdhci-pltfm: Support broken-cd DT property - tegra: Convert to only supporting DT (mach-tegra has gone DT-only)" * tag 'mmc-merge-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (67 commits) mmc: core: Fixup broken suspend and eMMC4.5 power off notify mmc: sdhci-spear: Add clk_{un}prepare() support mmc: sdhci-spear: add device tree bindings mmc: sdhci-s3c: Add clk_(enable/disable) in runtime suspend/resume mmc: core: Replace MMC_CAP2_BROKEN_VOLTAGE with test for fixed regulator mmc: sdhci-pxav3: Use sdhci_get_of_property for parsing DT quirks mmc: dt: Support "broken-cd" property in sdhci-pltfm mmc: sdhci-s3c: fix the wrong number of max bus clocks mmc: sh-mmcif: avoid oops on spurious interrupts mmc: sh-mmcif: properly handle MMC_WRITE_MULTIPLE_BLOCK completion IRQ mmc: sdhci-s3c: Fix crash on module insertion for second time mmc: sdhci-s3c: Enable only required bus clock mmc: Revert "mmc: dw_mmc: Add check for IDMAC configuration" mmc: mxcmmc: fix bug that may block a data transfer forever mmc: omap_hsmmc: Pass on the suspend failure to the PM core mmc: atmel-mci: AP700x PDC is not connected to MCI mmc: atmel-mci: DMA can be used with other controllers mmc: mmci: use clk_prepare_enable and clk_disable_unprepare mmc: sdhci-s3c: Add device tree support mmc: dw_mmc: add support for exynos specific implementation of dw-mshc ...	2012-10-10 10:58:42 +09:00
Linus Torvalds	10f39f04b2	MTD merge for 3.7 - Disable broken mtdchar mmap() on MMU systems - Additional ECC tests for NAND flash, and some test cleanups - New NAND and SPI chip support - Fixes/cleanup for SH FLCTL NAND controller driver - Improved hardware support for GPMI NAND controller - Conversions to device-tree support for various drivers - Removal of obsolete drivers (sbc8xxx, bcmring, etc.) - New LPC32xx drivers for MLC and SLC NAND - Further cleanup of NAND OOB/ECC handling - UAPI cleanup merge from David Howells (just moving files, since MTD headers were sorted out long ago to separate user-visible from kernel bits) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEABECAAYFAlB0OosACgkQdwG7hYl686OSPACeKLrlHmyG8KXgAqcGZwAj1RM+ X9YAoI2Kd6Sz8v6sLbJidnxUBr/oJVa8 =/kFV -----END PGP SIGNATURE----- Merge tag 'for-linus-20121009' of git://git.infradead.org/mtd-2.6 Pull MTD updates from David Woodhouse: - Disable broken mtdchar mmap() on MMU systems - Additional ECC tests for NAND flash, and some test cleanups - New NAND and SPI chip support - Fixes/cleanup for SH FLCTL NAND controller driver - Improved hardware support for GPMI NAND controller - Conversions to device-tree support for various drivers - Removal of obsolete drivers (sbc8xxx, bcmring, etc.) - New LPC32xx drivers for MLC and SLC NAND - Further cleanup of NAND OOB/ECC handling - UAPI cleanup merge from David Howells (just moving files, since MTD headers were sorted out long ago to separate user-visible from kernel bits) * tag 'for-linus-20121009' of git://git.infradead.org/mtd-2.6: (168 commits) mtd: Disable mtdchar mmap on MMU systems UAPI: (Scripted) Disintegrate include/mtd mtd: nand: detect Samsung K9GBG08U0A, K9GAG08U0F ID mtd: nand: decode Hynix MLC, 6-byte ID length mtd: nand: increase max OOB size to 640 mtd: nand: add generic READ ID length calculation functions mtd: nand: split simple ID decode into its own function mtd: nand: split extended ID decoding into its own function mtd: nand: split BB marker options decoding into its own function mtd: nand: remove redundant ID read mtd: nand: remove unnecessary variable mtd: docg4: add missing HAS_IOMEM dependency mtd: gpmi: initialize the timing registers only one time mtd: gpmi: add EDO feature for imx6q mtd: gpmi: do not set the default values for the extra clocks mtd: gpmi: simplify the DLL setting code mtd: gpmi: add a new field for HW_GPMI_CTRL1 mtd: gpmi: do not get the clock frequency in gpmi_begin() mtd: gpmi: add a new field for HW_GPMI_TIMING1 mtd: add helpers to get the supportted ONFI timing mode ...	2012-10-10 10:51:35 +09:00
Linus Torvalds	72055425e5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "This is a large pull, with the bulk of the updates coming from: - Hole punching - send/receive fixes - fsync performance - Disk format extension allowing more hardlinks inside a single directory (btrfs-progs patch required to enable the compat bit for this one) I'm cooking more unrelated RAID code, but I wanted to make sure this original batch makes it in. The largest updates here are relatively old and have been in testing for some time." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits) btrfs: init ref_index to zero in add_inode_ref Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer Btrfs: fix page leakage Btrfs: do not warn_on when we cannot alloc a page for an extent buffer Btrfs: don't bug on enomem in readpage Btrfs: cleanup pages properly when ENOMEM in compression Btrfs: make filesystem read-only when submitting barrier fails Btrfs: detect corrupted filesystem after write I/O errors Btrfs: make compress and nodatacow mount options mutually exclusive btrfs: fix message printing Btrfs: don't bother committing delayed inode updates when fsyncing btrfs: move inline function code to header file Btrfs: remove unnecessary IS_ERR in bio_readpage_error() btrfs: remove unused function btrfs_insert_some_items() Btrfs: don't commit instead of overcommitting Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called Btrfs: be smarter about dropping things from the tree log Btrfs: don't lookup csums for prealloc extents Btrfs: cache extent state when writing out dirty metadata pages Btrfs: do not hold the file extent leaf locked when adding extent item ...	2012-10-10 10:49:20 +09:00
J. Bruce Fields	f474af7051	UAPI Disintegration 2012-10-09 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIVAwUAUHPmWxOxKuMESys7AQKN4w//XDwALfbf0MXIw+gwyRiUtJe9mGexvI6X 1R4FWU9a3ImzEZP4cWnmPGT2wmC/x007DcIvx8cyvbdlSuqtR2i/DC+HbWabiLRn nJS7Eer1BJvLv5dn6NmXMEz7yB4Z46+frcmBs3WQeR0sqBMDm+rjQzCqECznO8Jc VtCbox+VR2DuWcM++YECTblYEH3Z+doDXUN2eBaD8L9x3klPbPXD7OcRyOnry8w+ ynmUTKKyH4+hpxDakYrObPIg+vFCxb4QRck1mlgA4wbvb3eqjhM0oOCYJ8GvmILA vdFYztWCjkiuOl5djtXBlsClX8SAMOBYlRed+R1GvjNCSR+WCWrFJJ2F8qoQ1w87 9ts2/8qrozS8luTB475SkT2uLdJkIUKX89Oh+dWeE8YkbPnRPj5lNAdtNY5QSyDq VaRpIo+YfmZygyvHJQlAXBuZ0mvzcPzArfcPgSVTD3B7xTEGVu/45V7SnQX5os/V v39ySPXMdGOIdvK51gw7OtZl64uqrEKu39PyYDX/GUADflp/CHD0J7PJrQePbsH9 AQolVZDIxTfKqYQnUdL8+C8Zc24RowEzz3c2+aO89MSzwGqev3q8sXRVbW/Iqryg p+V3nHe+ipKcga5tOBlPr9KDtDd7j3xN2yaIwf5/QyO1OHBpjAZP1gjSVDcUcwpi svYy4kPn3PA= =etoL -----END PGP SIGNATURE----- nfs: disintegrate UAPI for nfs This is to complete part of the Userspace API (UAPI) disintegration for which the preparatory patches were pulled recently. After these patches, userspace headers will be segregated into: include/uapi/linux/.../foo.h for the userspace interface stuff, and: include/linux/.../foo.h for the strictly kernel internal stuff. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-09 18:35:22 -04:00
Thomas Gleixner	db8c246937	Merge branch 'fortglx/3.7/time' of git://git.linaro.org/people/jstultz/linux into timers/core	2012-10-09 21:20:05 +02:00
David S. Miller	3db6857c91	UAPI Disintegration 2012-10-09 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIVAwUAUHPmWhOxKuMESys7AQLkBQ//esqQY0l2+3thGTMtDnmEbb+p+FJaUwEd HmGZS6BBsFYB5+LzLmeQORvzUfMSVnZhICxhozLF6UsL2VaovMU39BE60Yr61Hou mOBqRmb9lqKCvhsMRzI+xcYV6VxvDSi+0vz7HnO5wx5AmCuiKad3FBjFoaG8kQDc Z7wSUWLlYyzAF8U8REyOwd1WfoAW3e2pAq/v762AvSlO06e0FcFPoAgYa69kRjob gQLOLmcki8ynasSOSeSDgOOgjustKCkOAnu/FQ98b4im6ZQNmKkOiiBGXIAx46A9 1jDC6IHzpM6Jcug1lkYd+M8TUOZR8WpUDWCGemb20NQ5453iRYoo0FOTgUaAgw9U 2RussndU9WBsujAH/zvDAztDzVUt3oVOFdtpvW+QUwt+Y5NDdbd6lGYw/Jlzjtdv mzKdYu0R1Zq8whx2WRG2m3wW9K+6yoDoEsfAeYR5uD3sdGiJBG0avtjH7LSkuzeD CRVamImvyp2mCGiK1CxSax1BizfvA11fDufHSUIeJMf4jco70sAKZt7oNgvKye4f EW3fhMN4PpCm4yuzxZzNSZ6bFlXplmNOSGdE4B9kxkV5VogNbI8BpOmqGpAh0obr NQvU6t4RStIFmN9NGLkUsboQL33/850ryDKb5nRnomEFK9eHTi4U96zA1JR4c9XM qWy+e1cbNQU= =+AqB -----END PGP SIGNATURE----- Merge tag 'disintegrate-isdn-20121009' of git://git.infradead.org/users/dhowells/linux-headers UAPI Disintegration 2012-10-09 Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-09 13:16:33 -04:00
David S. Miller	8a3ddb88fb	UAPI Disintegration 2012-10-09 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIVAwUAUHPmWxOxKuMESys7AQJA5xAAqH/0RgiHXNdD92N9LABONVjQ3Jr5RO8u xH7omFJ4J+uHuwsqF2fk2lfqKLQtAGQ7ZUFT2l8oeZz3mfB/Ijd4TfF1wtKOzZiQ nzjaCSB3+ZKR0uCcVyzma8bbAEykJx0oKOyBlzKUmOnq9gN+pEmw8q8iCsfPhRxF hKjh7I14dyFlchjgnCOHZ8QQADVcKAta2K51FRSgble4k4fz5UL0aS+7/GOxcxuD orysa4DqAzdVTsHRVHV/tgA1I0NKa1PyHaq7xSZp0iVsR31adMmnF/zlsLlIj66a gHy8167JaxCi+h38NI9QRUFaNtpV0dz6tcOh/kknrrF50YmP64rTwBZ3L04tDWIK VdgSND3c/1/nXAFPem3CiTImQKk3w7qbPlPf++/QHLz4Q6JXQVEnpB6pJdyIThPA 86+R23jwuaGhLAHzBL1eh4TypmzuGIpczw6aOY8F6izVktRvqDr1rv+sET0fEjTW lU7ZWN3y0e1XBqci+dLPpFzCK7zUpi8FM/XVbEFZTgWRKS6d2YBGc4NEvblG891K NhbO7gfLh62A96eKkOZN7BakKwpwBiWoHW+tbphZ9JYVNVIPs8HXadSNK5Zaa4lh z96hTnYuwHs3b1E3qGJ32Etip1ZU5p5ZRyHCeR4KCJe/GBibkj21Fbra7ToMVvYH NFD0QfFgmn0= =8MB7 -----END PGP SIGNATURE----- Merge tag 'disintegrate-net-20121009' of git://git.infradead.org/users/dhowells/linux-headers UAPI Disintegration 2012-10-09 Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-09 13:15:29 -04:00
David S. Miller	8dd9117cc7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux Pulled mainline in order to get the UAPI infrastructure already merged before I pull in David Howells's UAPI trees for networking. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-09 13:14:32 -04:00
David Woodhouse	ffe3150125	UAPI Disintegration 2012-10-09 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIVAwUAUHPmWhOxKuMESys7AQLCZRAAsZAuAK0MxZ4iuq/+fmy7Uxb1jrzLOYSb 3UgbTgXAjR0WAUHNegVZLX1Xc+12KxvMCj/8sO62Ai+wtgHeDAuUl2T0FbSZjlGK qqx/qQqTFHUfJRbm3Lu9iarZ2K49v1kTDk4C+nC8J9mEEW4WFlVPD10n90j+4hxr ZCEYril7qOQQV65oor3BT2V64+X1WDHriTLugH1o8RziRF9jh6Z2hgZAWnThcGxu lPsmXF2e7jDqGcM3gWtxZWu/yTBPxw549R+JUg4aVKho9WI5ClyjNAKnE7wtd3iW HyrylRH+ch2oeYFa5+xoyopRARUUPmujKaHU+ZI1o++eNzuw5JYiwuMlZBLyUc9I foWMSUw31U7695exyf66HiH7GEKI1PVpgJVNu41eJvl0iWSWCpKCB6Gs8Sw4xnp2 auUCYSniXHNTFhFktjNdIUAn0+1X/b/SEfb/id4GvLp1K98QGOfe8dMCC8hEnXiF 4iIViM8Sv1GB1us5huSjbMeRPbZ3x/loqEpApfgcaqcyrUR29FTE/lFQ4fj9xviL JjckPLMMZb4Ho5wrkCi5NtXJ16mx1qKzbBGDdqzmqaNdN+08rNF//kA9m9hCwgD8 XfAV286DKDC0SllZIG+Uz7YLnSZjNAUhjvWN3ipV+SdT5DGybL3uSW5tYiSAzI2E 3cayGTWINMg= =U9Qq -----END PGP SIGNATURE----- Merge tag 'disintegrate-mtd-20121009' of git://git.infradead.org/users/dhowells/linux-headers UAPI Disintegration 2012-10-09 Conflicts: MAINTAINERS arch/arm/configs/bcmring_defconfig arch/arm/mach-imx/clk-imx51-imx53.c drivers/mtd/nand/Kconfig drivers/mtd/nand/bcm_umi_nand.c drivers/mtd/nand/nand_bcm_umi.h drivers/mtd/nand/orion_nand.c	2012-10-09 15:04:25 +01:00
Li Zhong	329a402cb0	[SCSI] Shorten the path length of scsi_cmd_to_driver() This patch tries to shorten the path length of scsi_cmd_to_driver(). As only REQ_TYPE_BLOCK_PC commands can be submitted without a driver, so we could avoid the related NULL checking, as long as we make sure we don't use it for REQ_TYPE_BLOCK_PC type commands. Plus, this fixes a bug where you get different behaviors from REQ_TYPE_BLOCK_PC commands when a driver is and isn't attached. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2012-10-09 12:04:42 +01:00
Paolo Bonzini	865b58c05b	[SCSI] virtio-scsi: support online resizing of disks Support the LUN parameter change event. Currently, the host fires this event when the capacity of a disk is changed from the virtual machine monitor. The resize then appears in the kernel log like this: sd 0:0:0:0: [sda] 46137344 512-byte logical blocks: (23.6 GB/22.0 GIb) sda: detected capacity change from 22548578304 to 23622320128 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2012-10-09 11:24:47 +01:00
David Howells	72503791ed	UAPI: (Scripted) Disintegrate include/xen Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:15 +01:00
David Howells	4a8e43feea	UAPI: (Scripted) Disintegrate include/mtd Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:09 +01:00
David Howells	1e256b340d	UAPI: (Scripted) Disintegrate include/linux/wimax Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:08 +01:00
David Howells	b6bf56c6be	UAPI: (Scripted) Disintegrate include/linux/tc_ematch Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:06 +01:00
David Howells	611128eb39	UAPI: (Scripted) Disintegrate include/linux/tc_act Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:05 +01:00
David Howells	e3dd9a52cb	UAPI: (Scripted) Disintegrate include/linux/sunrpc Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:04 +01:00
David Howells	7939d3c2a6	UAPI: (Scripted) Disintegrate include/linux/spi Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:03 +01:00
David Howells	616d1ca5d7	UAPI: (Scripted) Disintegrate include/linux/nfsd Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:02 +01:00
David Howells	ff1e1756c9	UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv6 Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:49:01 +01:00
David Howells	17c075923d	UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv4 Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:59 +01:00
David Howells	55c5cd3cc1	UAPI: (Scripted) Disintegrate include/linux/netfilter_bridge Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:58 +01:00
David Howells	8922082ae6	UAPI: (Scripted) Disintegrate include/linux/netfilter_arp Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:56 +01:00
David Howells	a82014149b	UAPI: (Scripted) Disintegrate include/linux/netfilter/ipset Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:55 +01:00
David Howells	94d0ec58e6	UAPI: (Scripted) Disintegrate include/linux/netfilter Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:54 +01:00
David Howells	1855f1b107	UAPI: (Scripted) Disintegrate include/linux/isdn Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:45 +01:00
David Howells	27a3aadcdc	UAPI: (Scripted) Disintegrate include/linux/caif Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>	2012-10-09 09:48:40 +01:00
Linus Torvalds	9e2d8656f5	Merge branch 'akpm' (Andrew's patch-bomb) Merge patches from Andrew Morton: "A few misc things and very nearly all of the MM tree. A tremendous amount of stuff (again), including a significant rbtree library rework." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits) sparc64: Support transparent huge pages. mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd(). mm: Add and use update_mmu_cache_pmd() in transparent huge page code. sparc64: Document PGD and PMD layout. sparc64: Eliminate PTE table memory wastage. sparc64: Halve the size of PTE tables sparc64: Only support 4MB huge pages and 8KB base pages. memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning mm: memcg: clean up mm_match_cgroup() signature mm: document PageHuge somewhat mm: use %pK for /proc/vmallocinfo mm, thp: fix mlock statistics mm, thp: fix mapped pages avoiding unevictable list on mlock memory-hotplug: update memory block's state and notify userspace memory-hotplug: preparation to notify memory block's state at memory hot remove mm: avoid section mismatch warning for memblock_type_name make GFP_NOTRACK definition unconditional cma: decrease cc.nr_migratepages after reclaiming pagelist CMA: migrate mlocked pages kpageflags: fix wrong KPF_THP on non-huge compound pages ...	2012-10-09 16:23:15 +09:00
Johannes Weiner	587af308cc	mm: memcg: clean up mm_match_cgroup() signature It really should return a boolean for match/no match. And since it takes a memcg, not a cgroup, fix that parameter name as well. [akpm@linux-foundation.org: mm_match_cgroup() is not a macro] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:23:04 +09:00
David Rientjes	b676b293fb	mm, thp: fix mapped pages avoiding unevictable list on mlock When a transparent hugepage is mapped and it is included in an mlock() range, follow_page() incorrectly avoids setting the page's mlock bit and moving it to the unevictable lru. This is evident if you try to mlock(), munlock(), and then mlock() a range again. Currently: #define MAP_SIZE (4 << 30) /* 4GB / void ptr = mmap(NULL, MAP_SIZE, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep -E "Unevictable\|Inactive\(anon" /proc/meminfo Inactive(anon): 6304 kB Unevictable: 4213924 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4186252 kB Unevictable: 19652 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4198556 kB Unevictable: 21684 kB Notice that less than 2MB was added to the unevictable list; this is because these pages in the range are not transparent hugepages since the 4GB range was allocated with mmap() and has no specific alignment. If posix_memalign() were used instead, unevictable would not have grown at all on the second mlock(). The fix is to call mlock_vma_page() so that the mlock bit is set and the page is added to the unevictable list. With this patch: mlock(ptr, MAP_SIZE); Inactive(anon): 4056 kB Unevictable: 4213940 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4198268 kB Unevictable: 19636 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4008 kB Unevictable: 4213940 kB Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:23:02 +09:00
Wen Congyang	e90bdb7f52	memory-hotplug: update memory block's state and notify userspace remove_memory() will be called when hot removing a memory device. But even if offlining memory, we cannot notice it. So the patch updates the memory block's state and sends notification to userspace. Additionally, the memory device may contain more than one memory block. If the memory block has been offlined, __offline_pages() will fail. So we should try to offline one memory block at a time. Thus remove_memory() also check each memory block's state. So there is no need to check the memory block's state before calling remove_memory(). Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:23:02 +09:00
Wen Congyang	a16cee10c7	memory-hotplug: preparation to notify memory block's state at memory hot remove remove_memory() is called in two cases: 1. echo offline >/sys/devices/system/memory/memoryXX/state 2. hot remove a memory device In the 1st case, the memory block's state is changed and the notification that memory block's state changed is sent to userland after calling remove_memory(). So user can notice memory block is changed. But in the 2nd case, the memory block's state is not changed and the notification is not also sent to userspcae even if calling remove_memory(). So user cannot notice memory block is changed. For adding the notification at memory hot remove, the patch just prepare as follows: 1st case uses offline_pages() for offlining memory. 2nd case uses remove_memory() for offlining memory and changing memory block's state and notifing the information. The patch does not implement notification to remove_memory(). Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:23:02 +09:00
Glauber Costa	3e648ebe07	make GFP_NOTRACK definition unconditional There was a general sentiment in a recent discussion (See https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be defined unconditionally. Currently, the only offender is GFP_NOTRACK, which is conditional to KMEMCHECK. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:23:01 +09:00
Minchan Kim	e46a28790e	CMA: migrate mlocked pages Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:23:00 +09:00
Hugh Dickins	8befedfe67	mm: remove unevictable_pgs_mlockfreed Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line from /proc/vmstat: Johannes and Mel point out that it was very unlikely to have been used by any tool, and of course we can restore it easily enough if that turns out to be wrong. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:59 +09:00
Minchan Kim	5a88381384	memory-hotplug: fix zone stat mismatch During memory-hotplug, I found NR_ISOLATED_[ANON\|FILE] are increasing, causing the kernel to hang. When the system doesn't have enough free pages, it enters reclaim but never reclaim any pages due to too_many_isolated()==true and loops forever. The cause is that when we do memory-hotadd after memory-remove, __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset() although the vm_stat_diff of all CPUs still have values. In addtion, when we offline all pages of the zone, we reset them in zone_pcp_reset without draining so we loss some zone stat item. Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:59 +09:00
Sagi Grimberg	2ec74c3ef2	mm: move all mmu notifier invocations to be done outside the PT lock In order to allow sleeping during mmu notifier calls, we need to avoid invoking them under the page table spinlock. This patch solves the problem by calling invalidate_page notification after releasing the lock (but before freeing the page itself), or by wrapping the page invalidation with calls to invalidate_range_begin and invalidate_range_end. To prevent accidental changes to the invalidate_range_end arguments after the call to invalidate_range_begin, the patch introduces a convention of saving the arguments in consistently named locals: unsigned long mmun_start; /* For mmu_notifiers / unsigned long mmun_end; / For mmu_notifiers */ ... mmun_start = ... mmun_end = ... mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ... mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); The patch changes code to use this convention for all calls to mmu_notifier_invalidate_range_start/end, except those where the calls are close enough so that anyone who glances at the code can see the values aren't changing. This patchset is a preliminary step towards on-demand paging design to be added to the RDMA stack. Why do we want on-demand paging for Infiniband? Applications register memory with an RDMA adapter using system calls, and subsequently post IO operations that refer to the corresponding virtual addresses directly to HW. Until now, this was achieved by pinning the memory during the registration calls. The goal of on demand paging is to avoid pinning the pages of registered memory regions (MRs). This will allow users the same flexibility they get when swapping any other part of their processes address spaces. Instead of requiring the entire MR to fit in physical memory, we can allow the MR to be larger, and only fit the current working set in physical memory. Why should anyone care? What problems are users currently experiencing? This can make programming with RDMA much simpler. Today, developers that are working with more data than their RAM can hold need either to deregister and reregister memory regions throughout their process's life, or keep a single memory region and copy the data to it. On demand paging will allow these developers to register a single MR at the beginning of their process's life, and let the operating system manage which pages needs to be fetched at a given time. In the future, we might be able to provide a single memory access key for each process that would provide the entire process's address as one large memory region, and the developers wouldn't need to register memory regions at all. Is there any prospect that any other subsystems will utilise these infrastructural changes? If so, which and how, etc? As for other subsystems, I understand that XPMEM wanted to sleep in MMU notifiers, as Christoph Lameter wrote at http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and perhaps Andrea knows about other use cases. Scheduling in mmu notifications is required since we need to sync the hardware with the secondary page tables change. A TLB flush of an IO device is inherently slower than a CPU TLB flush, so our design works by sending the invalidation request to the device, and waiting for an interrupt before exiting the mmu notifier handler. Avi said: kvm may be a buyer. kvm::mmu_lock, which serializes guest page faults, also protects long operations such as destroying large ranges. It would be good to convert it into a spinlock, but as it is used inside mmu notifiers, this cannot be done. (there are alternatives, such as keeping the spinlock and using a generation counter to do the teardown in O(1), which is what the "may" is doing up there). [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:58 +09:00
David Rientjes	957f822a0a	mm, numa: reclaim from all nodes within reclaim distance RECLAIM_DISTANCE represents the distance between nodes at which it is deemed too costly to allocate from; it's preferred to try to reclaim from a local zone before falling back to allocating on a remote node with such a distance. To do this, zone_reclaim_mode is set if the distance between any two nodes on the system is greather than this distance. This, however, ends up causing the page allocator to reclaim from every zone regardless of its affinity. What we really want is to reclaim only from zones that are closer than RECLAIM_DISTANCE. This patch adds a nodemask to each node that represents the set of nodes that are within this distance. During the zone iteration, if the bit for a zone's node is set for the local node, then reclaim is attempted; otherwise, the zone is skipped. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:56 +09:00
Hugh Dickins	a0c5e813f0	mm: remove free_page_mlock We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be checking it, reporting "BUG: Bad page state" if it's ever found set. Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:56 +09:00
Hugh Dickins	39b5f29ac1	mm: remove vma arg from page_evictable page_evictable(page, vma) is an irritant: almost all its callers pass NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page) explicitly in the couple of places it's needed. But in those places we don't even need page_evictable() itself! They're dealing with a freshly allocated anonymous page, which has no "mapping" and cannot be mlocked yet. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:55 +09:00
Jianguo Wu	7f1290f2f2	mm: fix-up zone present pages I think zone->present_pages indicates pages that buddy system can management, it should be: zone->present_pages = spanned pages - absent pages - bootmem pages, but is now: zone->present_pages = spanned pages - absent pages - memmap pages. spanned pages: total size, including holes. absent pages: holes. bootmem pages: pages used in system boot, managed by bootmem allocator. memmap pages: pages used by page structs. This may cause zone->present_pages less than it should be. For example, numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now it also minus memmap pages(free_area_init_core), which are actually allocated from ZONE_MOVABLE. When offlining all memory of a zone, this will cause zone->present_pages less than 0, because present_pages is unsigned long type, it is actually a very large integer, it indirectly caused zone->watermark[WMARK_MIN] becomes a large integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a large integer(calculate_totalreserve_pages()), and finally cause memory allocating failure when fork process(__vm_enough_memory()). [root@localhost ~]# dmesg -bash: fork: Cannot allocate memory I think the bug described in http://marc.info/?l=linux-mm&m=134502182714186&w=2 is also caused by wrong zone present pages. This patch intends to fix-up zone->present_pages when memory are freed to buddy system on x86_64 and IA64 platforms. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reported-by: Petr Tesarik <ptesarik@suse.cz> Tested-by: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:54 +09:00
Catalin Marinas	2d28a2275c	mm: thp: fix the pmd_clear() arguments in pmdp_get_and_clear() The CONFIG_TRANSPARENT_HUGEPAGE implementation of pmdp_get_and_clear() calls pmd_clear() with 3 arguments instead of 1. This happens only for !__HAVE_ARCH_PMDP_GET_AND_CLEAR which doesn't seem to happen because x86 defines this and it uses pmd_update. [mhocko@suse.cz: changelog addition] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:53 +09:00
Minchan Kim	723a0644a7	mm/page_alloc: refactor out __alloc_contig_migrate_alloc() __alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor it out (move + rename as a common name) into page_isolation.c. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:52 +09:00
Mel Gorman	62997027ca	mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity Compaction caches if a pageblock was scanned and no pages were isolated so that the pageblocks can be skipped in the future to reduce scanning. This information is not cleared by the page allocator based on activity due to the impact it would have to the page allocator fast paths. Hence there is a requirement that something clear the cache or pageblocks will be skipped forever. Currently the cache is cleared if there were a number of recent allocation failures and it has not been cleared within the last 5 seconds. Time-based decisions like this are terrible as they have no relationship to VM activity and is basically a big hammer. Unfortunately, accurate heuristics would add cost to some hot paths so this patch implements a rough heuristic. There are two cases where the cache is cleared. 1. If a !kswapd process completes a compaction cycle (migrate and free scanner meet), the zone is marked compact_blockskip_flush. When kswapd goes to sleep, it will clear the cache. This is expected to be the common case where the cache is cleared. It does not really matter if kswapd happens to be asleep or going to sleep when the flag is set as it will be woken on the next allocation request. 2. If there have been multiple failures recently and compaction just finished being deferred then a process will clear the cache and start a full scan. This situation happens if there are multiple high-order allocation requests under heavy memory pressure. The clearing of the PG_migrate_skip bits and other scans is inherently racy but the race is harmless. For allocations that can fail such as THP, they will simply fail. For requests that cannot fail, they will retry the allocation. Tests indicated that scanning rates were roughly similar to when the time-based heuristic was used and the allocation success rates were similar. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:51 +09:00
Mel Gorman	c89511ab2f	mm: compaction: Restart compaction from near where it left off This is almost entirely based on Rik's previous patches and discussions with him about how this might be implemented. Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. This patch caches where the migration and free scanner should start from on subsequent compaction invocations using the pageblock-skip information. When compaction starts it begins from the cached restart points and will update the cached restart points until a page is isolated or a pageblock is skipped that would have been scanned by synchronous compaction. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:50 +09:00
Mel Gorman	bb13ffeb9f	mm: compaction: cache if a pageblock was scanned and no pages were isolated When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time. Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set. The main difficulty with a patch like this is "when to ignore the cached information?" If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch o CMA always ignores the information o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded o If there are a large number of allocation failures, discard the cache. The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try allocate the page. The time-based mechanism is clumsy but a better option is not obvious. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:50 +09:00
Mel Gorman	753341a4b8	revert "mm: have order > 0 compaction start off where it left" This reverts commit `7db8889ab0` ("mm: have order > 0 compaction start off where it left") and commit `de74f1cc` ("mm: have order > 0 compaction start near a pageblock with free pages"). These patches were a good idea and tests confirmed that they massively reduced the amount of scanning but the implementation is complex and tricky to understand. A later patch will cache what pageblocks should be skipped and reimplements the concept of compact_cached_free_pfn on top for both migration and free scanners. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:50 +09:00
Wanpeng Li	f2d52fe51c	mm/memblock: cleanup early_node_map[] related comments Commit `0ee332c145` ("memblock: Kill early_node_map[]") removed early_node_map[]. Clean up the comments to comply with that change. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:47 +09:00
Shaohua Li	45cac65b0f	readahead: fault retry breaks mmap file read random detection .fault now can retry. The retry can break state machine of .fault. In filemap_fault, if page is miss, ra->mmap_miss is increased. In the second try, since the page is in page cache now, ra->mmap_miss is decreased. And these are done in one fault, so we can't detect random mmap file access. Add a new flag to indicate .fault is tried once. In the second try, skip ra->mmap_miss decreasing. The filemap_fault state machine is ok with it. I only tested x86, didn't test other archs, but looks the change for other archs is obvious, but who knows :) Signed-off-by: Shaohua Li <shaohua.li@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:47 +09:00
Shaohua Li	e79bee24fd	atomic: implement generic atomic_dec_if_positive() The x86 implementation of atomic_dec_if_positive is quite generic, so make it available to all architectures. This is needed for "swap: add a simple detector for inappropriate swapin readahead". [akpm@linux-foundation.org: do the "#define foo foo" trick in the conventional manner] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:46 +09:00
Minchan Kim	435b405c06	memory-hotplug: fix pages missed by race rather than failing If race between allocation and isolation in memory-hotplug offline happens, some pages could be in MIGRATE_MOVABLE of free_list although the pageblock's migratetype of the page is MIGRATE_ISOLATE. The race could be detected by get_freepage_migratetype in __test_page_isolated_in_pageblock. If it is detected, now EBUSY gets bubbled all the way up and the hotplug operations fails. But better idea is instead of returning and failing memory-hotremove, move the free page to the correct list at the time it is detected. It could enhance memory-hotremove operation success ratio although the race is really rare. Suggested by Mel Gorman. [akpm@linux-foundation.org: small cleanup] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:46 +09:00
Minchan Kim	95e3441248	mm: remain migratetype in freed page The page allocator caches the pageblock information in page->private while it is in the PCP freelists but this is overwritten with the order of the page when freed to the buddy allocator. This patch stores the migratetype of the page in the page->index field so that it is available at all times when the page remain in free_list. This patch adds a new call site in __free_pages_ok so it might be overhead a bit but it's for high order allocation. So I believe damage isn't hurt. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:45 +09:00
Minchan Kim	b12c4ad14e	mm: page_alloc: use get_freepage_migratetype() instead of page_private() The page allocator uses set_page_private and page_private for handling migratetype when it frees page. Let's replace them with [set\|get] _freepage_migratetype to make it more clear. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:45 +09:00
Bartlomiej Zolnierkiewicz	d1ce749a0d	cma: count free CMA pages Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this counter always available (not only when CONFIG_CMA=y). [akpm@linux-foundation.org: use conventional migratetype naming] Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:44 +09:00
Minchan Kim	02c6de8d75	mm: cma: discard clean pages during contiguous allocation instead of migration Drop clean cache pages instead of migration during alloc_contig_range() to minimise allocation latency by reducing the amount of migration that is necessary. It's useful for CMA because latency of migration is more important than evicting the background process's working set. In addition, as pages are reclaimed then fewer free pages for migration targets are required so it avoids memory reclaiming to get free pages, which is a contributory factor to increased latency. I measured elapsed time of __alloc_contig_migrate_range() which migrates 10M in 40M movable zone in QEMU machine. Before - 146ms, After - 7ms [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Mel Gorman <mgorman@suse.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rik van Riel <riel@redhat.com> Tested-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:43 +09:00
Michel Lespinasse	38a76013ad	mm: avoid taking rmap locks in move_ptes() During mremap(), the destination VMA is generally placed after the original vma in rmap traversal order: in move_vma(), we always have new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >= vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one. When the destination VMA is placed after the original in rmap traversal order, we can avoid taking the rmap locks in move_ptes(). Essentially, this reintroduces the optimization that had been disabled in "mm anon rmap: remove anon_vma_moveto_tail". The difference is that we don't try to impose the rmap traversal order; instead we just rely on things being in the desired order in the common case and fall back to taking locks in the uncommon case. Also we skip the i_mmap_mutex in addition to the anon_vma lock: in both cases, the vmas are traversed in increasing vm_pgoff order with ties resolved in tree insertion order. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:42 +09:00
Michel Lespinasse	ed8ea81501	mm: add CONFIG_DEBUG_VM_RB build option Add a CONFIG_DEBUG_VM_RB build option for the previously existing DEBUG_MM_RB code. Now that Andi Kleen modified it to avoid using recursive algorithms, we can expose it a bit more. Also extend this code to validate_mm() after stack expansion, and to check that the vma's start and last pgoffs have not changed since the nodes were inserted on the anon vma interval tree (as it is important that the nodes be reindexed after each such update). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:42 +09:00
Michel Lespinasse	bf181b9f9d	mm anon rmap: replace same_anon_vma linked list with an interval tree. When a large VMA (anon or private file mapping) is first touched, which will populate its anon_vma field, and then split into many regions through the use of mprotect(), the original anon_vma ends up linking all of the vmas on a linked list. This can cause rmap to become inefficient, as we have to walk potentially thousands of irrelevent vmas before finding the one a given anon page might fall into. By replacing the same_anon_vma linked list with an interval tree (where each avc's interval is determined by its vma's start and last pgoffs), we can make rmap efficient for this use case again. While the change is large, all of its pieces are fairly simple. Most places that were walking the same_anon_vma list were looking for a known pgoff, so they can just use the anon_vma_interval_tree_foreach() interval tree iterator instead. The exception here is ksm, where the page's index is not known. It would probably be possible to rework ksm so that the index would be known, but for now I have decided to keep things simple and just walk the entirety of the interval tree there. When updating vma's that already have an anon_vma assigned, we must take care to re-index the corresponding avc's on their interval tree. This is done through the use of anon_vma_interval_tree_pre_update_vma() and anon_vma_interval_tree_post_update_vma(), which remove the avc's from their interval tree before the update and re-insert them after the update. The anon_vma stays locked during the update, so there is no chance that rmap would miss the vmas that are being updated. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:41 +09:00
Michel Lespinasse	108d6642ad	mm anon rmap: remove anon_vma_moveto_tail mremap() had a clever optimization where move_ptes() did not take the anon_vma lock to avoid a race with anon rmap users such as page migration. Instead, the avc's were ordered in such a way that the origin vma was always visited by rmap before the destination. This ordering and the use of page table locks rmap usage safe. However, we want to replace the use of linked lists in anon rmap with an interval tree, and this will make it harder to impose such ordering as the interval tree will always be sorted by the avc->vma->vm_pgoff value. For now, let's replace the anon_vma_moveto_tail() ordering function with proper anon_vma locking in move_ptes(). Once we have the anon interval tree in place, we will re-introduce an optimization to avoid taking these locks in the most common cases. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:41 +09:00
Michel Lespinasse	9826a516ff	mm: interval tree updates Update the generic interval tree code that was introduced in "mm: replace vma prio_tree with an interval tree". Changes: - fixed 'endpoing' typo noticed by Andrew Morton - replaced include/linux/interval_tree_tmpl.h, which was used as a template (including it automatically defined the interval tree functions) with include/linux/interval_tree_generic.h, which only defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself defines the interval tree functions when invoked. Now that is a very long macro which is unfortunate, but it does make the usage sites (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously. - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro, instead of duplicating that code in the interval tree template. - replaced vma_interval_tree_add(), which was actually handling the nonlinear and interval tree cases, with vma_interval_tree_insert_after() which handles only the interval tree case and has an API that is more consistent with the other interval tree handling functions. The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap(). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:40 +09:00
Michel Lespinasse	9c079add0d	rbtree: move augmented rbtree functionality to rbtree_augmented.h Provide rb_insert_augmented() and rb_erase_augmented() through a new rbtree_augmented.h include file. rb_erase_augmented() is defined there as an __always_inline function, in order to allow inlining of augmented rbtree callbacks into it. Since this generates a relatively large function, each augmented rbtree user should make sure to have a single call site. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:40 +09:00
Michel Lespinasse	147e615f83	prio_tree: remove After both prio_tree users have been converted to use red-black trees, there is no need to keep around the prio tree library anymore. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:40 +09:00
Michel Lespinasse	6b2dbba8b6	mm: replace vma prio_tree with an interval tree Implement an interval tree as a replacement for the VMA prio_tree. The algorithms are similar to lib/interval_tree.c; however that code can't be directly reused as the interval endpoints are not explicitly stored in the VMA. So instead, the common algorithm is moved into a template and the details (node type, how to get interval endpoints from the node, etc) are filled in using the C preprocessor. Once the interval tree functions are available, using them as a replacement to the VMA prio tree is a relatively simple, mechanical job. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:39 +09:00
Michel Lespinasse	fff3fd8a12	rbtree: add prio tree and interval tree tests Patch 1 implements support for interval trees, on top of the augmented rbtree API. It also adds synthetic tests to compare the performance of interval trees vs prio trees. Short answers is that interval trees are slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x) on search. It is debatable how realistic the synthetic test is, and I have not made such measurements yet, but my impression is that interval trees would still come out faster. Patch 2 uses a preprocessor template to make the interval tree generic, and uses it as a replacement for the vma prio_tree. Patch 3 takes the other prio_tree user, kmemleak, and converts it to use a basic rbtree. We don't actually need the augmented rbtree support here because the intervals are always non-overlapping. Patch 4 removes the now-unused prio tree library. Patch 5 proposes an additional optimization to rb_erase_augmented, now providing it as an inline function so that the augmented callbacks can be inlined in. This provides an additional 5-10% performance improvement for the interval tree insert/erase benchmark. There is a maintainance cost as it exposes augmented rbtree users to some of the rbtree library internals; however I think this cost shouldn't be too high as I expect the augmented rbtree will always have much less users than the base rbtree. I should probably add a quick summary of why I think it makes sense to replace prio trees with augmented rbtree based interval trees now. One of the drivers is that we need augmented rbtrees for Rik's vma gap finding code, and once you have them, it just makes sense to use them for interval trees as well, as this is the simpler and more well known algorithm. prio trees, in comparison, seem too clever: they impose an additional 'heap' constraint on the tree, which they use to guarantee a faster worst-case complexity of O(k+log N) for stabbing queries in a well-balanced prio tree, vs O(k*log N) for interval trees (where k=number of matches, N=number of intervals). Now this sounds great, but in practice prio trees don't realize this theorical benefit. First, the additional constraint makes them harder to update, so that the kernel implementation has to simplify things by balancing them like a radix tree, which is not always ideal. Second, the fact that there are both index and heap properties makes both tree manipulation and search more complex, which results in a higher multiplicative time constant. As it turns out, the simple interval tree algorithm ends up running faster than the more clever prio tree. This patch: Add two test modules: - prio_tree_test measures the performance of lib/prio_tree.c, both for insertion/removal and for stabbing searches - interval_tree_test measures the performance of a library of equivalent functionality, built using the augmented rbtree support. In order to support the second test module, lib/interval_tree.c is introduced. It is kept separate from the interval_tree_test main file for two reasons: first we don't want to provide an unfair advantage over prio_tree_test by having everything in a single compilation unit, and second there is the possibility that the interval tree functionality could get some non-test users in kernel over time. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:39 +09:00
Michel Lespinasse	3908836aa7	rbtree: add RB_DECLARE_CALLBACKS() macro As proposed by Peter Zijlstra, this makes it easier to define the augmented rbtree callbacks. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:38 +09:00
Michel Lespinasse	9d9e6f9703	rbtree: remove prior augmented rbtree implementation convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree api and remove the old augmented rbtree implementation. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:38 +09:00
Michel Lespinasse	14b94af0b2	rbtree: faster augmented rbtree manipulation Introduce new augmented rbtree APIs that allow minimal recalculation of augmented node information. A new callback is added to the rbtree insertion and erase rebalancing functions, to be called on each tree rotations. Such rotations preserve the subtree's root augmented value, but require recalculation of the one child that was previously located at the subtree root. In the insertion case, the handcoded search phase must be updated to maintain the augmented information on insertion, and then the rbtree coloring/rebalancing algorithms keep it up to date. In the erase case, things are more complicated since it is library code that manipulates the rbtree in order to remove internal nodes. This requires a couple additional callbacks to copy a subtree's augmented value when a new root is stitched in, and to recompute augmented values down the ancestry path when a node is removed from the tree. In order to preserve maximum speed for the non-augmented case, we provide two versions of each tree manipulation function. rb_insert_augmented() is the augmented equivalent of rb_insert_color(), and rb_erase_augmented() is the augmented equivalent of rb_erase(). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:37 +09:00
Michel Lespinasse	bf7ad8eeab	rbtree: move some implementation details from rbtree.h to rbtree.c rbtree users must use the documented APIs to manipulate the tree structure. Low-level helpers to manipulate node colors and parenthood are not part of that API, so move them to lib/rbtree.c [dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:32 +09:00
Michel Lespinasse	4c199a93a2	rbtree: empty nodes have no color Empty nodes have no color. We can make use of this property to simplify the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also, we can get rid of the rb_init_node function which had been introduced by commit `88d19cf379` ("timers: Add rb_init_node() to allow for stack allocated rb nodes") to avoid some issue with the empty node's color not being initialized. I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are doing there, though. axboe introduced them in commit `10fd48f237` ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I see it, the 'empty node' abstraction is only used by rbtree users to flag nodes that they haven't inserted in any rbtree, so asking the predecessor or successor of such nodes doesn't make any sense. One final rb_init_node() caller was recently added in sysctl code to implement faster sysctl name lookups. This code doesn't make use of RB_EMPTY_NODE at all, and from what I could see it only called rb_init_node() under the mistaken assumption that such initialization was required before node insertion. [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:32 +09:00
Michel Lespinasse	1457d28778	rbtree: reference Documentation/rbtree.txt for usage instructions I recently started looking at the rbtree code (with an eye towards improving the augmented rbtree support, but I haven't gotten there yet). I noticed a lot of possible speed improvements, which I am now proposing in this patch set. Patches 1-4 are preparatory: remove internal functions from rbtree.h so that users won't be tempted to use them instead of the documented APIs, clean up some incorrect usages I've noticed (in particular, with the recently added fs/proc/proc_sysctl.c rbtree usage), reference the documentation so that people have one less excuse to miss it, etc. Patch 5 is a small module I wrote to check the rbtree performance. It creates 100 nodes with random keys and repeatedly inserts and erases them from an rbtree. Additionally, it has code to check for rbtree invariants after each insert or erase operation. Patches 6-12 is where the rbtree optimizations are done, and they touch only that one file, lib/rbtree.c . I am getting good results out of these - in my small benchmark doing rbtree insertion (including search) and erase, I'm seeing a 30% runtime reduction on Sandybridge E5, which is more than I initially thought would be possible. (the results aren't as impressive on my two other test hosts though, AMD barcelona and Intel Westmere, where I am seeing 14% runtime reduction only). The code size - both source (ommiting comments) and compiled - is also shorter after these changes. However, I do admit that the updated code is more arduous to read - one big reason for that is the removal of the tree rotation helpers, which added some overhead but also made it easier to reason about things locally. Overall, I believe this is an acceptable compromise, given that this code doesn't get modified very often, and that I have good tests for it. Upon Peter's suggestion, I added comments showing the rtree configuration before every rotation. I think they help; however it's still best to have a copy of the cormen/leiserson/rivest book when digging into this code. This patch: reference Documentation/rbtree.txt for usage instructions include/linux/rbtree.h included some basic usage instructions, while Documentation/rbtree.txt had some more complete and easier to follow instructions. Replacing the former with a reference to the latter. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:31 +09:00
Gerald Schaefer	46dcde735c	thp: introduce pmdp_invalidate() On s390, a valid page table entry must not be changed while it is attached to any CPU. So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE operation would be necessary there. This patch introduces the pmdp_invalidate() function, to allow architecture-specific implementations. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:29 +09:00
Gerald Schaefer	e3ebcf6438	thp: remove assumptions on pgtable_t type The thp page table pre-allocation code currently assumes that pgtable_t is of type "struct page *". This may not be true for all architectures, so this patch removes that assumption by replacing the functions prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that can be defined architecture-specific. It also removes two VM_BUG_ON checks for page_count() and page_mapcount() operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be no functional change introduced by this patch. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:29 +09:00
Davidlohr Bueso	01dc52ebdf	oom: remove deprecated oom_adj The deprecated /proc/<pid>/oom_adj is scheduled for removal this month. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:24 +09:00
Sagi Grimberg	21a92735f6	mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely schedule With an RCU based mmu_notifier implementation, any callout to mmu_notifier_invalidate_range_{start,end}() or mmu_notifier_invalidate_page() would not be allowed to call schedule() as that could potentially allow a modification to the mmu_notifier structure while it is currently being used. Since srcu allocs 4 machine words per instance per cpu, we may end up with memory exhaustion if we use srcu per mm. So all mms share a global srcu. Note that during large mmu_notifier activity exit & unregister paths might hang for longer periods, but it is tolerable for current mmu_notifier clients. Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Haggai Eran <haggaie@mellanox.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:23 +09:00
Xiao Guangrong	48af0d7cb3	mm: mmu_notifier: fix inconsistent memory between secondary MMU and host There is a bug in set_pte_at_notify() which always sets the pte to the new page before releasing the old page in the secondary MMU. At this time, the process will access on the new page, but the secondary MMU still access on the old page, the memory is inconsistent between them The below scenario shows the bug more clearly: at the beginning: p = 0, and p is write-protected by KSM or shared with parent process CPU 0 CPU 1 write 1 to p to trigger COW, set_pte_at_notify will be called: pte = new_page + W; /* The W bit of pte is set / p = 1; /* pte is valid, so no #PF / return back to secondary MMU, then the secondary MMU read p, but get: p == 0; /* * !!!!!! * the host has already set p to 1, but the secondary * MMU still get the old value 0 */ call mmu_notifier_change_pte to release old page in secondary MMU We can fix it by release old page first, then set the pte to the new page. Note, the new page will be firstly used in secondary MMU before it is mapped into the page table of the process, but this is safe because it is protected by the page table lock, there is no race to change the pte [akpm@linux-foundation.org: add comment from Andrea] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:22 +09:00
Mel Gorman	b22d127a39	mempolicy: fix a race in shared_policy_replace() shared_policy_replace() use of sp_alloc() is unsafe. 1) sp_node cannot be dereferenced if sp->lock is not held and 2) another thread can modify sp_node between spin_unlock for allocating a new sp node and next spin_lock. The bug was introduced before 2.6.12-rc2. Kosaki's original patch for this problem was to allocate an sp node and policy within shared_policy_replace and initialise it when the lock is reacquired. I was not keen on this approach because it partially duplicates sp_alloc(). As the paths were sp->lock is taken are not that performance critical this patch converts sp->lock to sp->mutex so it can sleep when calling sp_alloc(). [kosaki.motohiro@jp.fujitsu.com: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:22 +09:00
Mel Gorman	1fb3f8ca0e	mm: compaction: capture a suitable high-order page immediately when it is made available While compaction is migrating pages to free up large contiguous blocks for allocation it races with other allocation requests that may steal these blocks or break them up. This patch alters direct compaction to capture a suitable free page as soon as it becomes available to reduce this race. It uses similar logic to split_free_page() to ensure that watermarks are still obeyed. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:21 +09:00
Konstantin Khlebnikov	314e51b985	mm: kill vma flag VM_RESERVED and mm->reserved_vm counter A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: \| effect \| alternative flags -+------------------------+--------------------------------------------- 1\| account as reserved_vm \| VM_IO 2\| skip in core dump \| VM_IO, VM_DONTDUMP 3\| do not merge or expand \| VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4\| do not mlock \| VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND \| VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO\|VM_DONTEXPAND\|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND \| VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:19 +09:00
Konstantin Khlebnikov	0103bd16fb	mm: prepare VM_DONTDUMP for using in drivers Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags: VM_DONTEXPAND, VM_DONTCOPY. Currently this flag used only for sys_madvise. The next patch will use it for replacing the outdated flag VM_RESERVED. Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL (VM_IO \| VM_DONTEXPAND \| VM_RESERVED \| VM_PFNMAP) Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:18 +09:00

1 2 3 4 5 ...

53855 Commits