Commit Graph

48395 Commits

Author SHA1 Message Date
Arnd Bergmann facc432faa net: simplify napi_synchronize() to avoid warnings
The napi_synchronize() function is defined twice: The definition
for SMP builds waits for other CPUs to be done, while the uniprocessor
variant just contains a barrier and ignores its argument.

In the mvneta driver, this leads to a warning about an unused variable
when we lookup the NAPI struct of another CPU and then don't use it:

ethernet/marvell/mvneta.c: In function 'mvneta_percpu_notifier':
ethernet/marvell/mvneta.c:2910:30: error: unused variable 'other_port' [-Werror=unused-variable]

There are no other CPUs on a UP build, so that code never runs, but
gcc does not know this.

The nicest solution seems to be to turn the napi_synchronize() helper
into an inline function for the UP case as well, as that leads gcc to
not complain about the argument being unused. Once we do that, we can
also combine the two cases into a single function definition and use
if(IS_ENABLED()) rather than #ifdef to make it look a bit nicer.

The warning first came up in linux-4.4, but I failed to catch it
earlier.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: f864288544 ("net: mvneta: Statically assign queues to CPUs")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-24 22:19:55 -08:00
Linus Torvalds a200dcb346 virtio: barrier rework+fixes
This adds a new kind of barrier, and reworks virtio and xen
 to use it.
 Plus some fixes here and there.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWlU2kAAoJECgfDbjSjVRpZ6IH/Ra19ecG8sCQo9zskr4zo22Z
 DZXC3u0sJDBYjjBAiw3IY1FKh7wx2Fr1RhUOj1bteBgcFCMCV1zInP5ITiCyzd1H
 YYh1w9C2tZaj2T4t9L4hIrAdtIF8fGS+oI2IojXPjOuDLEt6pfFBEjHp/sfl3UJq
 ZmZvw4OXviSNej7jBw8Xni3Uv18yfmLGXvMdkvMSPC1/XL29voGDqTVwhqJwxLVz
 k/ZLcKFOzIs9N7Nja0Jl1EiZtC2Y9cpItqweicNAzszlpkSL44vQxmCSefB+WyQ4
 gt0O3+AxYkLfrxzCBhUA4IpRex3/XPW1b+1e/V1XjfR2n/FlyLe+AIa8uPJElFc=
 =ukaV
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio barrier rework+fixes from Michael Tsirkin:
 "This adds a new kind of barrier, and reworks virtio and xen to use it.

  Plus some fixes here and there"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (44 commits)
  checkpatch: add virt barriers
  checkpatch: check for __smp outside barrier.h
  checkpatch.pl: add missing memory barriers
  virtio: make find_vqs() checkpatch.pl-friendly
  virtio_balloon: fix race between migration and ballooning
  virtio_balloon: fix race by fill and leak
  s390: more efficient smp barriers
  s390: use generic memory barriers
  xen/events: use virt_xxx barriers
  xen/io: use virt_xxx barriers
  xenbus: use virt_xxx barriers
  virtio_ring: use virt_store_mb
  sh: move xchg_cmpxchg to a header by itself
  sh: support 1 and 2 byte xchg
  virtio_ring: update weak barriers to use virt_xxx
  Revert "virtio_ring: Update weak barriers to use dma_wmb/rmb"
  asm-generic: implement virt_xxx memory barriers
  x86: define __smp_xxx
  xtensa: define __smp_xxx
  tile: define __smp_xxx
  ...
2016-01-18 16:44:24 -08:00
Linus Torvalds d05d82f711 Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
Pull arch/tile updates from Chris Metcalf:
 "This is a grab bag of changes that includes some NOHZ and
  context-tracking related changes, some debugging improvements,
  JUMP_LABEL support, and some fixes for tilepro allmodconfig support.

  We also remove the now-unused node_has_online_mem() definitions both
  for tile's asm/topology.h as well as in linux/topology.h itself"

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  numa: remove stale node_has_online_mem() define
  arch/tile: move user_exit() to early kernel entry sequence
  tile: fix bug in setting PT_FLAGS_DISABLE_IRQ on kernel entry
  tile: fix tilepro casts for readl, writel, etc
  tile: fix a -Wframe-larger-than warning
  tile: include the syscall number in the backtrace
  MAINTAINERS: add git URL for tile
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  tile/jump_label: add jump label support for TILE-Gx
  tile: define a macro ktext_writable_addr to get writable kernel text address
2016-01-18 12:57:18 -08:00
Linus Torvalds d90f351a9b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
Pull AVR32 updates from Hans-Christian Noren Egtvedt.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
  mmc: atmel: get rid of struct mci_dma_data
  mmc: atmel-mci: restore dma on AVR32
  avr32: wire up missing syscalls
  avr32: wire up accept4 syscall
2016-01-18 12:50:55 -08:00
Linus Torvalds 48f58ba9cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull more networking fixes from David Miller:

 1) Fix brcmfmac build with older gcc, from Arend van Spriel.

 2) IRQ values unintentionally truncated to u8 in mlx5 driver, from
    Doron Tsur.

 3) Fix build warnings wrt tcp cgroup changes, from Geert Uytterhoeven.

 4) Limit deep recursion in ovs stack, from Hannes Frederic Sowa.

 5) at803x phy driver bug fixes from, Martin Blumenstingl.

 6) Fix TSO handling in hns driver, from Daode Huang

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
  ovs: limit ovs recursions in ovs_execute_actions to not corrupt stack
  team: Replace rcu_read_lock with a mutex in team_vlan_rx_kill_vid
  net: hns: bug fix about hisilicon TSO BD mode
  brcmfmac: fix BRCMF_FW_NVRAM_DEF macro for older gcc compilers
  net: phy: at803x: Add the interrupt register bit definitions
  net: phy: at803x: Clean up duplicate register definitions
  net: phy: at803x: Allow specifying the RGMII RX clock delay via phy mode
  net: phy: at803x: Don't set gbit features for the AR8030 phy
  arm64: bpf: add extra pass to handle faulty codegen
  arm64: insn: remove BUG_ON from codegen
  sctp: the temp asoc's transports should not be hashed/unhashed
  net/mlx5_core: Fix trimming down IRQ number
  tcp_memcontrol: Forward declare cgroup_subsys and mem_cgroup stucts
  batman-adv: Drop immediate orig_node free function
  batman-adv: Drop immediate batadv_hard_iface free function
  batman-adv: Drop immediate neigh_ifinfo free function
  batman-adv: Drop immediate batadv_hardif_neigh_node free function
  batman-adv: Drop immediate batadv_neigh_node free function
  batman-adv: Drop immediate batadv_orig_ifinfo free function
  batman-adv: Avoid recursive call_rcu for batadv_nc_node
  ...
2016-01-18 12:35:14 -08:00
Linus Torvalds c38dec7166 RTC for 4.5
Core:
  - fix module reference count in rtc-proc
  - Replace simple_strtoul by kstrtoul
 
 New driver:
  - Epson RX8010SJ
 
 Subsystem wide cleanups:
  - use %ph for short hex dumps
  - constify *_chip_ops structures
 
 Drivers:
  - abx80x: Microcrystal rv1805 support, alarm support
  - cmos: prevent kernel warning on IRQ flags mismatch
  - s5m: various cleanups
  - rv8803: rx8900 compatibility, small error path fix
  - sunxi: various cleanups
  - lpc32xx: remove irq > NR_IRQS check from probe()
  - imxdi: fix spelling mistake in warning message
  - ds1685: don't try to micromanage sysfs output size
  - da9063: avoid writing undefined data to rtc
  - gemini: Remove unnecessary platform_set_drvdata()
  - efi: add efi_procfs in efi_rtc_ops
  - pcf8523: refuse to write dates later than 2099
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJWnRQIAAoJEKbNnwlvZCyzb8kP/26N+iQNxQc1SjR4n/ZpJlBp
 aGcqyloPWcrpHPaAlQ7slo/TD0LLYdpYjCshiPnJjm4OLgA5+mba/ca5BgMAKOJM
 t0oXtLRVM7//TB+wLkpdHgM/OrwsTmYFMkB15G6YymsSB56Dc5YUxg4fIqCGcpIX
 j78AGyfwFinx7WzxMfGQSQIIz3Itym/s4Mc5HAjFikXuOPBQbd6RRZfj80gB/u4f
 k3XUsU8bOvQxqUa0SzgQTww90K11h4L5JYIjNh8dBvMbYcVX+xv6OFWfQYcVqrMa
 YYiAlznlOJ7O4puy7LSAFzc2Vm42NxqM9WHQ4VNcpRTiQ4nYgcreyPc5DhpY0Tc5
 RdA2LmILEG6k+6b2d8yyGBEWOoXYVZyizelyCMkqJRixDXN4jow4Xc5qXDwP/bsX
 qQSQK7XFuvDTFFZ4hUqF2W/OWMWo2wIiaYaWY6EAQmh4iq3eFBQ+2yolSJn+1PEY
 mJqUyw7E7PNgPagbmfIVJ6g6Q6ZQ/J4G67feZTOFaEagbdFM5eP79HL6ZbtNYZce
 4eqcmILvlUtmF0pzm0Ec8fhTIWL55GpPVuge6QkaIPFYq0lId8xiAuOX2L6G4oRf
 iAI6URijGwPWIHrCUx5dLGQgkxgauPdW5MpnuvSnxGWezGT2tCzLzV4ZHLyzayiX
 L8txbwQN0P1ykTJu1R0/
 =xQhZ
 -----END PGP SIGNATURE-----

Merge tag 'rtc-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "Core:
   - fix module reference count in rtc-proc
   - Replace simple_strtoul by kstrtoul

  New driver:
   - Epson RX8010SJ

  Subsystem wide cleanups:
   - use %ph for short hex dumps
   - constify *_chip_ops structures

  Drivers:
   - abx80x: Microcrystal rv1805 support, alarm support
   - cmos: prevent kernel warning on IRQ flags mismatch
   - s5m: various cleanups
   - rv8803: rx8900 compatibility, small error path fix
   - sunxi: various cleanups
   - lpc32xx: remove irq > NR_IRQS check from probe()
   - imxdi: fix spelling mistake in warning message
   - ds1685: don't try to micromanage sysfs output size
   - da9063: avoid writing undefined data to rtc
   - gemini: Remove unnecessary platform_set_drvdata()
   - efi: add efi_procfs in efi_rtc_ops
   - pcf8523: refuse to write dates later than 2099"

* tag 'rtc-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (24 commits)
  rtc: cmos: prevent kernel warning on IRQ flags mismatch
  rtc: rtc-ds2404: constify ds2404_chip_ops structures
  rtc: s5m: Make register configuration per S2MPS device to remove exceptions
  rtc: s5m: Add separate field for storing auto-cleared mask in register config
  rtc: s5m: Cleanup by removing useless 'rtc' prefix from fields
  rtc: Replace simple_strtoul by kstrtoul
  rtc: abx80x: add alarm support
  rtc: abx80x: Add Microcrystal rv1805 support
  rtc: v3020: constify v3020_chip_ops structures
  rtc: rv8803: Extend compatibility with the rx8900
  rtc: rv8803: fix handling return value of i2c_smbus_read_byte_data
  rtc: Add Epson RX8010SJ RTC driver
  rtc: lpc32xx: remove irq > NR_IRQS check from probe()
  rtc: imxdi: fix spelling mistake in warning message
  rtc: ds1685: don't try to micromanage sysfs output size
  rtc: use %ph for short hex dumps
  rtc: da9063: avoid writing undefined data to rtc
  rtc: sunxi: use of_device_get_match_data
  rtc: sunxi: constify the data_year_param structure
  rtc: sunxi: fix signedness issues
  ...
2016-01-18 12:10:45 -08:00
Linus Torvalds d43fb9f3c5 fbdev changes for 4.5
* pxafb: device-tree support
 * An unsafe kernel parameter 'lockless_register_fb' for debugging problems
   happening while inside the console lock
 * Small miscellaneous fixes & cleanups
 * omapdss: add writeback support functions
 * Separation of omapfb and omapdrm (see below)
 
 About the separation of omapfb and omapdrm, see
 http://permalink.gmane.org/gmane.comp.video.dri.devel/143151 for longer story.
 
 The short version:
 
 omapfb and omapdrm have shared low level drivers (omapdss and panel drivers),
 making further development of omapdrm difficult. After these patches omapfb and
 omapdrm have their own versions of the drivers, which are more or less
 direct copies for now but will diverge soon.
 
 This also means that omapfb (everything under drivers/video/fbdev/omap2/) is
 now in maintenance mode, and all new development will be done for omapdrm
 (drivers/gpu/drm/omapdrm/).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWnKg1AAoJEPo9qoy8lh71lrsP/RwwG8FDMl2tgwcsKVa/VlbF
 oez/CaNeZ9Jz0qd5RbIzIS9QdbL0AZugg4twwl76UbHWT477Z3EbUmpw++78kasr
 RFKDWYqMbxw3kshRDyALinGQmxPOPjNnc5mt9CYKzK4x0pJSLBmZc8qaNK3L4a5a
 eLJ3h6UhQDY61D04qr+LuTCETAxNR78x+NNIG7vYa9oS0ZDDrhlDyVPw4akPDMS6
 6Y5NgtRL1h2mq2hLBgTDCrwx3p7yZbnkSRKbpFnw/yddiXilND1d75JoW+0F6vKW
 U8DiRKxYtHNBdry4HlpRwufT52wkmtA/2puCW5Smw8araQ7R+s+wOt/1HAYQM72g
 8UCmNFMbhBpk8x8pT24ja4wyTLM9gaZqG9MWHLPEPbE6WicxSbqEAvIX9sakXLv6
 dDaf1SHZ+DFpq0jOwC8Rcnx1JFeeNNDf5cJb2pZI2Zka5jayQRTdbxeZGGnpFzu1
 1ZMiNQ24U+n9hgjV9QMiCW24TEBXFhFTf0Nlne3VP7qUbmvLqMUdGxGwM+b25/El
 SW/peryWglxsn5EBA7XybK+RTYxbjDtD5a8SOjD2YTNqVVVFHgf7z05SfSmYO5yi
 H67eDqdt0YsEGG87I8hv3eKM7FSRlYAywTC2mPfSOJ3+/G+18OU/voepcJHZ15x7
 SO3e/TFTrtglJzjVzX8j
 =Nrji
 -----END PGP SIGNATURE-----

Merge tag 'fbdev-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux

Pull fbdev updates from Tomi Valkeinen:
 "Summary:

   - pxafb: device-tree support
   - An unsafe kernel parameter 'lockless_register_fb' for debugging
     problems happening while inside the console lock
   - Small miscellaneous fixes & cleanups
   - omapdss: add writeback support functions
   - Separation of omapfb and omapdrm (see below)

  About the separation of omapfb and omapdrm, see

    http://permalink.gmane.org/gmane.comp.video.dri.devel/143151

  for longer story.  The short version:

  omapfb and omapdrm have shared low level drivers (omapdss and panel
  drivers), making further development of omapdrm difficult.  After
  these patches omapfb and omapdrm have their own versions of the
  drivers, which are more or less direct copies for now but will diverge
  soon.

  This also means that omapfb (everything under drivers/video/fbdev/omap2/)
  is now in maintenance mode, and all new development will be done for
  omapdrm (drivers/gpu/drm/omapdrm/)"

* tag 'fbdev-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (49 commits)
  video: fbdev: pxafb: fix out of memory error path
  drm/omap: make omapdrm select OMAP2_DSS
  drm/omap: move omapdss & displays under omapdrm
  omapfb: move vrfb into omapfb
  omapfb: take omapfb's private omapdss into use
  omapfb/displays: change CONFIG_DISPLAY_* to CONFIG_FB_OMAP2_*
  omapfb/dss: change CONFIG_OMAP* to CONFIG_FB_OMAP*
  omapdss: remove CONFIG_OMAP2_DSS_VENC from omapdss.h
  omapfb: copy omapdss & displays for omapfb
  omapfb: allow compilation only if DRM_OMAP is disabled
  fbdev: omap2: panel-dpi: simplify gpio setting
  fbdev: omap2: panel-dpi: in .disable first disable backlight then display
  OMAPDSS: DSS: fix a warning message
  video: omapdss: delete unneeded of_node_put
  OMAPDSS: DISPC: Remove boolean comparisons
  OMAPDSS: DSI: cleanup DSI_IRQ_ERROR_MASK define
  OMAPDSS: remove extra out == NULL checks
  OMAPDSS: change internal dispc functions to static
  OMAPDSS: make a two dss feat funcs internal to omapdss
  OMAPDSS: remove extra EXPORT_SYMBOLs
  ...
2016-01-18 11:58:31 -08:00
Chris Metcalf 00d27c6336 numa: remove stale node_has_online_mem() define
This isn't used anywhere, so delete it.

Looks like the last usage (in x86-specific code) was removed by Tejun
in 2011 in commit bd6709a91a ("x86, NUMA: Make 32bit use common NUMA
init path").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
2016-01-18 14:49:33 -05:00
Linus Torvalds 5807fcaa9b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:

 - EVM gains support for loading an x509 cert from the kernel
   (EVM_LOAD_X509), into the EVM trusted kernel keyring.

 - Smack implements 'file receive' process-based permission checking for
   sockets, rather than just depending on inode checks.

 - Misc enhancments for TPM & TPM2.

 - Cleanups and bugfixes for SELinux, Keys, and IMA.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (41 commits)
  selinux: Inode label revalidation performance fix
  KEYS: refcount bug fix
  ima: ima_write_policy() limit locking
  IMA: policy can be updated zero times
  selinux: rate-limit netlink message warnings in selinux_nlmsg_perm()
  selinux: export validatetrans decisions
  gfs2: Invalid security labels of inodes when they go invalid
  selinux: Revalidate invalid inode security labels
  security: Add hook to invalidate inode security labels
  selinux: Add accessor functions for inode->i_security
  security: Make inode argument of inode_getsecid non-const
  security: Make inode argument of inode_getsecurity non-const
  selinux: Remove unused variable in selinux_inode_init_security
  keys, trusted: seal with a TPM2 authorization policy
  keys, trusted: select hash algorithm for TPM2 chips
  keys, trusted: fix: *do not* allow duplicate key options
  tpm_ibmvtpm: properly handle interrupted packet receptions
  tpm_tis: Tighten IRQ auto-probing
  tpm_tis: Refactor the interrupt setup
  tpm_tis: Get rid of the duplicate IRQ probing code
  ...
2016-01-17 19:13:15 -08:00
Linus Torvalds 2d663b5581 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Seven audit patches for 4.5, all very minor despite the diffstat.

  The diffstat churn for linux/audit.h can be attributed to needing to
  reshuffle the linux/audit.h header to fix the seccomp auditing issue
  (see the commit description for details).

  Besides the seccomp/audit fix, most of the fixes are around trying to
  improve the connection with the audit daemon and a Kconfig
  simplification.  Nothing crazy, and everything passes our little
  audit-testsuite"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: always enable syscall auditing when supported and audit is enabled
  audit: force seccomp event logging to honor the audit_enabled flag
  audit: Delete unnecessary checks before two function calls
  audit: wake up threads if queue switched from limited to unlimited
  audit: include auditd's threads in audit_log_start() wait exception
  audit: remove audit_backlog_wait_overflow
  audit: don't needlessly reset valid wait time
2016-01-17 18:48:49 -08:00
Linus Torvalds 0cbeafb245 Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync).  This isn't quite complete but DAX
      is new and it's good enough and the guys have a handle on what
      needs to be done - I expect this to be wrapped in the next week or
      two.

  - some vsprintf maintenance work

  - various other misc bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (145 commits)
  printk: change recursion_bug type to bool
  lib/vsprintf: factor out %pN[F] handler as netdev_bits()
  lib/vsprintf: refactor duplicate code to special_hex_number()
  printk-formats.txt: remove unimplemented %pT
  printk: help pr_debug and pr_devel to optimize out arguments
  lib/test_printf.c: test dentry printing
  lib/test_printf.c: add test for large bitmaps
  lib/test_printf.c: account for kvasprintf tests
  lib/test_printf.c: add a few number() tests
  lib/test_printf.c: test precision quirks
  lib/test_printf.c: check for out-of-bound writes
  lib/test_printf.c: don't BUG
  lib/kasprintf.c: add sanity check to kvasprintf
  lib/vsprintf.c: warn about too large precisions and field widths
  lib/vsprintf.c: help gcc make number() smaller
  lib/vsprintf.c: expand field_width to 24 bits
  lib/vsprintf.c: eliminate potential race in string()
  lib/vsprintf.c: move string() below widen_string()
  lib/vsprintf.c: pull out padding code from dentry_name()
  printk: do cond_resched() between lines while outputting to consoles
  ...
2016-01-17 12:58:52 -08:00
Linus Torvalds 58cf279aca GPIO bulk updates for the v4.5 kernel cycle:
Infrastructural changes:
 
 - In struct gpio_chip, rename the .dev node to .parent to better reflect
   the fact that this is not the GPIO struct device abstraction. We will
   add that soon so this would be totallt confusing.
 
 - It was noted that the driver .get_value() callbacks was
   sometimes reporting negative -ERR values to the gpiolib core, expecting
   them to be propagated to consumer gpiod_get_value() and gpio_get_value()
   calls. This was not happening, so as there was a mess of drivers
   returning negative errors and some returning "anything else than zero"
   to indicate that a line was active. As some would have bit 31 set to
   indicate "line active" it clashed with negative error codes. This is
   fixed by the largeish series clamping values in all drivers with
   !!value to [0,1] and then augmenting the code to propagate error codes
   to consumers. (Includes some ACKed patches in other subsystems.)
 
 - Add a void *data pointer to struct gpio_chip. The container_of() design
   pattern is indeed very nice, but we want to reform the struct gpio_chip
   to be a non-volative, stateless business, and keep states internal to
   the gpiolib to be able to hold on to the state when adding a proper
   userspace ABI (character device) further down the road. To achieve this,
   drivers need a handle at the internal state that is not dependent on
   their struct gpio_chip() so we add gpiochip_add_data() and
   gpiochip_get_data() following the pattern of many other subsystems.
   All the "use gpiochip data pointer" patches transforms drivers to this
   scheme.
 
 - The Generic GPIO chip header has been merged into the general
   <linux/gpio/driver.h> header, and the custom header for that removed.
   Instead of having a separate mm_gpio_chip struct for these generic
   drivers, merge that into struct gpio_chip, simplifying the code and
   removing the need for separate and confusing includes.
 
 Misc improvements:
 
 - Stabilize the way GPIOs are looked up from the ACPI legacy
   specification.
 
 - Incremental driver features for PXA, PCA953X, Lantiq (patches from the
   OpenWRT community), RCAR, Zynq, PL061, 104-idi-48
 
 New drivers:
 
 - Add a GPIO chip to the ALSA SoC AC97 driver.
 
 - Add a new Broadcom NSP SoC driver (this lands in the pinctrl dir, but
   the branch is merged here too to account for infrastructural changes).
 
 - The sx150x driver now supports the sx1502.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmsZhAAoJEEEQszewGV1ztq0QAJ1KbNOpmf/s3INkOH4r771Z
 WIrNEsmwwLIAryo8gKNOM0H1zCwhRUV7hIE5jYWgD6JvjuAN6vobMlZAq21j6YpB
 pKgqnI5DuoND450xjb8wSwGQ5NTYp1rFXNmwCrtyTjOle6AAW+Kp2cvVWxVr77Av
 uJinRuuBr9GOKW/yYM1Fw/6EPjkvvhVOb+LBguRyVvq0s5Peyw7ZVeY1tjgPHJLn
 oSZ9dmPUjHEn91oZQbtfro3plOObcxdgJ8vo//pgEmyhMeR8XjXES+aUfErxqWOU
 PimrZuMMy4cxnsqWwh3Dyxo7KSWfJKfSPRwnGwc/HgbHZEoWxOZI1ezRtGKrRQtj
 vubxp5dUBA5z66TMsOCeJtzKVSofkvgX2Wr/Y9jKp5oy9cHdAZv9+jEHV1pr6asz
 Tas97MmmO77XuRI/GPDqVHx8dfa15OIz9s92+Gu64KxNzVxTo4+NdoPSNxkbCILO
 FKn7EmU3D0OjmN2NJ9GAURoFaj3BBUgNhaxacG9j2bieyh+euuUHRtyh2k8zXR9y
 8OnY1UOrTUYF8YIq9pXZxMQRD/lqwCNHvEjtI6BqMcNx4MptfTL+FKYUkn/SgCYk
 QTNV6Ui+ety5D5aEpp5q0ItGsrDJ2LYSItsS+cOtMy2ieOxbQav9NWwu7eI3l5ly
 gwYTZjG9p9joPXLW0E3g
 =63rR
 -----END PGP SIGNATURE-----

Merge tag 'gpio-v4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO updates from Linus Walleij:
 "Here is the bulk of GPIO changes for v4.5.

  Notably there are big refactorings mostly by myself, aimed at getting
  the gpio_chip into a shape that makes me believe I can proceed to
  preserve state for a proper userspace ABI (character device) that has
  already been proposed once, but resulted in the feedback that I need
  to go back and restructure stuff.  So I've been restructuring stuff.
  On the way I ran into brokenness (return code from the get_value()
  callback) and had to fix it.  Also, refactored generic GPIO to be
  simpler.

  Some of that is still waiting to trickle down from the subsystems all
  over the kernel that provide random gpio_chips, I've touched every
  single GPIO driver in the kernel now, oh man I didn't know I was
  responsible for so much...

  Apart from that we're churning along as usual.

  I took some effort to test and retest so it should merge nicely and we
  shook out a couple of bugs in -next.

  Infrastructural changes:

   - In struct gpio_chip, rename the .dev node to .parent to better
     reflect the fact that this is not the GPIO struct device
     abstraction.  We will add that soon so this would be totallt
     confusing.

   - It was noted that the driver .get_value() callbacks was sometimes
     reporting negative -ERR values to the gpiolib core, expecting them
     to be propagated to consumer gpiod_get_value() and gpio_get_value()
     calls.  This was not happening, so as there was a mess of drivers
     returning negative errors and some returning "anything else than
     zero" to indicate that a line was active.  As some would have bit
     31 set to indicate "line active" it clashed with negative error
     codes.  This is fixed by the largeish series clamping values in all
     drivers with !!value to [0,1] and then augmenting the code to
     propagate error codes to consumers.  (Includes some ACKed patches
     in other subsystems.)

   - Add a void *data pointer to struct gpio_chip.  The container_of()
     design pattern is indeed very nice, but we want to reform the
     struct gpio_chip to be a non-volative, stateless business, and keep
     states internal to the gpiolib to be able to hold on to the state
     when adding a proper userspace ABI (character device) further down
     the road.  To achieve this, drivers need a handle at the internal
     state that is not dependent on their struct gpio_chip() so we add
     gpiochip_add_data() and gpiochip_get_data() following the pattern
     of many other subsystems.  All the "use gpiochip data pointer"
     patches transforms drivers to this scheme.

   - The Generic GPIO chip header has been merged into the general
     <linux/gpio/driver.h> header, and the custom header for that
     removed.  Instead of having a separate mm_gpio_chip struct for
     these generic drivers, merge that into struct gpio_chip,
     simplifying the code and removing the need for separate and
     confusing includes.

  Misc improvements:

   - Stabilize the way GPIOs are looked up from the ACPI legacy
     specification.

   - Incremental driver features for PXA, PCA953X, Lantiq (patches from
     the OpenWRT community), RCAR, Zynq, PL061, 104-idi-48

  New drivers:

   - Add a GPIO chip to the ALSA SoC AC97 driver.

   - Add a new Broadcom NSP SoC driver (this lands in the pinctrl dir,
     but the branch is merged here too to account for infrastructural
     changes).

   - The sx150x driver now supports the sx1502"

* tag 'gpio-v4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (220 commits)
  gpio: generic: make bgpio_pdata always visible
  gpiolib: fix chip order in gpio list
  gpio: mpc8xxx: Do not use gpiochip_get_data() in mpc8xxx_gpio_save_regs()
  gpio: mm-lantiq: Do not use gpiochip_get_data() in ltq_mm_save_regs()
  gpio: brcmstb: Allow building driver for BMIPS_GENERIC
  gpio: brcmstb: Set endian flags for big-endian MIPS
  gpio: moxart: fix build regression
  gpio: xilinx: Do not use gpiochip_get_data() in xgpio_save_regs()
  leds: pca9532: use gpiochip data pointer
  leds: tca6507: use gpiochip data pointer
  hid: cp2112: use gpiochip data pointer
  bcma: gpio: use gpiochip data pointer
  avr32: gpio: use gpiochip data pointer
  video: fbdev: via: use gpiochip data pointer
  gpio: pch: Optimize pch_gpio_get()
  Revert "pinctrl: lantiq: Implement gpio_chip.to_irq"
  pinctrl: nsp-gpio: use gpiochip data pointer
  pinctrl: vt8500-wmt: use gpiochip data pointer
  pinctrl: exynos5440: use gpiochip data pointer
  pinctrl: at91-pio4: use gpiochip data pointer
  ...
2016-01-17 12:32:01 -08:00
Linus Torvalds 6606b342fe Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
 "This adds following items:

   - watchdog restart handler support
   - watchdog reboot notifier support
   - watchdog sysfs attributes
   - support for the following new devices: AMD Mullins platform, AMD
     Carrizo platform, meson8b SoC, CSRatlas7, TS-4800, Alphascale
     asm9260-wdt, Zodiac, Sigma Designs SMP86xx/SMP87xx
   - Changes in refcounting for the watchdog core
   - watchdog core improvements
   - and small fixes"

* git://www.linux-watchdog.org/linux-watchdog: (60 commits)
  watchdog: asm9260: remove __init and __exit annotations
  watchdog: Drop pointer to watchdog device from struct watchdog_device
  watchdog: ziirave: Use watchdog infrastructure to create sysfs attributes
  watchdog: Add support for creating driver specific sysfs attributes
  watchdog: kill unref/ref ops
  watchdog: stmp3xxx: Remove unused variables
  watchdog: add MT7621 watchdog support
  hwmon: (sch56xx) Drop watchdog driver data reference count callbacks
  watchdog: da9055_wdt: Drop reference counting
  watchdog: da9052_wdt: Drop reference counting
  watchdog: Separate and maintain variables based on variable lifetime
  watchdog: diag288: Stop re-using watchdog core internal flags
  watchdog: Create watchdog device in watchdog_dev.c
  watchdog: qcom-wdt: Do not set 'dev' in struct watchdog_device
  watchdog: mena21: Do not use device pointer from struct watchdog_device
  watchdog: gpio: Do not use device pointer from struct watchdog_device
  watchdog: tangox: Print info message using pointer to platform device
  watchdog: bcm2835_wdt: Drop log message if watchdog is stopped
  devicetree: watchdog: add binding for Sigma Designs SMP8642 watchdog
  watchdog: add support for Sigma Designs SMP86xx/SMP87xx
  ...
2016-01-17 12:15:38 -08:00
Linus Torvalds a016af2e70 sound updates for 4.5-rc1
We've had quite busy weeks in this cycle.  Looking at ALSA core, the
 significant changes are a few fixes wrt timer and sequencer ioctls
 that have been revealed by fuzzer recently.  Other than that, ASoC
 core got a few updates about DAI link handling, but these are rather
 straightforward refactoring.
 
 In drivers scene, ASoC received quite lots of new drivers in addition
 to bunch of updates for still ongoing Intel Skylake support and
 topology API.  HD-audio gained a new HDMI/DP hotplug notification via
 component.  FireWire got a pile of code refactoring/updates with
 SCS.1x driver integration.
 
 More highlights are shown below.
 
 [NOTE: this contains also many commits for DRM.  This is due to the
  pull of drm stable branch into sound tree, as the base of i915 audio
  component work for HD-audio.  The highlights below don't contain
  these DRM changes, as these are supposed to be pulled via drm tree in
  anyway sooner or later.]
 
 Core
  - Handful fixes to harden ALSA timer and sequencer ioctls against
    races reported by syzkaller fuzzer
  - Irq description string can be unique to each card; only for
    HD-audio for now
 
 ASoC
  - Conversion of the array of DAI links to a list for supporting
    dynamically adding and removing DAI links
  - Topology API enhancements to make everything more component based
    and being able to specify PCM links via topology
  - Some more fixes for the topology code, though it is still not final
    and ready for enabling in production; we really need to get to the
    point where that can be done
  - A pile of changes for Intel SkyLake drivers which hopefully deliver
    some useful initial functionality for systems with this chipset,
    though there is more work still to come
  - Lots of new features and cleanups for the Renesas drivers
  - ANC support for WM5110
  - New drivers: Imagination Technologies IPs, Atmel class D speaker,
    Cirrus CS47L24 and WM1831, Dialog DA7128, Realtek RT5659 and
    RT56156, Rockchip RK3036, TI PC3168A, and AMD ACP
  - Rename PCM1792a driver to be generic pcm179x
 
 HD-Audio
  - Use audio component for i915 HDMI/DP hotplug handling
  - On-demand binding with i915 driver
  - bdl_pos_adj parameter adjustment for Baytrail controllers
  - Enable power_save_node for CX20722; this shouldn't lead to
    regression, hopefully
  - Kabylake HDMI/DP codec support
  - Quirks for Lenovo E50-80, Dell Latitude E-series, and other Dell
    machines
  - A few code refactoring
 
 FireWire
  - Lots of code cleanup and refactoring
  - Integrate the support of SCS.1x devices into snd-oxfw driver;
    snd-scs1x driver is obsoleted
 
 USB-audio
  - Fix possible NULL dereference at disconnection
  - A regression fix for Native Instruments devices
 
 Misc
  - A few code cleanups of fm801 driver
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWmmhNAAoJEGwxgFQ9KSmk/wsP/3eO+giAT9VRPa6qxR6VdT6I
 dZwTxcp4ZzUrgLxk9k5VYjqey6QL+1xWfl3Abrd+NzXDj1wo4KsDh2XCKG1btO9K
 UpIZf76Nzt7o91pzHbsU6mrjDeoVNqloZoGbg1utAmmegaXH3owd18p/ZHfE3sz2
 BbaHmYW/R8lnaBgBhzqJB97+zRaLJmMWpWHfpHaIPjdfw8/V4j76jtPnpmv2hDZl
 BHXVHcQXjVGunFRzxdzBLuTC+FmhzUeTAbbAdOT4fEoOCv5MtZqYppNxdhj+b9l5
 mrsXe5FBTNmrt9Z5TtfCuzgJPkzoDperFb0aKd7wI1jVMtLzkNCMlanHr9U6B6fr
 jSrs6l25xrpF1BBfRMfHjNudA5vng/XC5dtW00JofXSrIxtwPNUoDDiqJgw7xVm5
 aVWK7KkQIjRbHdCQaeTymv70oHHKei92hbCrXUobXZ7wLeJMXNVPT25ttChWrgAI
 7cu5h+K5PjReI/sJFTMPL4aHZ+jAn9quQl7vK8EXiL9E6G8lLiuBiVW6hjGd9At+
 Z6UyGV+nCM6O3qZcyParMuLkNtWx9uT7Pcn8oTZAdKPngNhsf8+yl9qmsFkNLDC4
 LKPx0+rdCjtMKn2du3krsHhG3EN9pLDrE6g5U3d6Cz83e69Y7fCuSjl31SjD91H0
 bZDcM/ejYSbid3yKN4TL
 =Gvgb
 -----END PGP SIGNATURE-----

Merge tag 'sound-4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound updates from Takashi Iwai:
 "We've had quite busy weeks in this cycle.  Looking at ALSA core, the
  significant changes are a few fixes wrt timer and sequencer ioctls
  that have been revealed by fuzzer recently.  Other than that, ASoC
  core got a few updates about DAI link handling, but these are rather
  straightforward refactoring.

  In drivers scene, ASoC received quite lots of new drivers in addition
  to bunch of updates for still ongoing Intel Skylake support and
  topology API.  HD-audio gained a new HDMI/DP hotplug notification via
  component.  FireWire got a pile of code refactoring/updates with
  SCS.1x driver integration.

  More highlights are shown below.

  [ NOTE: this contains also many commits for DRM.  This is due to the
    pull of drm stable branch into sound tree, as the base of i915 audio
    component work for HD-audio.  The highlights below don't contain
    these DRM changes, as these are supposed to be pulled via drm tree
    in anyway sooner or later.  ]

  Core:
   - Handful fixes to harden ALSA timer and sequencer ioctls against
     races reported by syzkaller fuzzer
   - Irq description string can be unique to each card; only for
     HD-audio for now

  ASoC:
   - Conversion of the array of DAI links to a list for supporting
     dynamically adding and removing DAI links
   - Topology API enhancements to make everything more component based
     and being able to specify PCM links via topology
   - Some more fixes for the topology code, though it is still not final
     and ready for enabling in production; we really need to get to the
     point where that can be done
   - A pile of changes for Intel SkyLake drivers which hopefully deliver
     some useful initial functionality for systems with this chipset,
     though there is more work still to come
   - Lots of new features and cleanups for the Renesas drivers
   - ANC support for WM5110
   - New drivers: Imagination Technologies IPs, Atmel class D speaker,
     Cirrus CS47L24 and WM1831, Dialog DA7128, Realtek RT5659 and
     RT56156, Rockchip RK3036, TI PC3168A, and AMD ACP
   - Rename PCM1792a driver to be generic pcm179x

  HD-Audio:
   - Use audio component for i915 HDMI/DP hotplug handling
   - On-demand binding with i915 driver
   - bdl_pos_adj parameter adjustment for Baytrail controllers
   - Enable power_save_node for CX20722; this shouldn't lead to
     regression, hopefully
   - Kabylake HDMI/DP codec support
   - Quirks for Lenovo E50-80, Dell Latitude E-series, and other Dell
     machines
   - A few code refactoring

  FireWire:
   - Lots of code cleanup and refactoring
   - Integrate the support of SCS.1x devices into snd-oxfw driver;
     snd-scs1x driver is obsoleted

  USB-audio:
   - Fix possible NULL dereference at disconnection
   - A regression fix for Native Instruments devices

  Misc:
   - A few code cleanups of fm801 driver"

* tag 'sound-4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (722 commits)
  ALSA: timer: Code cleanup
  ALSA: timer: Harden slave timer list handling
  ALSA: hda - Add fixup for Dell Latitidue E6540
  ALSA: timer: Fix race among timer ioctls
  ALSA: hda - add codec support for Kabylake display audio codec
  ALSA: timer: Fix double unlink of active_list
  ALSA: usb-audio: Fix mixer ctl regression of Native Instrument devices
  ALSA: hda - fix the headset mic detection problem for a Dell laptop
  ALSA: hda - Fix white noise on Dell Latitude E5550
  ALSA: hda_intel: add card number to irq description
  ALSA: seq: Fix race at timer setup and close
  ALSA: seq: Fix missing NULL check at remove_events ioctl
  ALSA: usb-audio: Avoid calling usb_autopm_put_interface() at disconnect
  ASoC: hdac_hdmi: remove unused hdac_hdmi_query_pin_connlist
  ASoC: AMD: Add missing include file
  ALSA: hda - Fixup inverted internal mic for Lenovo E50-80
  ALSA: usb: Add native DSD support for Oppo HA-1
  ASoC: Make aux_dev more like a generic component
  ASoC: bcm2835: cleanup includes by ordering them alphabetically
  ASoC: AMD: Manage ACP 2.x SRAM banks power
  ...
2016-01-17 12:05:31 -08:00
Doron Tsur 0b6e26ce89 net/mlx5_core: Fix trimming down IRQ number
With several ConnectX-4 cards installed on a server, one may receive
irqn > 255 from the kernel API, which we mistakenly trim to 8bit.

This causes EQ creation failure with the following stack trace:
[<ffffffff812a11f4>] dump_stack+0x48/0x64
[<ffffffff810ace21>] __setup_irq+0x3a1/0x4f0
[<ffffffff810ad7e0>] request_threaded_irq+0x120/0x180
[<ffffffffa0923660>] ? mlx5_eq_int+0x450/0x450 [mlx5_core]
[<ffffffffa0922f64>] mlx5_create_map_eq+0x1e4/0x2b0 [mlx5_core]
[<ffffffffa091de01>] alloc_comp_eqs+0xb1/0x180 [mlx5_core]
[<ffffffffa091ea99>] mlx5_dev_init+0x5e9/0x6e0 [mlx5_core]
[<ffffffffa091ec29>] init_one+0x99/0x1c0 [mlx5_core]
[<ffffffff812e2afc>] local_pci_probe+0x4c/0xa0

Fixing it by changing of the irqn type from u8 to unsigned int to
support values > 255

Fixes: 61d0e73e0a ('net/mlx5_core: Use the the real irqn in eq->irqn')
Reported-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Doron Tsur <doront@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-17 12:08:04 -05:00
Aaron Conole fe22cd9b7c printk: help pr_debug and pr_devel to optimize out arguments
Currently, pr_debug and pr_devel will not elide function call arguments
appearing in calls to the no_printk for these macros.  This is because
all side effects must be honored before proceeding to the 0-value
assignment in no_printk.

The behavior is contrary to documentation found in the CodingStyle and
the header file where these functions are declared.

This patch corrects that behavior by shunting out the call to no_printk
completely.  The format string is still checked by gcc for correctness,
but no code seems to be emitted in common cases.

[akpm@linux-foundation.org: remove braces, per Joe]
Fixes: 5264f2f75d ("include/linux/printk.h: use and neaten no_printk")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:29 -08:00
Tejun Heo 8d91f8b153 printk: do cond_resched() between lines while outputting to consoles
@console_may_schedule tracks whether console_sem was acquired through
lock or trylock.  If the former, we're inside a sleepable context and
console_conditional_schedule() performs cond_resched().  This allows
console drivers which use console_lock for synchronization to yield
while performing time-consuming operations such as scrolling.

However, the actual console outputting is performed while holding
irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
before starting outputting lines.  Also, only a few drivers call
console_conditional_schedule() to begin with.  This means that when a
lot of lines need to be output by console_unlock(), for example on a
console registration, the task doing console_unlock() may not yield for
a long time on a non-preemptible kernel.

If this happens with a slow console devices, for example a serial
console, the outputting task may occupy the cpu for a very long time.
Long enough to trigger softlockup and/or RCU stall warnings, which in
turn pile more messages, sometimes enough to trigger the next cycle of
warnings incapacitating the system.

Fix it by making console_unlock() insert cond_resched() between lines if
@console_may_schedule.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Kyle McMartin <kyle@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:25 -08:00
Viresh Kumar dfffa587a6 err.h: add (missing) unlikely() to IS_ERR_OR_NULL()
IS_ERR_VALUE() already contains it and so we need to add this only to
the !ptr check.  That will allow users of IS_ERR_OR_NULL(), to not add
this compiler flag.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:24 -08:00
Yaowei Bai 3f1bfd9413 include/linux/kdev_t.h: remove new_valid_dev()
As all new_valid_dev() checks have been removed it's time to drop
new_valid_dev() itself.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:23 -08:00
Michal Nazarewicz 8f57e4d930 include/linux/kernel.h: change abs() macro so it uses consistent return type
Rewrite abs() so that its return type does not depend on the
architecture and no unexpected type conversion happen inside of it.  The
only conversion is from unsigned to signed type.  char is left as a
return type but treated as a signed type regradless of it's actual
signedness.

With the old version, int arguments were promoted to long and depending
on architecture a long argument might result in s64 or long return type
(which may or may not be the same).

This came after some back and forth with Nicolas.  The current macro has
different return type (for the same input type) depending on
architecture which might be midly iritating.

An alternative version would promote to int like so:

	#define abs(x)	__abs_choose_expr(x, long long,			\
			__abs_choose_expr(x, long,			\
			__builtin_choose_expr(				\
				sizeof(x) <= sizeof(int),		\
				({ int __x = (x); __x<0?-__x:__x; }),	\
				((void)0))))

I have no preference but imagine Linus might.  :] Nicolas argument against
is that promoting to int causes iconsistent behaviour:

	int main(void) {
		unsigned short a = 0, b = 1, c = a - b;
		unsigned short d = abs(a - b);
		unsigned short e = abs(c);
		printf("%u %u\n", d, e);  // prints: 1 65535
	}

Then again, no sane person expects consistent behaviour from C integer
arithmetic.  ;)

Note:

  __builtin_types_compatible_p(unsigned char, char) is always false, and
  __builtin_types_compatible_p(signed char, char) is also always false.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Nicolas Pitre <nico@linaro.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:22 -08:00
Vasily Kulikov b8a0255db9 include/linux/poison.h: use POISON_POINTER_DELTA for poison pointers
TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which
should point to nowhere.  Redefine them using POISON_POINTER_DELTA
arithmetics to make sure they really point to non-mappable area declared
by the target architecture.

Signed-off-by: Vasily Kulikov <segoon@openwall.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Solar Designer <solar@openwall.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 11:17:22 -08:00
Linus Torvalds ece6267878 The clk framework and driver changes for 4.5 look pretty typical. The
bulk of the changes are to clk controller drivers, though some
 improvements to the core and some re-usable blocks/templates also
 received some love. In this past cycle the clk maintainers developed a
 good workflow for handling the common case of patch submissions
 containing a new drivers, new shared Device Tree header and a new Device
 Tree binding description. This requires coordination with the Device
 Tree maintainers and with the architecture maintainers (typically the
 arm-soc tree in our case). This explains the increase in changes to
 include/dt-bindings/... and to
 Documentation/devicetree/bindings/clock/... coming from the clk tree.
 The same commits can be expected to come through those trees on
 occasion, through the use of shared, immutable branches.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJWmVD1AAoJEKI6nJvDJaTUWqAQAJRqH+iKSnLyuVek6USYPXdp
 ZOW1JHIySCUF/ci3fvv9vbIuW/w4Uc9FdRdAfIIaWO32bpk4ljtEogVasvavL/54
 geN+0YERZpth+PL/wk+g+Dons/8BgN9AL+0ToaVVh2MsRE903jWpe+l0qWCl+SUn
 DQF+yB9F4QcVT7gb+KI0B6hr6lv5Mu/t9Ffq3DrK2UG71IEbrv953pVo19foZqbf
 FgMgduERMHvaM/R6p5xfXxIESbjE+QNMnykEo6bWcHF2wfQrIiYlW1khtnsigNus
 kze2mYWjG77KGkrOex5kwjuBDPfiaGAstk3jcRCCMB7nRmuFcJgryF8003CD3QvW
 +cY+ZBSkyXXtL/nrkefebAplQjsvum7dJ+W6hxA32B772jFL8s8ee5rGda0ibw8N
 nFQRopVcNPoR52tjmSEOxm7Z9vqoDj5qManDdZe62kr6bCFId97E6SzeGgeyDuBQ
 V7tkss9klHMhx29V37wkyl+A/pWXO+CsGzHLEivH8/L1CysMK1WGhyHJJ2/JTDTR
 n3z1EYdg67PbHpfhboscuLP+sHcrAefOCe2la4wKxwqd4DGw6J8RhhVZml6XnfQ3
 I4yLri8Pguol49G1ac1U87ylebKsXhWR5CRHVas1BFHrt0FGULHnl0ZOLzOu4GRZ
 h8/nd9ZODLFhB1w7fL1B
 =nkGr
 -----END PGP SIGNATURE-----

Merge tag 'clk-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk framework updates from Michael Turquette:
 "The clk framework and driver changes for 4.5 look pretty typical.  The
  bulk of the changes are to clk controller drivers, though some
  improvements to the core and some re-usable blocks/templates also
  received some love.

  In this past cycle the clk maintainers developed a good workflow for
  handling the common case of patch submissions containing a new
  drivers, new shared Device Tree header and a new Device Tree binding
  description.  This requires coordination with the Device Tree
  maintainers and with the architecture maintainers (typically the
  arm-soc tree in our case).

  This explains the increase in changes to include/dt-bindings/... and
  to Documentation/devicetree/bindings/clock/... coming from the clk
  tree.  The same commits can be expected to come through those trees on
  occasion, through the use of shared, immutable branches"

* tag 'clk-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (125 commits)
  clk: remove duplicated COMMON_CLK_NXP record from clk/Kconfig
  clk: fix clk-gpio.c with optional clock= DT property
  clk: rockchip: fix section mismatches with new child-clocks
  clk: gpio: handle error codes for of_clk_get_parent_count()
  clk: gpio: fix memory leak
  clk: shmobile: r8a7795: Add SATA0 clock
  clk: bcm2835: Add PWM clock support
  clk: bcm2835: Support for clock parent selection
  clk: bcm2835: add a round up ability to the clock divisor
  clk: lpc32xx: add common clock framework driver
  clk: lpc18xx: add NXP specific COMMON_CLK_NXP configuration symbol
  dt-bindings: clock: add NXP LPC32xx clock list for consumers
  dt-bindings: clock: add description of LPC32xx USB clock controller
  dt-bindings: clock: add description of LPC32xx clock controller
  clk: rockchip: rk3036: include downstream muxes into fractional dividers
  clk: add flag for clocks that need to be enabled on rate changes
  clk: rockchip: Allow the RK3288 SPDIF clocks to change their parent
  clk: rockchip: include downstream muxes into fractional dividers
  clk: rockchip: handle mux dependency of fractional dividers
  clk: bcm2835: Add a driver for the auxiliary peripheral clock gates.
  ...
2016-01-15 18:21:28 -08:00
Linus Torvalds d45187aaf0 Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
Pull dmi updates from Jean Delvare.

* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  firmware: dmi_scan: Save SMBIOS Type 9 System Slots
  firmware: dmi_scan: Fix dmi_find_device description
  firmware: dmi_scan: Clarify dmi_save_extended_devices
  firmware: dmi_scan: Optimize dmi_save_extended_devices
2016-01-15 18:12:18 -08:00
Wang Xiaoqiang 7f43add451 mm/mlock.c: change can_do_mlock return value type to boolean
Since can_do_mlock only return 1 or 0, so make it boolean.

No functional change.

[akpm@linux-foundation.org: update declaration in mm.h]
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov 036fbb21de memblock: fix section mismatch
allmodconfig produces following warning for me:

  WARNING: vmlinux.o(.text.unlikely+0x10314): Section mismatch in reference from the function movable_node_is_enabled() to the variable .meminit.data:movable_node_enabled
  The function movable_node_is_enabled() references
  the variable __meminitdata movable_node_enabled.
  This is often because movable_node_is_enabled lacks a __meminitdata
  annotation or the annotation of movable_node_enabled is wrong.

Let's mark the function with __meminit.  It fixes the warning.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dominik Dingel 4a9e1cda27 mm: bring in additional flag for fixup_user_fault to signal unlock
During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.

The gmap code will call fixup_user_fault which will end up always in
handle_mm_fault.  Till now we never cared about retries, but as the
userfaultfd code kind of relies on it.  this needs some fix.

This patchset does not take care of the futex code.  I will now look
closer at this.

This patch (of 2):

With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.

This patch brings in the logic to handle retries as well as it cleans up
the current documentation.  fixup_user_fault was not having the same
semantics as filemap_fault.  It never indicated if a retry happened and
so a caller wasn't able to handle that case.  So we now changed the
behaviour to always retry a locked mmap_sem.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 3565fce3a6 mm, x86: get_user_pages() for dax mappings
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e.  when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 5c7fb56e5e mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd
A dax-huge-page mapping while it uses some thp helpers is ultimately not
a transparent huge page.  The distinction is especially important in the
get_user_pages() path.  pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge() which have slightly different
semantics.

Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.

[kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in  __split_huge_pmd()]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 5c2c2587b1 mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
get_dev_page() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use.  Unlike get_page() it may fail if the device is, or
is in the process of being, disabled.  While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.

devm_memremap_pages() now requires a reference counter to be specified
at init time.  For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".

ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list.  That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert.  This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams f25748e3c3 mm, dax: convert vmf_insert_pfn_pmd() to pfn_t
Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 01c8f1c44b mm, dax, gpu: convert vm_insert_mixed to pfn_t
Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags.  When both are set it triggers
_PAGE_DEVMAP to be set in the resulting pte.

There are no functional changes to the gpu drivers as a result of this
conversion.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 888cdbc2c9 hugetlb: fix compile error on tile
Inlude asm/pgtable.h to get the definition for pud_t to fix:

  include/linux/hugetlb.h:203:29: error: unknown type name 'pud_t'

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 4b94ffdc41 x86, mm: introduce vmem_altmap to augment vmemmap_populate()
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 9476df7d80 mm: introduce find_dev_pagemap()
There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:

1/ We want to augment vmemmap_populate() (called via arch_add_memory())
   to allocate memmap storage from pre-allocated pages reserved by the
   device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
   rather than page allocator pages.  This is in support of
   devm_memremap_pages() mappings where the memmap is too large to fit in
   main memory (i.e. large persistent memory devices).

2/ Taking a reference against the mapping when inserting device pages
   into the address_space radix of a given inode.  This facilitates
   unmap_mapping_range() and truncate_inode_pages() operations when the
   driver is tearing down the mapping.

3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
   reference against the mapping so that the driver teardown path can
   revoke and drain usage of device pages.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 260ae3f7db mm: skip memory block registration for ZONE_DEVICE
Prevent userspace from trying and failing to online ZONE_DEVICE pages
which are meant to never be onlined.

For example on platforms with a udev rule like the following:

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

...will generate futile attempts to online the ZONE_DEVICE sections.
Example kernel messages:

    Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1004747
    Policy zone: Normal
    online_pages [mem 0x248000000-0x24fffffff] failed

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 34c0fd540e mm, dax, pmem: introduce pfn_t
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams ba049e93ae kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver.  The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array.  Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory.  The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 18):

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams b2e0d1625e dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

[willy@linux.intel.com: fix read() of a hole]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Minchan Kim b8d3c4c300 mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
We don't need to split THP page when MADV_FREE syscall is called if
[start, len] is aligned with THP size.  The split could be done when VM
decide to free it in reclaim path if memory pressure is heavy.  With
that, we could avoid unnecessary THP split.

For the feature, this patch changes pte dirtness marking logic of THP.
Now, it marks every ptes of pages dirty unconditionally in splitting,
which makes MADV_FREE void.  So, instead, this patch propagates pmd
dirtiness to all pages via PG_dirty and restores pte dirtiness from
PG_dirty.  With this, if pmd is clean(ie, MADV_FREEed) when split
happens(e,g, shrink_page_list), all of pages are clean too so we could
discard them.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Minchan Kim 10853a0392 mm: move lazily freed pages to inactive list
MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
them so there is no value keeping them in the active anonymous LRU so
this patch moves them to inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first because they are more likely to be cold rather
than recently active pages.

An arguable issue for the approach would be whether we should put the
page to the head or tail of the inactive list.  I chose head because the
kernel cannot make sure it's really cold or warm for every MADV_FREE
usecase but at least we know it's not *hot*, so landing of inactive head
would be a comprimise for various usecases.

This fixes suboptimal behavior of MADV_FREE when pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily.  This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
is might not be used.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Minchan Kim 854e9ed09d mm: support madvise(MADV_FREE)
Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation +
zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When madvise syscall is called, VM clears dirty bit of ptes of the
range.  If memory pressure happens, VM checks dirty bit of page table
and if it found still "clean", it means it's a "lazyfree pages" so VM
could discard the page instead of swapping out.  Once there was store
operation for the page before VM peek a page to reclaim, dirty bit is
set so VM can swap out the page instead of discarding.

One thing we should notice is that basically, MADV_FREE relies on dirty
bit in page table entry to decide whether VM allows to discard the page
or not.  IOW, if page table entry includes marked dirty bit, VM
shouldn't discard the page.

However, as a example, if swap-in by read fault happens, page table
entry doesn't have dirty bit so MADV_FREE could discard the page
wrongly.

For avoiding the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache.  It worked out because swapped-in page lives on swap
cache and since it is evicted from the swap cache, the page has PG_dirty
flag.  So both page flags check effectively prevent wrong discarding by
MADV_FREE.

However, a problem in above logic is that swapped-in page has PG_dirty
still after they are removed from swap cache so VM cannot consider the
page as freeable any more even if madvise_free is called in future.

Look at below example for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

To solve the problem, this patch clears PG_dirty if only the page is
owned exclusively by current process when madvise is called because
PG_dirty represents ptes's dirtiness in several processes so we could
clear it only if we own it exclusively.

Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)

  barrios@blaptop:~/benchmark/ebizzy$ lscpu
  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
  CPU(s):                12
  On-line CPU(s) list:   0-11
  Thread(s) per core:    1
  Core(s) per socket:    1
  Socket(s):             12
  NUMA node(s):          1
  Vendor ID:             GenuineIntel
  CPU family:            6
  Model:                 2
  Stepping:              3
  CPU MHz:               3200.185
  BogoMIPS:              6400.53
  Virtualization:        VT-x
  Hypervisor vendor:     KVM
  Virtualization type:   full
  L1d cache:             32K
  L1i cache:             32K
  L2 cache:              4096K
  NUMA node0 CPU(s):     0-11
  ebizzy benchmark(./ebizzy -S 10 -n 512)

  Higher avg is better.

   vanilla-jemalloc             MADV_free-jemalloc

  1 thread
  records: 10                   records: 10
  avg:   2961.90                avg:  12069.70
  std:     71.96(2.43%)         std:    186.68(1.55%)
  max:   3070.00                max:  12385.00
  min:   2796.00                min:  11746.00

  2 thread
  records: 10                   records: 10
  avg:   5020.00                avg:  17827.00
  std:    264.87(5.28%)         std:    358.52(2.01%)
  max:   5244.00                max:  18760.00
  min:   4251.00                min:  17382.00

  4 thread
  records: 10                   records: 10
  avg:   8988.80                avg:  27930.80
  std:   1175.33(13.08%)        std:   3317.33(11.88%)
  max:   9508.00                max:  30879.00
  min:   5477.00                min:  21024.00

  8 thread
  records: 10                   records: 10
  avg:  13036.50                avg:  33739.40
  std:    170.67(1.31%)         std:   5146.22(15.25%)
  max:  13371.00                max:  40572.00
  min:  12785.00                min:  24088.00

  16 thread
  records: 10                   records: 10
  avg:  11092.40                avg:  31424.20
  std:    710.60(6.41%)         std:   3763.89(11.98%)
  max:  12446.00                max:  36635.00
  min:   9949.00                min:  25669.00

  32 thread
  records: 10                   records: 10
  avg:  11067.00                avg:  34495.80
  std:    971.06(8.77%)         std:   2721.36(7.89%)
  max:  12010.00                max:  38598.00
  min:   9002.00                min:  30636.00

In summary, MADV_FREE is about much faster than MADV_DONTNEED.

This patch (of 12):

Add core MADV_FREE implementation.

[akpm@linux-foundation.org: small cleanups]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Shaohua Li" <shli@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Vladimir Davydov 8749cfea11 mm: add page_check_address_transhuge() helper
page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
code for looking up pte of a (possibly transhuge) page.  Move this code
to a new helper function, page_check_address_transhuge(), and make the
above mentioned functions use it.

This is just a cleanup, no functional changes are intended.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov b20ce5e03b mm: prepare page_referenced() and page_idle to new THP refcounting
Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with PMD, so there's no reason to look on PTEs
for PageTransHuge() pages.  That's no true anymore: THP can be mapped
with PTEs too.

The patch removes PageTransHuge() test from the functions and opencode
page table check.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov 9a982250f7 thp: introduce deferred_split_huge_page()
Currently we don't split huge page on partial unmap.  It's not an ideal
situation.  It can lead to memory overhead.

Furtunately, we can detect partial unmap on page_remove_rmap().  But we
cannot call split_huge_page() from there due to locking context.

It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.

The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting.  The splitting itself will happen when we get
memory pressure via shrinker interface.  The page will be dropped from
list on freeing through compound page destructor.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov e9b61f1985 thp: reintroduce split_huge_page()
This patch adds implementation of split_huge_page() for new
refcountings.

Unlike previous implementation, new split_huge_page() can fail if
somebody holds GUP pin on the page.  It also means that pin on page
would prevent it from bening split under you.  It makes situation in
many places much cleaner.

The basic scheme of split_huge_page():

  - Check that sum of mapcounts of all subpage is equal to page_count()
    plus one (caller pin). Foll off with -EBUSY. This way we can avoid
    useless PMD-splits.

  - Freeze the page counters by splitting all PMD and setup migration
    PTEs.

  - Re-check sum of mapcounts against page_count(). Page's counts are
    stable now. -EBUSY if page is pinned.

  - Split compound page.

  - Unfreeze the page by removing migration entries.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Naoya Horiguchi 4e41a30c6d mm: hwpoison: adjust for new thp refcounting
Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
changes in thp refcounting rule.  This patch fixes them up.

In the new refcounting, we no longer use tail->_mapcount to keep tail's
refcount, and thereby we can simplify get/put_hwpoison_page().

And another change is that tail's refcount is not transferred to the raw
page during thp split (more precisely, in new rule we don't take
refcount on tail page any more.) So when we need thp split, we have to
transfer the refcount properly to the 4kB soft-offlined page before
migration.

thp split code goes into core code only when precheck
(total_mapcount(head) == page_count(head) - 1) passes to avoid useless
split, where we assume that one refcount is held by the caller of thp
split and the others are taken via mapping.  To meet this assumption,
this patch moves thp split part in soft_offline_page() after
get_any_page().

[akpm@linux-foundation.org: remove unneeded #define, per Kirill]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A.  Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov eef1b3ba05 thp: implement split_huge_pmd()
Original split_huge_page() combined two operations: splitting PMDs into
tables of PTEs and splitting underlying compound page.  This patch
implements split_huge_pmd() which split given PMD without splitting
other PMDs this page mapped with or underlying compound page.

Without tail page refcounting, implementation of split_huge_pmd() is
pretty straight-forward.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov e1534ae950 mm: differentiate page_mapped() from page_mapcount() for compound pages
Let's define page_mapped() to be true for compound pages if any
sub-pages of the compound page is mapped (with PMD or PTE).

On other hand page_mapcount() return mapcount for this particular small
page.

This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTE.

Most users outside core-mm should use page_mapcount() instead of
page_mapped().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov 53f9263bab mm: rework mapcount accounting to enable 4k mapping of THPs
We're going to allow mapping of individual 4k pages of THP compound.  It
means we need to track mapcount on per small page basis.

Straight-forward approach is to use ->_mapcount in all subpages to track
how many time this subpage is mapped with PMDs or PTEs combined.  But
this is rather expensive: mapping or unmapping of a THP page with PMD
would require HPAGE_PMD_NR atomic operations instead of single we have
now.

The idea is to store separately how many times the page was mapped as
whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
track PTE mapcount.

We use the same approach as with compound page destructor and compound
order to store compound_mapcount: use space in first tail page,
->mapping this time.

Any time we map/unmap whole compound page (THP or hugetlb) -- we
increment/decrement compound_mapcount.  When we map part of compound
page with PTE we operate on ->_mapcount of the subpage.

page_mapcount() counts both: PTE and PMD mappings of the page.

Basically, we have mapcount for a subpage spread over two counters.  It
makes tricky to detect when last mapcount for a page goes away.

We introduced PageDoubleMap() for this.  When we split THP PMD for the
first time and there's other PMD mapping left we offset up ->_mapcount
in all subpages by one and set PG_double_map on the compound page.
These additional references go away with last compound_mapcount.

This approach provides a way to detect when last mapcount goes away on
per small page basis without introducing new overhead for most common
cases.

[akpm@linux-foundation.org: fix typo in comment]
[mhocko@suse.com: ignore partial THP when moving task]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov 4b471e8898 mm, thp: remove infrastructure for handling splitting PMDs
With new refcounting we don't need to mark PMDs splitting.  Let's drop
code to handle this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00