In the VF driver, module parameter mlx4_log_num_mgm_entry_size was
mistakenly overwritten -- and in a manner which overrode the
device-managed flow steering option encoded in the parameter.
log_num_mgm_entry_size is a global module parameter which
affects all ConnectX-3 PFs installed on that host.
If a VF changes log_num_mgm_entry_size, this will affect all PFs
which are probed subsequent to the change (by disabling DMFS for
those PFs).
Fixes: 3c439b5586 ("mlx4_core: Allow choosing flow steering mode")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spoofcheck can't be enabled if VF MAC is zero.
Vice versa, can't zero MAC if spoofcheck is on.
Fixes: 8f7ba3ca12 ('net/mlx4: Add set VF mac address support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As ENOTSUPP is specific to NFS, change the return error value to
EOPNOTSUPP in various places in the mlx4 driver.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Suggested-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consistently use types from linux/types.h to fix the following
linux/rds.h userspace compilation errors:
/usr/include/linux/rds.h:198:2: error: unknown type name 'u8'
u8 rx_traces;
/usr/include/linux/rds.h:199:2: error: unknown type name 'u8'
u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:203:2: error: unknown type name 'u8'
u8 rx_traces;
/usr/include/linux/rds.h:204:2: error: unknown type name 'u8'
u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:205:2: error: unknown type name 'u64'
u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
Fixes: 3289025aed ("RDS: add receive message trace used by application")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include <linux/in6.h> in uapi/linux/seg6.h to fix the following
linux/seg6.h userspace compilation error:
/usr/include/linux/seg6.h:31:18: error: array type has incomplete element type 'struct in6_addr'
struct in6_addr segments[0];
Include <linux/seg6.h> in uapi/linux/seg6_iptunnel.h to fix
the following linux/seg6_iptunnel.h userspace compilation error:
/usr/include/linux/seg6_iptunnel.h:26:21: error: array type has incomplete element type 'struct ipv6_sr_hdr'
struct ipv6_sr_hdr srh[0];
Fixes: a50a05f497 ("ipv6: sr: add missing Kbuild export for header files")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As reported by Geert, remove the string so the user does not see this
config option. The option is explicitly selected only as a dependency of
in-kernel users.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 44091d29f2 ("lib: Introduce priority array area manager")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a void function, it is not necessary to append a return statement in it.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
trivial fix to spelling mistake in verbose log message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include <linux/if.h> to fix the following linux/llc.h userspace
compilation error:
/usr/include/linux/llc.h:26:27: error: 'IFHWADDRLEN' undeclared here (not in a function)
unsigned char sllc_mac[IFHWADDRLEN];
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include <linux/if.h> and <linux/in6.h> to fix the following
linux/ip6_tunnel.h userspace compilation errors:
/usr/include/linux/ip6_tunnel.h:23:12: error: 'IFNAMSIZ' undeclared here (not in a function)
char name[IFNAMSIZ]; /* name of tunnel device */
/usr/include/linux/ip6_tunnel.h:30:18: error: field 'laddr' has incomplete type
struct in6_addr laddr; /* local tunnel end-point address */
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
Mellanox mlx5e fixes for 4.11-rc1
This series includes some important bug fixes for mlx5e driver.
Three misc fixes:
From Mohamad, compilation fix on s390 system
From Me, A fix for driver unload when switchdev mode is on.
From Tariq, HW LRO frag size optimization for when build_skb is not used
(striding RQ mode).
Three CQE compression related fixes:
Two fixes from Tariq and I, to correctly setup CQE compression
parameters on driver load and on arbitrary user modifications.
Last patch, fixes a very critical issue that was originally reported
by Tom, where the driver reported csum errors or even page ref issues
for when cqe compression is enabled and rapidly active.
For your convenience this series was generated on top of net-next branch:
005c3490e9 ('Revert "ath10k: Search SMBIOS for OEM board file extension"')
for -stable:
net/mlx5e: Register/unregister vport representors on interface (for kernel >= 4.9)
net/mlx5e: Do not reduce LRO WQE size when not using build_skb (for kernel >= 4.9)
net/mlx5e: Fix broken CQE compression initialization (for kernel >= 4.9)
net/mlx5e: Update MPWQE stride size when modifying CQE compress state (for kernel >= 4.7)
net/mlx5e: Fix wrong CQE decompression (for kernel >= 4.7)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In cqe compression with striding RQ, the decompression of the CQE field
wqe_counter was done with a wrong wraparound value.
This caused handling cqes with a wrong pointer to wqe (rx descriptor)
and creating SKBs with wrong data, pointing to wrong (and already consumed)
strides/pages.
The meaning of the CQE field wqe_counter in striding RQ holds the
stride index instead of the WQE index. Hence, when decompressing
a CQE, wqe_counter should have wrapped-around the number of strides
in a single multi-packet WQE.
We dropped this wrap-around mask at all in CQE decompression of striding
RQ. It is not needed as in such cases the CQE compression session would
break because of different value of wqe_id field, starting a new
compression session.
Tested:
ethtool -K ethxx lro off/on
ethtool --set-priv-flags ethxx rx_cqe_compress on
super_netperf 16 {ipv4,ipv6} -t TCP_STREAM -m 50 -D
verified no csum errors and no page refcount issues.
Fixes: 7219ab34f1 ("net/mlx5e: CQE compression")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Tom Herbert <tom@herbertland.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the admin enables/disables cqe compression, updating
mpwqe stride size is required:
CQE compress ON ==> stride size = 256B
CQE compress OFF ==> stride size = 64B
This is already done on driver load via mlx5e_set_rq_type_params, all we
need is just to call it on arbitrary admin changes of cqe compression
state via priv flags or when changing timestamping state
(as it is mutually exclusive with cqe compression).
This bug introduces no functional damage, it only makes cqe compression
occur less often, since in ConnectX4-LX CQE compression is performed
only on packets smaller than stride size.
Tested:
ethtool --set-priv-flags ethxx rx_cqe_compress on
pktgen with 64 < pkt size < 256 and netperf TCP_STREAM (IPv4/IPv6)
verify `ethtool -S ethxx | grep compress` are advancing more often
(rapidly)
Fixes: 7219ab34f1 ("net/mlx5e: CQE compression")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of RQ type parameters are derived from CQE compression state flag,
CQE compression flag was initialized only after RQ type parameters
setup. This leads to load RQ with stride size smaller than what we
want for when CQE compression is on.
This bug introduces no functional damage, it only makes CQE compression
occur less often, since in ConnectX4-LX CQE compression is performed
only on packets smaller than stride size.
Fix this by marking default status of CQE compression in PFLAG prior to
calling mlx5e_set_rq_priv_params(), as it inits some fields based on it.
Tested:
load driver on systems where rx CQE compress will be on (MH)
pktgen with 64 < pkt size < 256 and netperf TCP_STREAM (IPv4/IPv6)
verify `ethtool -S ethxx | grep compress` are advancing more often
(rapidly)
Fixes: 2fc4bfb725 ("net/mlx5e: Dynamic RQ type infrastructure")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rq_type is Striding RQ, no room of SKB_RESERVE is needed
as SKB allocation is not done via build_skb.
Fixes: e4b8550807 ("net/mlx5e: Slightly reduce hardware LRO size")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently vport representors are added only on driver load and removed on
driver unload. Apparently we forgot to handle them when we added the
seamless reset flow feature. This caused to leave the representors
netdevs alive and active with open HW resources on pci shutdown and on
error reset flows.
To overcome this we move their handling to interface attach/detach, so
they would be cleaned up on shutdown and recreated on reset flows.
Fixes: 26e59d8077 ("net/mlx5e: Implement mlx5e interface attach/detach callbacks")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must hold the rcu read lock across looking up glocks and trying to
bump their refcount to prevent the glocks from being freed in between.
Cc: <stable@vger.kernel.org> # 4.3+
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Since the
commit f1c131b454
crypto: xts - Convert to skcipher
the XTS mode is based on ECB, so the mode must select
ECB otherwise it can fail to initialize.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
pci_enable_msix has been long deprecated, but this driver adds a new
instance. Convert it to pci_alloc_irq_vectors and greatly simplify
the code, and make sure the prope code properly unwinds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
pci_enable_msix has been long deprecated, but this driver adds a new
instance. Convert it to pci_alloc_irq_vectors and greatly simplify
the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge updates from Andrew Morton:
"142 patches:
- DAX updates
- various misc bits
- OCFS2 updates
- most of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits)
mm/z3fold.c: limit first_num to the actual range of possible buddy indexes
mm: fix <linux/pagemap.h> stray kernel-doc notation
zram: remove obsolete sysfs attrs
mm/memblock.c: remove unnecessary log and clean up
oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
mm: drop unused argument of zap_page_range()
mm: drop zap_details::check_swap_entries
mm: drop zap_details::ignore_dirty
mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled
mm: help __GFP_NOFAIL allocations which do not trigger OOM killer
mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
mm: consolidate GFP_NOFAIL checks in the allocator slowpath
lib/show_mem.c: teach show_mem to work with the given nodemask
arch, mm: remove arch specific show_mem
mm, page_alloc: warn_alloc print nodemask
mm, page_alloc: do not report all nodes in show_mem
Revert "mm: bail out in shrink_inactive_list()"
mm, vmscan: consider eligible zones in get_scan_count
mm, vmscan: cleanup lru size claculations
mm, vmscan: do not count freed pages as PGDEACTIVATE
...
- Sync dtc to upstream commit 0931cea3ba20. This picks up overlay
support in dtc.
- Set dma_ops for reserved memory users.
- Make references to IOMMU consistent in DT bindings.
- Cleanup references to pm_power_off in bindings.
- Move some display bindings that snuck into the old bindings/video/
path.
- Fix some wrong documentation paths caused from binding restructuring.
- Vendor prefixes for Faraday and Fujitsu.
- Fix an of_node ref counting leak in of_find_node_opts_by_path
- Introduce new graph helper of_graph_get_remote_node() which will be
used by DRM drivers in 4.12.
-----BEGIN PGP SIGNATURE-----
iQItBAABCAAXBQJYrcjTEBxyb2JoQGtlcm5lbC5vcmcACgkQ+vtdtY28YcOPjg/8
CZgxb3humk7Kkt+I3IsZCxsXhgdk2+CPfCSHlK1bc5Jqg0TzepQvFt4ZkuRHkZy/
pQLUpRnUEUl64aaE8WxZY8ELYZggWazcTnWgCOzvoYpqSD4dAkAsqeti1Qh9PGKz
fNLnPWREftojFZ7wVQ8btxC1dINc9eC9eEHsIsS8S8UIgLgv/aN6PeG1Ll/UUa9d
u9orY3lxrz8JvdZslGtd1XLZqehIG0AAXqYRasyKl6Sc1NgjdwJTrqoeHEnCYaBM
0+JUOf8kjRa1QNYN4SpuQ1gpovS8tPUGuODrWF2FvaQIxYHTzF8MpLqTvjBzdsj6
b1owHzMXOLlPqqmmQ+jkHY5phisM4heJCIanNerzfM9+lHvb6kB3EuoTkAGvzsr/
HCJi/Wk8tVrw6MYdnav0NT6aLIgZmrVspeJKlS1IdIkpsxZ64DsgX/YS/qGn/fcx
qMrDXh8gMFwJBENRCKmKYFu1kzJkBoVEXtGlIbRQDZwOIuHPJl/ed6naMSfyUmcL
wEe1I3buyB38FVzUTM2y0K1LfFJSJqOFSWTy5WCcTyP4cbKUyEja8vzN1Cx8BDwf
wYtGWQQcg1Pyo074De6ojXWPiiW8f64GLLjqPALjA+J6JtZY5PwAnGXV+1ZYgX+V
hV+kXbS4UIjN1koxroJ7ahfouWdOignmpwdvomQPLG8=
=D/BO
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree updates from Rob Herring:
"Pretty standard stuff with dtc upstream sync being the biggest piece.
- Sync dtc to upstream commit 0931cea3ba20. This picks up overlay
support in dtc.
- Set dma_ops for reserved memory users.
- Make references to IOMMU consistent in DT bindings.
- Cleanup references to pm_power_off in bindings.
- Move some display bindings that snuck into the old bindings/video/
path.
- Fix some wrong documentation paths caused from binding
restructuring.
- Vendor prefixes for Faraday and Fujitsu.
- Fix an of_node ref counting leak in of_find_node_opts_by_path
- Introduce new graph helper of_graph_get_remote_node() which will be
used by DRM drivers in 4.12"
* tag 'devicetree-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (27 commits)
DT: add Faraday Tec. as vendor
of: introduce of_graph_get_remote_node
of: Add missing space at end of pr_fmt().
of: make of_device_make_bus_id() static
of: fix of_node leak caused in of_find_node_opts_by_path
dt-bindings: net: remove reference to fixed link support
dt-bindings: power: reset: qnap-poweroff: Drop reference to pm_power_off
dt-bindings: power: reset: gpio-poweroff: Drop reference to pm_power_off
dt-bindings: mfd: as3722: Drop reference to pm_power_off
dt-bindings: display: move ANX7814 and SiI8620 bridge bindings
of/unittest: Swap arguments of of_unittest_apply_overlay()
Documentation: usb: fix wrong documentation paths
serial: fsl-imx-uart.txt: Remove generic property
devicetree: Add Fujitsu Ltd. vendor prefix
Documentation: display: fix wrong documentation paths
of: remove redundant memset in overlay
bus:qcom : Fix typo in qcom,ebi2.txt
dt-bindings: qman: Remove pool channel node
Documentation: panel-dpi: fix path to display-timing.txt
devicetree: bindings: clk: mvebu: fix description for sata1 on Armada XP
...
Three more DocBook template files have been converted to RST; only 21 to
go. There are various build improvements and the usual array of
documentation improvements and fixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYriFXAAoJEI3ONVYwIuV6iTMP/iV7ownq9IK1f8askcXKM76i
NoRdj4/JywAPQ73vLhOSDVELGdVJNRBjdyOdBRzxPgsqAhFmm79lVYV2eLIffQ2k
7LcVbEQR77I+4z9SwqIVbIWNCBry7Hu8aWh7moDL3I6yeuay408yr5YW2lIlsqHZ
V/LZgkTWDe+iQPeXNA4Djzylx0lcRlAy4yMSLjN1+gb9/uBnXb9J0eGJzgfZfrL8
fiIhymg3bv8vB99l6LMR5vT343QLWXf1yS31A7rPQvwkDo6zFehUJA0XNfIsl2dw
VQYsvl9vp9wy3e6Y0qKXPn1XhAhCrm64P3crBxK31MMvcKZVCfeRSZ78wrvpvewy
MVLlXdqop1bHPHowtRfA5jwxr1NqcYp+Jg0+YGX3iXpPi1Jfk36DNUy9iWvtvIzr
lWgQcIKsdCwwYUcvPR8Kt8T/3q/AHbYlI6mimWlkmbZwncQcgCrH5xSG+c2BIPfV
fn3W6eLHBn8RyVsxlaXlA0Y9TNtI/Cm85b3Ri10pFvhl868ppWfJxXHi7UtcbU58
sQzahISCTXOH/NQwkkh7kFMtczbB43rAcChvF7EUYpazVBpJ4P4HxKFg3eIzIdc6
VlBSaMu1hxUGoYxNNYuKr/nYstuczLOKzK7q4j/JOExY3RgTWP+T3bF02wgubvoa
D/9WfScewkgCJRoA7i17
=C5nd
-----END PGP SIGNATURE-----
Merge tag 'docs-4.11' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A slightly quieter cycle for documentation this time around.
Three more DocBook template files have been converted to RST; only 21
to go. There are various build improvements and the usual array of
documentation improvements and fixes"
* tag 'docs-4.11' of git://git.lwn.net/linux: (44 commits)
docs / driver-api: Fix structure references in device_link.rst
PM / docs: Fix structure references in device.rst
Add a target to check broken external links in the Documentation
Documentation: Fix linux-api list typo
Documentation: DocBook/Makefile comment typo
Improve sparse documentation
Documentation: make Makefile.sphinx no-ops quieter
Documentation: DMA-ISA-LPC.txt
Documentation: input: fix path to input code definitions
docs: Remove the copyright year from conf.py
docs: Fix a warning in the Korean HOWTO.rst translation
PM / sleep / docs: Convert PM notifiers document to reST
PM / core / docs: Convert sleep states API document to reST
PM / core: Update kerneldoc comments in pm.h
doc-rst: Fix recursive make invocation from macros
doc-rst: Delete output of failed dot-SVG conversion
doc-rst: Break shell command sequences on failure
Documentation/sphinx: make targets independent of Sphinx work for HAVE_SPHINX=0
doc-rst: fixed cleandoc target when used with O=dir
Documentation/sphinx: prevent generation of .pyc files in the source tree
...
200 commits and noteworthy changes for most architectures.
* ARM:
- GICv3 save/restore
- cache flushing fixes
- working MSI injection for GICv3 ITS
- physical timer emulation
* MIPS:
- various improvements under the hood
- support for SMP guests
- a large rewrite of MMU emulation. KVM MIPS can now use MMU notifiers
to support copy-on-write, KSM, idle page tracking, swapping, ballooning
and everything else. KVM_CAP_READONLY_MEM is also supported, so that
writes to some memory regions can be treated as MMIO. The new MMU also
paves the way for hardware virtualization support.
* PPC:
- support for POWER9 using the radix-tree MMU for host and guest
- resizable hashed page table
- bugfixes.
* s390: expose more features to the guest
- more SIMD extensions
- instruction execution protection
- ESOP2
* x86:
- improved hashing in the MMU
- faster PageLRU tracking for Intel CPUs without EPT A/D bits
- some refactoring of nested VMX entry/exit code, preparing for live
migration support of nested hypervisors
- expose yet another AVX512 CPUID bit
- host-to-guest PTP support
- refactoring of interrupt injection, with some optimizations thrown in
and some duct tape removed.
- remove lazy FPU handling
- optimizations of user-mode exits
- optimizations of vcpu_is_preempted() for KVM guests
* generic:
- alternative signaling mechanism that doesn't pound on tsk->sighand->siglock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJYral1AAoJEL/70l94x66DbNgH/Rx8YXuidFq2fe3RWOvld3RK
85OM/D5g38cTLpBE0/sJpcvX34iYN8U/l5foCZwpxB+83GHEk2Cr57JyfTogdaAJ
x8dBhHKQCA/HxSQUQLN6nFqRV+yT8WUR92Fhqx82+80BSen5Yzcfee/TDoW6T1IW
g8CYgX9FrRaGOX066ImAuUfdAdUVjyssfs9VttDTX+HiusPeuBPx/wsRe1ZEEPlH
vnltIJQb1ETV2GOZLUojKjzH6aZkjIl29XxjkYii9JTUornClG0DfW+5QT3uLrB5
gJ+G+Zmpsq8ZBx9jNDtAi7sFsoPY1Mzf+JPNCGXBra2sP2GrBAuXcxmgznRYltQ=
=8IIp
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"4.11 is going to be a relatively large release for KVM, with a little
over 200 commits and noteworthy changes for most architectures.
ARM:
- GICv3 save/restore
- cache flushing fixes
- working MSI injection for GICv3 ITS
- physical timer emulation
MIPS:
- various improvements under the hood
- support for SMP guests
- a large rewrite of MMU emulation. KVM MIPS can now use MMU
notifiers to support copy-on-write, KSM, idle page tracking,
swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is
also supported, so that writes to some memory regions can be
treated as MMIO. The new MMU also paves the way for hardware
virtualization support.
PPC:
- support for POWER9 using the radix-tree MMU for host and guest
- resizable hashed page table
- bugfixes.
s390:
- expose more features to the guest
- more SIMD extensions
- instruction execution protection
- ESOP2
x86:
- improved hashing in the MMU
- faster PageLRU tracking for Intel CPUs without EPT A/D bits
- some refactoring of nested VMX entry/exit code, preparing for live
migration support of nested hypervisors
- expose yet another AVX512 CPUID bit
- host-to-guest PTP support
- refactoring of interrupt injection, with some optimizations thrown
in and some duct tape removed.
- remove lazy FPU handling
- optimizations of user-mode exits
- optimizations of vcpu_is_preempted() for KVM guests
generic:
- alternative signaling mechanism that doesn't pound on
tsk->sighand->siglock"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits)
x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
x86/paravirt: Change vcp_is_preempted() arg type to long
KVM: VMX: use correct vmcs_read/write for guest segment selector/base
x86/kvm/vmx: Defer TR reload after VM exit
x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss
x86/kvm/vmx: Simplify segment_base()
x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
x86/kvm/vmx: Don't fetch the TSS base from the GDT
x86/asm: Define the kernel TSS limit in a macro
kvm: fix page struct leak in handle_vmon
KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
KVM: Return an error code only as a constant in kvm_get_dirty_log()
KVM: Return an error code only as a constant in kvm_get_dirty_log_protect()
KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl()
KVM: x86: remove code for lazy FPU handling
KVM: race-free exit from KVM_RUN without POSIX signals
KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
KVM: Support vCPU-based gfn->hva cache
KVM: use separate generations for each address space
...
* Fix a boot crash caused by the VT-d driver when booted with IOMMU
disabled. This was introduced with the recent IOMMU changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYrY3WAAoJECvwRC2XARrjb5YP/0ElmfA2AF+saJCfri/4f4Qp
7Qe5IbO2SWXe+46/PHyrTidhSC+zmO2uMmFNzhOYVdWg51XIUxSbF29qGX9/SxE+
CHMN4pwB7sZKyiJ9KJOM6zYSHIbyAUZvZpeD4xZOhK+6mJ+32CD8soqDFynpEMwi
pG8h7fyUXQF2Z2SEGE1kNA4L0Md2Ca7BSTXzwyv1cYVfq7Db7Lj9o0ykg/9U+9lX
oTtEa1WM9pwDjfqDLlbTjmg7kB6AcEmwPhckkhABXxshV1h5KmVS3WB1m7VedJeK
BJCs/GI2qvkr6/BfwFt4p/Mw4WXUmYSR0vPM3PqyS45rFQDUDcPFDLs2bwo7x43K
L/9VHnLfWc0LJ0aIU7jXMLVr454IkGBeOkT99ztSx5vRM/EQDl7Q1Rhe0ahNqOGe
B4tbX1n8p3dO2ftH3U4GvnDfBJ1rdOR9hD+ZBvMd/n7ezJhXAy/KI65va4IZAmn/
r7RfDvxhBskr/7ZhTttdoiU5EdScxG5wYm7sZej4N4Oa18xeSO05glSbwuFgAiFR
tcMj9stjd/Siz1LOuh+pwptOt2sv2F68+V9qC1xhFPdTD2Mkax0U0S1/2+Dztxm4
MUbXwkxw43r6sybeIATcZmumT6DY0vIiiU8OcwudtpOlhEdUmcgWAvyUMuP/zJrp
rDu/kbHzqShv/6di/a96
=K9RX
-----END PGP SIGNATURE-----
Merge tag 'iommu-fix-v4.11-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU fix from Joerg Roedel:
"Fix a boot crash caused by the VT-d driver when booted with IOMMU
disabled. This was introduced with the recent IOMMU changes"
* tag 'iommu-fix-v4.11-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix crash on boot when DMAR is disabled
- Fix i.MX5 TV encoder probing in case no dac-supply regulator
is set in the device tree.
- Remove 64 pixel min_width/height limit, which unnecessarily
prohibits creation of small frame buffers.
- Add missing ipu_csi_set_downsize export, for media drivers
built as modules.
- Stop modifying pdev->dev.of_node for IPU client devices that
do not have an OF modalias to fix module autoloading.
-----BEGIN PGP SIGNATURE-----
iQJLBAABCAA1FiEEBsBxhV1FaKwXuCOBUMKIHHCeYOsFAlimobkXHHAuemFiZWxA
cGVuZ3V0cm9uaXguZGUACgkQUMKIHHCeYOtMGw//YbLEgA/KvTw0WY2qZRmXjk0r
rYVJqYAdidwns6QiF/P4RsoXfMunk+y430ByS2TUFCFHZY6IQrtPyAylWyPqCoRA
+NALGAQ4whfkkobLuxyIlJ7ba9sh8DDgCTrStUaDtlpvQwmZ3BbH+g5uWnOeYvX3
Q9wRBY/w9DgupfKZ0pCT8Hu7YzcY2sOk/EgmroBOTWKBIXlwaHjfVek9H9D9geg7
6xSlpUp/7nOyX9aX9YJ3eCVrfUKtxFIrVepmvsdZw1p0r11/TgoAcJ5E7OBjJE26
IVONxjHE7NYroaq9qLp1LJf0i/htTKABAKirxso5NIlqpUPRnUqWRTXr7fZTIv64
D9slhxZzfnjS1X+9dMGykLm1Aj+DN84tqCw2y/+qVhSir2T8n9x/j9HQTy1NBCuh
esu18pCWBI7bPWD1UCaTn7qPhlatSPrNoHokD83xzSAnZgT9Efvli3eLPnrMPY1z
oNsZ5rGTI3m2TMLMYjWI+XRjWDR3yNwRMzQZR3DTxV9FZNIfgyPkjRGbGJmDQsN9
ptonQ1KReJnUXrFb7V1QMWlFvUMbDEN6fzuz6Gl4JrBVcjm1+WWcCYRgK/C0DfWb
xFPexIar3LX0w8QMbLPfYMHNhG3gTA9/kuj2FVV63kdnaaa4PJbhC0nmSQd/8Lnj
YW/VkTmYWTtfay5+5GU=
=hfko
-----END PGP SIGNATURE-----
Merge tag 'imx-drm-fixes-2017-02-17' of https://git.pengutronix.de/git/pza/linux into drm-next
imx-drm: TVE regulator, fb size limit, and ipu-v3 module fixes
- Fix i.MX5 TV encoder probing in case no dac-supply regulator
is set in the device tree.
- Remove 64 pixel min_width/height limit, which unnecessarily
prohibits creation of small frame buffers.
- Add missing ipu_csi_set_downsize export, for media drivers
built as modules.
- Stop modifying pdev->dev.of_node for IPU client devices that
do not have an OF modalias to fix module autoloading.
* tag 'imx-drm-fixes-2017-02-17' of https://git.pengutronix.de/git/pza/linux:
gpu: ipu-v3: Stop overwriting pdev->dev.of_node of child devices
gpu: ipu-v3: export ipu_csi_set_downsize
drm/imx: lift 64x64 pixel minimum framebuffer size requirement
drm/imx: imx-tve: Do not set the regulator voltage
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYoM2fAAoJEHm+PkMAQRiGr9MH/izEAMri7rJ0QMc3ejt+WmD0
8pkZw3+MVn71z6cIEgpzk4QkEWJd5rfhkETCeCp7qQ9V6cDW1FDE9+0OmPjiphDt
nnzKs7t7skEBwH5Mq5xygmIfkv+Z0QGHZ20gfQWY3F56Uxo+ARF88OBHBLKhqx3v
98C7YbMFLKBslKClA78NUEIdx0UfBaRqerlERx0Lfl9aoOrbBS6WI3iuREiylpih
9o7HTrwaGKkU4Kd6NdgJP2EyWPsd1LGalxBBjeDSpm5uokX6ALTdNXDZqcQscHjE
RmTqJTGRdhSThXOpNnvUJvk9L442yuNRrVme/IqLpxMdHPyjaXR3FGSIDb2SfjY=
=VMy8
-----END PGP SIGNATURE-----
Merge tag 'v4.10-rc8' into drm-next
Linux 4.10-rc8
Backmerge Linus rc8 to fix some conflicts, but also
to avoid pulling it in via a fixes pull from someone.
Pull seccomp fix from James Morris:
"A fix for a regression in the seccomp code (it was supposed to be in
the first pull req but I had it queued in the wrong branch)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
seccomp: Only dump core when single-threaded
- Various cleanups
- Livelock fixes for eofblocks scanning
- Improved input verification for on-disk metadata
- Fix races in the copy on write remap mechanism
- Fix buffer io error timeout controls
- Streamlining of directio copy on write
- Asynchronous discard support
- Fix asserts when splitting delalloc reservations
- Don't bloat bmbt when right shifting extents
- Inode alignment fixes for 32k block sizes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYp85wAAoJEPh/dxk0SrTr5HgP/jcx/oI+ap/NaXMi1Q8K65mh
C3gf27cgUxtdGnEO5KRUE1Jyscuu4ZpzugDdLQISwR55kesT5FU0xpgbsfiICc86
dxLAhg8auwpTfHV+96Do2hfpO3IhYoBC2w5jo32+C+SaQUqTdPixncZukX89tjyP
HOFLrQnpc336hCO2rv1Q9hSkD6IUCkSAtk+Dh1xMvbsmKFLGdmkTdqUQfl1U4YnV
2S98k9QSRdiVyzj3lAGOy+IU9aTcPX/PptMEYaQZEaod5WWNjy91lQZNM6zRc4QW
8P199yiH6CQa2vESO2SV72cJ40WihM1KQXqnrlJjAMGQ7mMGTGJcTwxhuZYUbDYZ
cuk6bAUaijt/PzfmydJKlcH8vFerX4aU4CGkxPU0nph0iTR5kxYlIAMmFw2cdRzf
Iar3SBb8Pc9jiNnEZMFsQ0Fd9hNk9rNoUSpKqm4FtSRocU6JjmpAdPqNYdTVKc2l
2EY7JMo0xCaTVC1WT6sE2NsxsFvm0R7H6HHG2vMFIMNkhI24GRijIXH6dQlaGCQJ
5oTHrSM7503qPlEQNsxF7zI02LpJT+duf+2ODw/FSjA1z/TWwOUYYUrPUOyQNdzP
NrRnMa6LWsEehkuvz2FFko8PKXD55lTuUP1KdjigjqKp8Jzkc/PP+uvuwF5vUFfd
pWRvE5m/NePWBZetbL3Q
=Ga1F
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"Here are the XFS changes for 4.11. We aren't introducing any major
features in this release cycle except for this being the first merge
window I've managed on my own. :)
Changes since last update:
- Various cleanups
- Livelock fixes for eofblocks scanning
- Improved input verification for on-disk metadata
- Fix races in the copy on write remap mechanism
- Fix buffer io error timeout controls
- Streamlining of directio copy on write
- Asynchronous discard support
- Fix asserts when splitting delalloc reservations
- Don't bloat bmbt when right shifting extents
- Inode alignment fixes for 32k block sizes"
* tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits)
xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
xfs: simplify xfs_rtallocate_extent
xfs: tune down agno asserts in the bmap code
xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment
xfs: don't reserve blocks for right shift transactions
xfs: fix len comparison in xfs_extent_busy_trim
xfs: fix uninitialized variable in _reflink_convert_cow
xfs: split indlen reservations fairly when under reserved
xfs: handle indlen shortage on delalloc extent merge
xfs: resurrect debug mode drop buffered writes mechanism
xfs: clear delalloc and cache on buffered write failure
xfs: don't block the log commit handler for discards
xfs: improve busy extent sorting
xfs: improve handling of busy extents in the low-level allocator
xfs: don't fail xfs_extent_busy allocation
xfs: correct null checks and error processing in xfs_initialize_perag
xfs: update ctime and mtime on clone destinatation inodes
xfs: allocate direct I/O COW blocks in iomap_begin
xfs: go straight to real allocations for direct I/O COW writes
xfs: return the converted extent in __xfs_reflink_convert_cow
...
Pull printk updates from Petr Mladek:
- Add Petr Mladek, Sergey Senozhatsky as printk maintainers, and Steven
Rostedt as the printk reviewer. This idea came up after the
discussion about printk issues at Kernel Summit. It was formulated
and discussed at lkml[1].
- Extend a lock-less NMI per-cpu buffers idea to handle recursive
printk() calls by Sergey Senozhatsky[2]. It is the first step in
sanitizing printk as discussed at Kernel Summit.
The change allows to see messages that would normally get ignored or
would cause a deadlock.
Also it allows to enable lockdep in printk(). This already paid off.
The testing in linux-next helped to discover two old problems that
were hidden before[3][4].
- Remove unused parameter by Sergey Senozhatsky. Clean up after a past
change.
[1] http://lkml.kernel.org/r/1481798878-31898-1-git-send-email-pmladek@suse.com
[2] http://lkml.kernel.org/r/20161227141611.940-1-sergey.senozhatsky@gmail.com
[3] http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
[4] http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
printk: drop call_console_drivers() unused param
printk: convert the rest to printk-safe
printk: remove zap_locks() function
printk: use printk_safe buffers in printk
printk: report lost messages in printk safe/nmi contexts
printk: always use deferred printk when flush printk_safe lines
printk: introduce per-cpu safe_print seq buffer
printk: rename nmi.c and exported api
printk: use vprintk_func in vprintk()
MAINTAINERS: Add printk maintainers
Summary of modules changes for the 4.11 merge window:
- A few small code cleanups
- Add modules git tree url to MAINTAINERS
Signed-off-by: Jessica Yu <jeyu@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYrKzPAAoJEMBFfjjOO8Fy8B4QAJpwYokr7a7irVaRt9+c+So5
iRoQ+ZGB7r0oiJOpuUeVwvr/h5WUCc8kZctGG249Gg9WrT7ypVCKNGOuGv9KUi4g
lZZ3SkebBfAkzqRABa4VA4uoz6lC4KgaxVeMBZOu0HUmcM4fGmjv8iONj/6eqkpv
jWSsO2iyTHa5c6L08I2M2unOMG4PqAS7ZS1S58A3A3vG7py9vJhq7gnom4dYHYQW
2sOGyNvs3RaTyyb/Gvsx/hcs5TPLyr+fzIruqFWzepGcafBSxQy/TdThJw5x5oGU
QLjP/EqSKQWGyJ/Pzx8UE9bGxxStyJOEEhniyigQvIq1ERkPeXZx+1nllWvBXZ9f
v+OplyWAzvQNNv+MZEE6s0l7EQDiowOmnpyfHZOQTHky4JwAZ/WwKcjzLsLaRENW
ePWLsM8F7Hhg9rpXBBEK5USTh0brcaNs6ox0CjlMqme8aNxYBoOUB7KDzlyWQCLd
rtY6F9upfYmG13J6cDV6qbSirHt5L2aErgOFTbl9ZGQb9rXdsv4VjYUMaVrFgz/5
9dUNqoZUd4jvLAZT7XSUJsqUKqIUWTE9vFhPiKUDGyptynCwk23VMcd3p/SDjhyL
kuuIOxYEAi/F+claors16UN6psGb9yHYHsmuTsSbJcBHA+VyavqIuCcVVbKIScVR
nR+VSRH0vnx3M7hR/OEq
=zs2F
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
"Summary of modules changes for the 4.11 merge window:
- A few small code cleanups
- Add modules git tree url to MAINTAINERS"
* tag 'modules-for-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
MAINTAINERS: add tree for modules
module: fix memory leak on early load_module() failures
module: Optimize search_module_extables()
modules: mark __inittest/__exittest as __maybe_unused
livepatch/module: print notice of TAINT_LIVEPATCH
module: Drop redundant declaration of struct module
At present, Tying the first_num size to NCHUNKS_ORDER is confusing. the
number of chunks is completely unrelated to the number of buddies.
The patch limits the first_num to actual range of possible buddy indexes.
and that is more reasonable and obvious without functional change.
Link: http://lkml.kernel.org/r/1476776569-29504-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Delete stray (second) function description in find_lock_page()
kernel-doc notation.
Note: scripts/kernel-doc just ignores the second function description.
Fixes: 2457aec637 ("mm: non-atomically mark page accessed during page cache allocation where possible")
Link: http://lkml.kernel.org/r/b037e9a3-516c-ec02-6c8e-fa5479747ba6@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We had a deprecated_attr_warn() warning for 2 years and now the time has
come and we finally can do the cleanup.
The plan was as follows:
: per-stat sysfs attributes are considered to be deprecated.
: The basic strategy is:
: -- the existing RW nodes will be downgraded to WO nodes (in linux 4.11)
: -- deprecated RO sysfs nodes will eventually be removed (in linux 4.11)
:
: The list of deprecated attributes can be found here:
: Documentation/ABI/obsolete/sysfs-block-zram
:
: Basically, every attribute that has its own read accessible sysfs
: node (e.g. num_reads) *AND* is accessible via one of the stat files
: (zram<id>/stat or zram<id>/io_stat or zram<id>/mm_stat) is considered
: to be deprecated.
The patch also removes `obsolete/sysfs-block-zram', clean ups
`testing/sysfs-block-zram' and tweaks zram.txt files.
Link: http://lkml.kernel.org/r/20170118035838.11090-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no variable named flags in memblock_add() and
memblock_reserve() so remove it from the log messages.
This patch also cleans up the type casting for phys_addr_t by using %pa
to print them.
Link: http://lkml.kernel.org/r/1484720165-25403-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Logic on whether we can reap pages from the VMA should match what we
have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP
VMAs, but we don't now.
Let's just extract condition on which we can shoot down pagesi from a
VMA with MADV_DONTNEED into separate function and use it in both places.
Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no users of zap_page_range() who wants non-NULL 'details'.
Let's drop it.
Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
detail == NULL would give the same functionality as
.check_swap_entries==true.
Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only user of ignore_dirty is oom-reaper. But it doesn't really use
it.
ignore_dirty only has effect on file pages mapped with dirty pte. But
oom-repear skips shared VMAs, so there's no way we can dirty file pte in
them.
Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch "mm, page_alloc: warn_alloc print nodemask" implicitly sets
the allocation nodemask to cpuset_current_mems_allowed when there is no
effective mempolicy. cpuset_current_mems_allowed is only effective when
cpusets are enabled, which is also printed by warn_alloc(), so setting
the nodemask to cpuset_current_mems_allowed is redundant and prevents
debugging issues where ac->nodemask is not set properly in the page
allocator.
This provides better debugging output since
cpuset_print_current_mems_allowed() is already provided.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701181347320.142399@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that __GFP_NOFAIL doesn't override decisions to skip the oom killer
we are left with requests which require to loop inside the allocator
without invoking the oom killer (e.g. GFP_NOFS|__GFP_NOFAIL used by fs
code) and so they might, in very unlikely situations, loop for ever -
e.g. other parallel request could starve them.
This patch tries to limit the likelihood of such a lockup by giving
these __GFP_NOFAIL requests a chance to move on by consuming a small
part of memory reserves. We are using ALLOC_HARDER which should be
enough to prevent from the starvation by regular allocation requests,
yet it shouldn't consume enough from the reserves to disrupt high
priority requests (ALLOC_HIGH).
While we are at it, let's introduce a helper __alloc_pages_cpuset_fallback
which enforces the cpusets but allows to fallback to ignore them if the
first attempt fails. __GFP_NOFAIL requests can be considered important
enough to allow cpuset runaway in order for the system to move on. It
is highly unlikely that any of these will be GFP_USER anyway.
Link: http://lkml.kernel.org/r/20161220134904.21023-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_may_oom makes sure to skip the OOM killer depending on the
allocation request. This includes lowmem requests, costly high order
requests and others. For a long time __GFP_NOFAIL acted as an override
for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
the existing open coded loops around allocator to nofail request (and we
have done that in the past) then such a change would have a non trivial
side effect which is far from obvious. Note that the primary motivation
for skipping the OOM killer is to prevent from pre-mature invocation.
The exception has been added by commit 82553a937f ("oom: invoke oom
killer for __GFP_NOFAIL"). The changelog points out that the oom killer
has to be invoked otherwise the request would be looping for ever. But
this argument is rather weak because the OOM killer doesn't really
guarantee a forward progress for those exceptional cases:
- it will hardly help to form costly order which in turn can result in
the system panic because of no oom killable task in the end - I believe
we certainly do not want to put the system down just because there is a
nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
the consequences. It is much better this request would loop for ever
than the massive system disruption
- lowmem is also highly unlikely to be freed during OOM killer
- GFP_NOFS request could trigger while there is still a lot of memory
pinned by filesystems.
This patch simply removes the __GFP_NOFAIL special case in order to have a
more clear semantic without surprising side effects.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Nils Holland <nholland@tisys.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa has pointed out that commit 0a0337e0d1 ("mm, oom: rework
oom detection") has subtly changed semantic for costly high order
requests with __GFP_NOFAIL and withtout __GFP_REPEAT and those can fail
right now. My code inspection didn't reveal any such users in the tree
but it is true that this might lead to unexpected allocation failures
and subsequent OOPs.
__alloc_pages_slowpath wrt. GFP_NOFAIL is hard to follow currently.
There are few special cases but we are lacking a catch all place to be
sure we will not miss any case where the non failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under nopage label which is the generic go-to-fail path. Non
failing allocations are retried or those that cannot retry like
non-sleeping allocation go to the failure point directly. This should
make the code flow much easier to follow and make it less error prone
for future changes.
While we are there we have to move the stall check up to catch
potentially looping non-failing allocations.
[akpm@linux-foundation.org: fix alloc_flags may-be-used-uninitalized]
Link: http://lkml.kernel.org/r/20161220134904.21023-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show_mem() allows to filter out node specific data which is irrelevant
to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is
done in skip_free_areas_node which skips all nodes which are not in the
mems_allowed of the current process. This works most of the time as
expected because the nodemask shouldn't be outside of the allocating
task but there are some exceptions. E.g. memory hotplug might want to
request allocations from outside of the allowed nodes (see
new_node_page).
Get rid of this hardcoded behavior and push the allocation mask down the
show_mem path and use it instead of cpuset_current_mems_allowed. NULL
nodemask is interpreted as cpuset_current_mems_allowed.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a generic implementation for quite some time already. If there
is any arch specific information to be printed then we should add a
callback called from the generic code rather than duplicate the whole
show_mem.
The current code has resulted in the code duplication and the output
divergence which is both confusing and adds maintainance costs.
Let's just get rid of this mess.
Link: http://lkml.kernel.org/r/20170117091543.25850-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn> [UniCore32]
Acked-by: Helge Deller <deller@gmx.de> [for parisc]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
warn_alloc is currently used for to report an allocation failure or an
allocation stall. We print some details of the allocation request like
the gfp mask and the request order. We do not print the allocation
nodemask which is important when debugging the reason for the allocation
failure as well. We alreaddy print the nodemask in the OOM report.
Add nodemask to warn_alloc and print it in warn_alloc as well.
Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>