Commit Graph

872045 Commits

Author SHA1 Message Date
Krzysztof Kozlowski 656b4e6398 cpuidle: Fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-29 11:47:35 +01:00
Daniel Lezcano 5aa9ba6312 cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
Modify cpuidle_use_deepest_state() to take an additional exit latency
limit argument to be passed to find_deepest_idle_state() and make
cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for
forced idle.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Rebase and rearrange code, subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 11:46:18 +01:00
Daniel Lezcano c55b51a06b cpuidle: Allow idle injection to apply exit latency limit
In some cases it may be useful to specify an exit latency limit for
the idle state to be used during CPU idle time injection.

Instead of duplicating the information in struct cpuidle_device
or propagating the latency limit in the call stack, replace the
use_deepest_state field with forced_latency_limit_ns to represent
that limit, so that the deepest idle state with exit latency within
that limit is forced (i.e. no governors) when it is set.

A zero exit latency limit for forced idle means to use governors in
the usual way (analogous to use_deepest_state equal to "false" before
this change).

Additionally, add play_idle_precise() taking two arguments, the
duration of forced idle and the idle state exit latency limit, both
in nanoseconds, and redefine play_idle() as a wrapper around that
new function.

This change is preparatory, no functional impact is expected.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 11:32:55 +01:00
Rafael J. Wysocki cbda56d5fe cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks
Commit 99e98d3fb1 ("cpuidle: Consolidate disabled state checks")
overlooked the fact that the imx6q and tegra20 cpuidle drivers use
the "disabled" field in struct cpuidle_state for quirks which trigger
after the initialization of cpuidle, so reading the initial value of
that field is not sufficient for those drivers.

In order to allow them to implement the quirks without using the
"disabled" field in struct cpuidle_state, introduce a new helper
function and modify them to use it.

Fixes: 99e98d3fb1 ("cpuidle: Consolidate disabled state checks")
Reported-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-19 10:35:13 +01:00
Rafael J. Wysocki 85f6a17f24 cpuidle: teo: Avoid code duplication in conditionals
There are three places in teo_select() where a given amount of time
is compared with TICK_NSEC if tick_nohz_tick_stopped() returns true,
which is a bit of duplicated code.

Avoid that code duplication by defining a helper function to do the
check and using it in all of the places in question.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-15 00:54:33 +01:00
Rafael J. Wysocki 63f202e5ed cpuidle: teo: Avoid using "early hits" incorrectly
If the current state with the maximum "early hits" metric in
teo_select() is also the one "matching" the expected idle duration,
it will be used as the candidate one for selection even if its
"misses" metric is greater than its "hits" metric, which is not
correct.

In that case, the candidate state should be shallower than the
current one and its "early hits" metric should be the maximum
among the idle states shallower than the current one.

To make that happen, modify teo_select() to save the index of
the state whose "early hits" metric is the maximum for the
range of states below the current one and go back to that state
if it turns out that the current one should be rejected.

Fixes: 159e48560f ("cpuidle: teo: Fix "early hits" handling for disabled idle states")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-13 22:31:17 +01:00
Rafael J. Wysocki b6495b7f00 cpuidle: teo: Exclude cpuidle overhead from computations
One purpose of the computations in teo_update() is to determine
whether or not the (saved) time till the next timer event and the
measured idle duration fall into the same "bin", so avoid using
values that include the cpuidle overhead to obtain the latter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-13 22:31:17 +01:00
Rafael J. Wysocki c1d51f684c cpuidle: Use nanoseconds as the unit of time
Currently, the cpuidle subsystem uses microseconds as the unit of
time which (among other things) causes the idle loop to incur some
integer division overhead for no clear benefit.

In order to allow cpuidle to measure time in nanoseconds, add two
new fields, exit_latency_ns and target_residency_ns, to represent the
exit latency and target residency of an idle state in nanoseconds,
respectively, to struct cpuidle_state and initialize them with the
help of the corresponding values in microseconds provided by drivers.
Additionally, change cpuidle_governor_latency_req() to return the
idle state exit latency constraint in nanoseconds.

Also meeasure idle state residency (last_residency_ns in struct
cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds
and update the cpuidle core and governors accordingly.

However, the menu governor still computes typical intervals in
microseconds to avoid integer overflows.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
2019-11-11 21:56:07 +01:00
Rafael J. Wysocki 99e98d3fb1 cpuidle: Consolidate disabled state checks
There are two reasons why CPU idle states may be disabled: either
because the driver has disabled them or because they have been
disabled by user space via sysfs.

In the former case, the state's "disabled" flag is set once during
the initialization of the driver and it is never cleared later (it
is read-only effectively).  In the latter case, the "disable" field
of the given state's cpuidle_state_usage struct is set and it may be
changed via sysfs.  Thus checking whether or not an idle state has
been disabled involves reading these two flags every time.

In order to avoid the additional check of the state's "disabled" flag
(which is effectively read-only anyway), use the value of it at the
init time to set a (new) flag in the "disable" field of that state's
cpuidle_state_usage structure and use the sysfs interface to
manipulate another (new) flag in it.  This way the state is disabled
whenever the "disable" field of its cpuidle_state_usage structure is
nonzero, whatever the reason, and it is the only place to look into
to check whether or not the state has been disabled.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-11-06 13:19:56 +01:00
Yin Fengwei fa583f71a9 ACPI: processor_idle: Skip dummy wait if kernel is in guest
In function acpi_idle_do_entry(), an ioport access is used for
dummy wait to guarantee hardware behavior. But it could trigger
unnecessary VMexit if kernel is running as guest in virtualization
environment.

If it's in virtualization environment, the deeper C state enter
operation (inb()) will trap to hypervisor. It's not needed to do
dummy wait after the inb() call. So we could just remove the
dummy io port access to avoid unnecessary VMexit.

And keep dummy io port access to maintain timing for native
environment.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-25 10:33:20 +02:00
Zhenzhong Duan 918c1fe9fb cpuidle: Do not unset the driver if it is there already
Fix __cpuidle_set_driver() to check if any of the CPUs in the mask has
a driver different from drv already and, if so, return -EBUSY before
updating any cpuidle_drivers per-CPU pointers.

Fixes: 82467a5a88 ("cpuidle: simplify multiple driver support")
Cc: 3.11+ <stable@vger.kernel.org> # 3.11+
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-24 23:22:33 +02:00
Rafael J. Wysocki 2c2a83d329 Merge back earlier cpuidle material for v5.5. 2019-10-24 23:12:55 +02:00
Zhenzhong Duan 31d851407f cpuidle: haltpoll: Take 'idle=' override into account
Currenly haltpoll isn't aware of the 'idle=' override, the priority is
'idle=poll' > haltpoll > 'idle=halt'. When 'idle=poll' is used, cpuidle
driver is bypassed but current_driver in sys still shows 'haltpoll'.

When 'idle=halt' is used, haltpoll takes precedence and makes
'idle=halt' have no effect.

Add a check to prevent the haltpoll driver from loading if 'idle=' is
present.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-22 11:43:17 +02:00
Linus Torvalds 7d194c2100 Linux 5.4-rc4 2019-10-20 15:56:22 -04:00
Linus Torvalds e2ab4ef83f Kbuild fixes for v5.4 (2nd)
- fix a bashism of setlocalversion
 
  - do not use the too new --sort option of tar
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2sh9MeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGO90P/13dNgaibb3JcCvp
 +ugn+o26V+aqZ8uDLPPifUNFLLnDfGmgORAOe0Z2mGI2CqMDaWMYjvyZNF1KzOdD
 +WFZubiTUevYyM9AOG6O6EOGEDRumgvdNhnJIP+aCUNKGM50zzhuAa23y6Xip1NZ
 qJY5AYu9KiH9jTXtZAdGReuMdvNRsMZuvmI4Qnc21utItVxDFEQKGdTlRoJaGEdN
 Zc/f1HzWwlr7VWZiy0iMRSCAAtAA3TN3S1bw5QCYsuJb3LRl2EIOKMeSinRTYWRc
 zU5CHrkRSX4/M/o8VantTOZueuGMfbaOAl1KEDh8DUPDcrl+gmL4s8aFuDW+5d2+
 7pPlEe/FNZOAGVBDgiK59N3bl2Rn/Uyg9VGwFc8PU3VeY5VbS97ayBaLFcua2zwX
 88mtsZoq+fN3qsq8gfbb21qauTbEibyrk8lzjHoQl07TW82GFWG/GXSt9dQAm1sb
 w/M15WdaONKOzkYXGGCo0l6vdXx1+qqV/LZQcaWhfcoeU5YVVEmSRwCakCxxSWUE
 rvWtn0vljgH8kyb0gnUP6qLJC9clwMv58sl/mG65/D7zs9B74g16BkX0tU8vLSpr
 Ww1BpGtPwuFPp7+kEzWemtlulhDvow2xIX7bNIkAk1IIVpyCmU98AKpivR7G9Cvt
 4cZPSGRU+s2ou/hrtjryzvmwhGUG
 =KNWF
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild fixes from Masahiro Yamada:

 - fix a bashism of setlocalversion

 - do not use the too new --sort option of tar

* tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: substituting --sort in archive creation
  scripts: setlocalversion: fix a bashism
  kbuild: update comment about KBUILD_ALLDIRS
2019-10-20 12:36:57 -04:00
Linus Torvalds 4fe34d61a3 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A small set of x86 fixes:

   - Prevent a NULL pointer dereference in the X2APIC code in case of a
     CPU hotplug failure.

   - Prevent boot failures on HP superdome machines by invalidating the
     level2 kernel pagetable entries outside of the kernel area as
     invalid so BIOS reserved space won't be touched unintentionally.

     Also ensure that memory holes are rounded up to the next PMD
     boundary correctly.

   - Enable X2APIC support on Hyper-V to prevent boot failures.

   - Set the paravirt name when running on Hyper-V for consistency

   - Move a function under the appropriate ifdef guard to prevent build
     warnings"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/acpi: Move get_cmdline_acpi_rsdp() under #ifdef guard
  x86/hyperv: Set pv_info.name to "Hyper-V"
  x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu
  x86/hyperv: Make vapic support x2apic mode
  x86/boot/64: Round memory hole size up to next PMD page
  x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
2019-10-20 06:31:14 -04:00
Linus Torvalds 81c4bc31c4 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A small set of irq chip driver fixes and updates:

   - Update the SIFIVE PLIC interrupt driver to use the fasteoi handler
     to address the shortcomings of the existing flow handling which was
     prone to lose interrupts

   - Use the proper limit for GIC interrupt line numbers

   - Add retrigger support for the recently merged Anapurna Labs Fabric
     interrupt controller to make it complete

   - Enable the ATMEL AIC5 interrupt controller driver on the new
     SAM9X60 SoC"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/sifive-plic: Switch to fasteoi flow
  irqchip/gic-v3: Fix GIC_LINE_NR accessor
  irqchip/atmel-aic5: Add support for sam9x60 irqchip
  irqchip/al-fic: Add support for irq retrigger
2019-10-20 06:27:54 -04:00
Linus Torvalds 188768f3c0 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull hrtimer fixlet from Thomas Gleixner:
 "A single commit annotating the lockcless access to timer->base with
  READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Annotate lockless access to timer->base
2019-10-20 06:25:12 -04:00
Linus Torvalds 589f1222e0 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stop-machine fix from Thomas Gleixner:
 "A single fix, amending stop machine with WRITE/READ_ONCE() to address
  the fallout of KCSAN"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stop_machine: Avoid potential race behaviour
2019-10-20 06:22:25 -04:00
Linus Torvalds 531e93d114 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "I was battling a cold after some recent trips, so quite a bit piled up
  meanwhile, sorry about that.

  Highlights:

   1) Fix fd leak in various bpf selftests, from Brian Vazquez.

   2) Fix crash in xsk when device doesn't support some methods, from
      Magnus Karlsson.

   3) Fix various leaks and use-after-free in rxrpc, from David Howells.

   4) Fix several SKB leaks due to confusion of who owns an SKB and who
      should release it in the llc code. From Eric Biggers.

   5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet.

   6) Jumbo packets don't work after resume on r8169, as the BIOS resets
      the chip into non-jumbo mode during suspend. From Heiner Kallweit.

   7) Corrupt L2 header during MPLS push, from Davide Caratti.

   8) Prevent possible infinite loop in tc_ctl_action, from Eric
      Dumazet.

   9) Get register bits right in bcmgenet driver, based upon chip
      version. From Florian Fainelli.

  10) Fix mutex problems in microchip DSA driver, from Marek Vasut.

  11) Cure race between route lookup and invalidation in ipv4, from Wei
      Wang.

  12) Fix performance regression due to false sharing in 'net'
      structure, from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits)
  net: reorder 'struct net' fields to avoid false sharing
  net: dsa: fix switch tree list
  net: ethernet: dwmac-sun8i: show message only when switching to promisc
  net: aquantia: add an error handling in aq_nic_set_multicast_list
  net: netem: correct the parent's backlog when corrupted packet was dropped
  net: netem: fix error path for corrupted GSO frames
  macb: propagate errors when getting optional clocks
  xen/netback: fix error path of xenvif_connect_data()
  net: hns3: fix mis-counting IRQ vector numbers issue
  net: usb: lan78xx: Connect PHY before registering MAC
  vsock/virtio: discard packets if credit is not respected
  vsock/virtio: send a credit update when buffer size is changed
  mlxsw: spectrum_trap: Push Ethernet header before reporting trap
  net: ensure correct skb->tstamp in various fragmenters
  net: bcmgenet: reset 40nm EPHY on energy detect
  net: bcmgenet: soft reset 40nm EPHYs before MAC init
  net: phy: bcm7xxx: define soft_reset for 40nm EPHY
  net: bcmgenet: don't set phydev->link from MAC
  net: Update address for MediaTek ethernet driver in MAINTAINERS
  ipv4: fix race condition between route lookup and invalidation
  ...
2019-10-19 17:09:11 -04:00
Eric Dumazet 2a06b8982f net: reorder 'struct net' fields to avoid false sharing
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.

Indeed, every time a new TCP socket is created or deleted,
the atomic counter net->count is touched (via get_net(net)
and put_net(net) calls)

So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.

We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.

We move in the first cache line fields that can be
dirtied often.

We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.

Fixes: 355b985537 ("netns: provide pure entropy for net_hash_mix()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:21:53 -07:00
Vivien Didelot 50c7d2ba9d net: dsa: fix switch tree list
If there are multiple switch trees on the device, only the last one
will be listed, because the arguments of list_add_tail are swapped.

Fixes: 83c0afaec7 ("net: dsa: Add new binding implementation")
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:19:41 -07:00
Mans Rullgard 05908d72cc net: ethernet: dwmac-sun8i: show message only when switching to promisc
Printing the info message every time more than the max number of mac
addresses are requested generates unnecessary log spam.  Showing it only
when the hw is not already in promiscous mode is equally informative
without being annoying.

Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:18:10 -07:00
Chenwandun 3d00cf2fbb net: aquantia: add an error handling in aq_nic_set_multicast_list
add an error handling in aq_nic_set_multicast_list, it may not
work when hw_multicast_list_set error; and at the same time
it will remove gcc Wunused-but-set-variable warning.

Signed-off-by: Chenwandun <chenwandun@huawei.com>
Reviewed-by: Igor Russkikh <igor.russkikh@aquantia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:16:38 -07:00
David S. Miller 708738376c Merge branch 'netem-fix-further-issues-with-packet-corruption'
Jakub Kicinski says:

====================
net: netem: fix further issues with packet corruption

This set is fixing two more issues with the netem packet corruption.

First patch (which was previously posted) avoids NULL pointer dereference
if the first frame gets freed due to allocation or checksum failure.
v2 improves the clarity of the code a little as requested by Cong.

Second patch ensures we don't return SUCCESS if the frame was in fact
dropped. Thanks to this commit message for patch 1 no longer needs the
"this will still break with a single-frame failure" disclaimer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:12:36 -07:00
Jakub Kicinski e0ad032e14 net: netem: correct the parent's backlog when corrupted packet was dropped
If packet corruption failed we jump to finish_segs and return
NET_XMIT_SUCCESS. Seeing success will make the parent qdisc
increment its backlog, that's incorrect - we need to return
NET_XMIT_DROP.

Fixes: 6071bd1aa1 ("netem: Segment GSO packets on enqueue")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:12:36 -07:00
Jakub Kicinski a7fa12d158 net: netem: fix error path for corrupted GSO frames
To corrupt a GSO frame we first perform segmentation.  We then
proceed using the first segment instead of the full GSO skb and
requeue the rest of the segments as separate packets.

If there are any issues with processing the first segment we
still want to process the rest, therefore we jump to the
finish_segs label.

Commit 177b800746 ("net: netem: fix backlog accounting for
corrupted GSO frames") started using the pointer to the first
segment in the "rest of segments processing", but as mentioned
above the first segment may had already been freed at this point.

Backlog corrections for parent qdiscs have to be adjusted.

Fixes: 177b800746 ("net: netem: fix backlog accounting for corrupted GSO frames")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:12:35 -07:00
Michael Tretter bd310aca44 macb: propagate errors when getting optional clocks
The tx_clk, rx_clk, and tsu_clk are optional. Currently the macb driver
marks clock as not available if it receives an error when trying to get
a clock. This is wrong, because a clock controller might return
-EPROBE_DEFER if a clock is not available, but will eventually become
available.

In these cases, the driver would probe successfully but will never be
able to adjust the clocks, because the clocks were not available during
probe, but became available later.

For example, the clock controller for the ZynqMP is implemented in the
PMU firmware and the clocks are only available after the firmware driver
has been probed.

Use devm_clk_get_optional() in instead of devm_clk_get() to get the
optional clock and propagate all errors to the calling function.

Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 11:58:39 -07:00
Juergen Gross 3d5c1a037d xen/netback: fix error path of xenvif_connect_data()
xenvif_connect_data() calls module_put() in case of error. This is
wrong as there is no related module_get().

Remove the superfluous module_put().

Fixes: 279f438e36 ("xen-netback: Don't destroy the netdev until the vif is shut down")
Cc: <stable@vger.kernel.org> # 3.12
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 11:43:29 -07:00
Yonglong Liu 580a05f9d4 net: hns3: fix mis-counting IRQ vector numbers issue
Currently, the num_msi_left means the vector numbers of NIC,
but if the PF supported RoCE, it contains the vector numbers
of NIC and RoCE(Not expected).

This may cause interrupts lost in some case, because of the
NIC module used the vector resources which belongs to RoCE.

This patch adds a new variable num_nic_msi to store the vector
numbers of NIC, and adjust the default TQP numbers and rss_size
according to the value of num_nic_msi.

Fixes: 46a3df9f97 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 11:40:55 -07:00
Linus Torvalds 998d75510e Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Rather a lot of fixes, almost all affecting mm/"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  scripts/gdb: fix debugging modules on s390
  kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
  mm/thp: allow dropping THP from page cache
  mm/vmscan.c: support removing arbitrary sized pages from mapping
  mm/thp: fix node page state in split_huge_page_to_list()
  proc/meminfo: fix output alignment
  mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
  mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
  mm: include <linux/huge_mm.h> for is_vma_temporary_stack
  zram: fix race between backing_dev_show and backing_dev_store
  mm/memcontrol: update lruvec counters in mem_cgroup_move_account
  ocfs2: fix panic due to ocfs2_wq is null
  hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
  mm: memblock: do not enforce current limit for memblock_phys* family
  mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
  mm/gup: fix a misnamed "write" argument, and a related bug
  mm/gup_benchmark: add a missing "w" to getopt string
  ocfs2: fix error handling in ocfs2_setattr()
  mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
  mm/memunmap: don't access uninitialized memmap in memunmap_pages()
  ...
2019-10-19 06:53:59 -04:00
Ilya Leoshkevich 585d730d41 scripts/gdb: fix debugging modules on s390
Currently lx-symbols assumes that module text is always located at
module->core_layout->base, but s390 uses the following layout:

  +------+  <- module->core_layout->base
  | GOT  |
  +------+  <- module->core_layout->base + module->arch->plt_offset
  | PLT  |
  +------+  <- module->core_layout->base + module->arch->plt_offset +
  | TEXT |     module->arch->plt_size
  +------+

Therefore, when trying to debug modules on s390, all the symbol
addresses are skewed by plt_offset + plt_size.

Fix by adding plt_offset + plt_size to module_addr in
load_module_symbols().

Link: http://lkml.kernel.org/r/20191017085917.81791-1-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Song Liu aa5de305c9 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries.  On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe.  This
is because perf_event based trace_uprobe calls uprobe_unregister twice
on close: first in TRACE_REG_PERF_CLOSE, then in
TRACE_REG_PERF_UNREGISTER.  The second call will split the PMD mapped
page table entry, which is not the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
Fixes: 5a52c9df62 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Kirill A. Shutemov ef18a1ca84 mm/thp: allow dropping THP from page cache
Once a THP is added to the page cache, it cannot be dropped via
/proc/sys/vm/drop_caches.  Fix this issue with proper handling in
invalidate_mapping_pages().

Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
William Kucharski 906d278d75 mm/vmscan.c: support removing arbitrary sized pages from mapping
__remove_mapping() assumes that pages can only be either base pages or
HPAGE_PMD_SIZE.  Ask the page what size it is.

Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Kirill A. Shutemov 06d3eff62d mm/thp: fix node page state in split_huge_page_to_list()
Make sure split_huge_page_to_list() handles the state of shmem THP and
file THP properly.

Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Kirill A. Shutemov 2be5fbf9a9 proc/meminfo: fix output alignment
Patch series "Fixes for THP in page cache", v2.

This patch (of 5):

Add extra space for FileHugePages and FilePmdMapped, so the output is
aligned with other rows.

Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Ben Dooks (Codethink) a2ae8c0551 mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
mm_init.c needs to include <linux/mman.h> for the definition of
vm_committed_as_batch.  Fixes the following sparse warning:

  mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Ben Dooks d0e6a5821c mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
fix the following warning:

  mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Ben Dooks 444f84fd2a mm: include <linux/huge_mm.h> for is_vma_temporary_stack
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
to fix the following sparse warning:

  mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Chenwandun f7daefe423 zram: fix race between backing_dev_show and backing_dev_store
CPU0:				       CPU1:
backing_dev_show		       backing_dev_store
    ......				   ......
    file = zram->backing_dev;
    down_read(&zram->init_lock);	   down_read(&zram->init_init_lock)
    file_path(file, ...);		   zram->backing_dev = backing_dev;
    up_read(&zram->init_lock);		   up_read(&zram->init_lock);

gets the value of zram->backing_dev too early in backing_dev_show, which
resultin the value being NULL at the beginning, and not NULL later.

backtrace:
  d_path+0xcc/0x174
  file_path+0x10/0x18
  backing_dev_show+0x40/0xb4
  dev_attr_show+0x20/0x54
  sysfs_kf_seq_show+0x9c/0x10c
  kernfs_seq_show+0x28/0x30
  seq_read+0x184/0x488
  kernfs_fop_read+0x5c/0x1a4
  __vfs_read+0x44/0x128
  vfs_read+0xa0/0x138
  SyS_read+0x54/0xb4

Link: http://lkml.kernel.org/r/1571046839-16814-1-git-send-email-chenwandun@huawei.com
Signed-off-by: Chenwandun <chenwandun@huawei.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Konstantin Khlebnikov ae8af4388d mm/memcontrol: update lruvec counters in mem_cgroup_move_account
Mapped, dirty and writeback pages are also counted in per-lruvec stats.
These counters needs update when page is moved between cgroups.

Currently is nobody *consuming* the lruvec versions of these counters and
that there is no user-visible effect.

Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
Fixes: 00f3ca2c2d ("mm: memcontrol: per-lruvec stats infrastructure")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Yi Li b918c43021 ocfs2: fix panic due to ocfs2_wq is null
mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters
an error.  ocfs2_initialize_super() returns before allocating ocfs2_wq.
ocfs2_dismount_volume() triggers the following panic.

  Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
  ------------[ cut here ]------------
  Oops: 0002 [#1] SMP NOPTI
  CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G  E
        4.14.148-200.ckv.x86_64 #1
  Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
  task: ffff967af0520000 task.stack: ffffa5f05484000
  RIP: 0010:mutex_lock+0x19/0x20
  Call Trace:
    flush_workqueue+0x81/0x460
    ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
    ocfs2_dismount_volume+0x84/0x400 [ocfs2]
    ocfs2_fill_super+0xa4/0x1270 [ocfs2]
    ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
    mount_bdev+0x17f/0x1c0
    mount_fs+0x3a/0x160

Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
Signed-off-by: Yi Li <yilikernel@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
David Hildenbrand f231fe4235 hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
Uninitialized memmaps contain garbage and in the worst case trigger
kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
touched.

Let's make sure that we only consider online memory (managed by the
buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.

page_zone() will call page_to_nid(), which will trigger
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
can be the case when an offline memory block (e.g., never onlined) is
spanned by a zone.

Note: As explained by Michal in [1], alloc_contig_range() will verify
the range.  So it boils down to the wrong access in this function.

[1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz

Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: <stable@vger.kernel.org>	[4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Mike Rapoport f3057ad767 mm: memblock: do not enforce current limit for memblock_phys* family
Until commit 92d12f9544 ("memblock: refactor internal allocation
functions") the maximal address for memblock allocations was forced to
memblock.current_limit only for the allocation functions returning
virtual address.  The changes introduced by that commit moved the limit
enforcement into the allocation core and as a result the allocation
functions returning physical address also started to limit allocations
to memblock.current_limit.

This caused breakage of etnaviv GPU driver:

  etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
  etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
  etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
  etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
  etnaviv-gpu 130000.gpu: command buffer outside valid memory window
  etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
  etnaviv-gpu 134000.gpu: command buffer outside valid memory window
  etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
  etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0

Restore the behaviour of memblock_phys* family so that these functions
will not enforce memblock.current_limit.

Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
Fixes: 92d12f9544 ("memblock: refactor internal allocation functions")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Adam Ford <aford173@gmail.com>
Tested-by: Adam Ford <aford173@gmail.com>	[imx6q-logicpd]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Honglei Wang b11edebbc9 mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
Commit 1a61ab8038 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") has made lruvec_page_state to use per-cpu counters
instead of calculating it directly from lru_zone_size with an idea that
this would be more effective.

Tim has reported that this is not really the case for their database
benchmark which is showing an opposite results where lruvec_page_state
is taking up a huge chunk of CPU cycles (about 25% of the system time
which is roughly 7% of total cpu cycles) on 5.3 kernels.  The workload
is running on a larger machine (96cpus), it has many cgroups (500) and
it is heavily direct reclaim bound.

Tim Chen said:

: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
:
:
: 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
:             |
:             |--3.07%--lruvec_lru_size
:             |          |
:             |          |--2.11%--cpumask_next
:             |          |          |
:             |          |           --1.66%--find_next_bit
:             |          |
:             |           --0.57%--call_function_interrupt
:             |                     |
:             |                      --0.55%--smp_call_function_interrupt
:             |
:             |--1.59%--0x441f0fc3d009
:             |          _ops_rdtsc_init_base_freq
:             |          access_histogram
:             |          page_fault
:             |          __do_page_fault
:             |          handle_mm_fault
:             |          __handle_mm_fault
:             |          |
:             |           --1.54%--do_swap_page
:             |                     swapin_readahead
:             |                     swap_cluster_readahead
:             |                     |
:             |                      --1.53%--read_swap_cache_async
:             |                                __read_swap_cache_async
:             |                                alloc_pages_vma
:             |                                __alloc_pages_nodemask
:             |                                __alloc_pages_slowpath
:             |                                try_to_free_pages
:             |                                do_try_to_free_pages
:             |                                shrink_node
:             |                                shrink_node_memcg
:             |                                |
:             |                                |--0.77%--lruvec_lru_size
:             |                                |
:             |                                 --0.76%--inactive_list_is_low
:             |                                           |
:             |                                            --0.76%--lruvec_lru_size
:             |
:              --1.50%--measure_read
:                        page_fault
:                        __do_page_fault
:                        handle_mm_fault
:                        __handle_mm_fault
:                        do_swap_page
:                        swapin_readahead
:                        swap_cluster_readahead
:                        |
:                         --1.48%--read_swap_cache_async
:                                   __read_swap_cache_async
:                                   alloc_pages_vma
:                                   __alloc_pages_nodemask
:                                   __alloc_pages_slowpath
:                                   try_to_free_pages
:                                   do_try_to_free_pages
:                                   shrink_node
:                                   shrink_node_memcg
:                                   |
:                                   |--0.75%--inactive_list_is_low
:                                   |          |
:                                   |           --0.75%--lruvec_lru_size
:                                   |
:                                    --0.73%--lruvec_lru_size

The likely culprit is the cache traffic the lruvec_page_state_local
generates.  Dave Hansen says:

: I was thinking purely of the cache footprint.  If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles

Fix the regression by partially reverting the said commit and calculate
the lru size explicitly.

Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
John Hubbard 0cd22afdce mm/gup: fix a misnamed "write" argument, and a related bug
In several routines, the "flags" argument is incorrectly named "write".
Change it to "flags".

Also, in one place, the misnaming led to an actual bug:
"flags & FOLL_WRITE" is required, rather than just "flags".
(That problem was flagged by krobot, in v1 of this patch.)

Also, change the flags argument from int, to unsigned int.

You can see that this was a simple oversight, because the
calling code passes "flags" to the fifth argument:

gup_pgd_range():
    ...
    if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
		    PGDIR_SHIFT, next, flags, pages, nr))

...which, until this patch, the callees referred to as "write".

Also, change two lines to avoid checkpatch line length
complaints, and another line to fix another oversight
that checkpatch called out: missing "int" on pdshift.

Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com
Fixes: b798bec474 ("mm/gup: change write parameter to flags in fast walk")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
John Hubbard 6f24c8d30d mm/gup_benchmark: add a missing "w" to getopt string
Even though gup_benchmark.c has code to handle the -w command-line option,
the "w" is not part of the getopt string.  It looks as if it has been
missing the whole time.

On my machine, this leads naturally to the following predictable result:

  $ sudo ./gup_benchmark -w
  ./gup_benchmark: invalid option -- 'w'

...which is fixed with this commit.

Link: http://lkml.kernel.org/r/20191014184639.1512873-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Chengguang Xu ce750f43f5 ocfs2: fix error handling in ocfs2_setattr()
Should set transfer_to[USRQUOTA/GRPQUOTA] to NULL on error case before
jumping to do dqput().

Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Roman Gushchin b749ecfaf6 mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
Karsten reported the following panic in __free_slab() happening on a s390x
machine:

  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address: 0000000000000000 TEID: 0000000000000483
  Fault in home space mode while using kernel ASCE.
  AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
  Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
  Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
  Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
  Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
  Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
             0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
             0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
             0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
             000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
  Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
             00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
            #00000000003cadb2: a7f4fe36            brc     15,3caa1e
            >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
             00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
             00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
             00000000003cadc6: a7f4fe2c            brc     15,3caa
             00000000003cadc6: a7f4fe2c            brc     15,3caa1e
             00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
  Call Trace:
  (<00000000003cabcc> __free_slab+0x49c/0x6b0)
   <00000000001f5886> rcu_core+0x5a6/0x7e0
   <0000000000ca2dea> __do_softirq+0xf2/0x5c0
   <0000000000152644> irq_exit+0x104/0x130
   <000000000010d222> do_IRQ+0x9a/0xf0
   <0000000000ca2344> ext_int_handler+0x130/0x134
   <0000000000103648> enabled_wait+0x58/0x128
  (<0000000000103634> enabled_wait+0x44/0x128)
   <0000000000103b00> arch_cpu_idle+0x40/0x58
   <0000000000ca0544> default_idle_call+0x3c/0x68
   <000000000018eaa4> do_idle+0xec/0x1c0
   <000000000018ee0e> cpu_startup_entry+0x36/0x40
   <000000000122df34> arch_call_rest_init+0x5c/0x88
   <0000000000000000> 0x0
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
   <00000000003ca8f4> __free_slab+0x1c4/0x6b0
  Kernel panic - not syncing: Fatal exception in interrupt

The kernel panics on an attempt to dereference the NULL memcg pointer.
When shutdown_cache() is called from the kmem_cache_destroy() context, a
memcg kmem_cache might have empty slab pages in a partial list, which are
still charged to the memory cgroup.

These pages are released by free_partial() at the beginning of
shutdown_cache(): either directly or by scheduling a RCU-delayed work
(if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag).  The latter case
is when the reported panic can happen: memcg_unlink_cache() is called
immediately after shrinking partial lists, without waiting for scheduled
RCU works.  It sets the kmem_cache->memcg_params.memcg pointer to NULL,
and the following attempt to dereference it by __free_slab() from the
RCU work context causes the panic.

To fix the issue, let's postpone the release of the memcg pointer to
destroy_memcg_params().  It's called from a separate work context by
slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
This guarantees that all scheduled page release RCU works will complete
before the memcg pointer will be zeroed.

Big thanks for Karsten for the perfect report containing all necessary
information, his help with the analysis of the problem and testing of the
fix.

Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
Fixes: fb2f2b0adb ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Karsten Graul <kgraul@linux.ibm.com>
Tested-by: Karsten Graul <kgraul@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00