Commit Graph

872045 Commits

Author SHA1 Message Date
Eugeniy Paltsev e1b2743d70 ARC: [plat-axs10x]: remove hardcoded video mode from bootargs
Now have pixel clock PLL driver and we can change pixel clock rate
so we don't need to enforce one exact video mode. Moreover enforcing
video mode is harmful in case of we enforce mode which isn't
supported by the monitor we are using.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-11-20 09:13:42 -08:00
Eugeniy Paltsev 93c53f2397 ARC: [plat-axs10x]: use pgu pll instead of fixed clock
Use PLL driver instead of fixed-clock for PGU pixel clock.
That allows us to support wider range of graphic modes.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-11-20 09:13:40 -08:00
Eugeniy Paltsev f091d5a426 ARC: ARCv2: jump label: implement jump label patching
Implement jump label patching for ARC. Jump labels provide
an interface to generate dynamic branches using
self-modifying code.

This allows us to implement conditional branches where
changing branch direction is expensive but branch selection
is basically 'free'

This implementation uses 32-bit NOP and BRANCH instructions
which forced to be aligned by 4 to guarantee that they don't
cross L1 cache line boundary and can be update atomically.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-11-08 16:14:32 -08:00
Vineet Gupta 2f4ecf68a0 ARC: mm: tlb flush optim: elide redundant uTLB invalidates for MMUv3
For MMUv3 (and prior) the flush_tlb_{range,mm,page} API use the MMU
TLBWrite cmd which already nukes the entire uTLB, so NO need for
additional IVUTLB cmd from utlb_invalidate() - hence this patch

local_flush_tlb_all() is special since it uses a weaker TLBWriteNI
cmd (prec commit) to shoot down JTLB, hence we retain the explicit
uTLB flush

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:32 -07:00
Vineet Gupta 1355ea2e60 ARC: mm: tlb flush optim: elide repeated uTLB invalidate in loop
The unconditional full TLB flush (on say ASID rollover) iterates over each
entry and uses TLBWrite to zero it out. TLBWrite by design also invalidates
the uTLBs thus we end up invalidating it as many times as numbe rof
entries (512 or 1k)

Optimize this by using a weaker TLBWriteNI cmd in loop, which doesn't
tinker with uTLBs and an explicit one time IVUTLB, outside the loop to
invalidate them all once.

And given the optimiztion, the IVUTLB is now needed on MMUv4 too where
the uTLBs and JTLBs are otherwise coherent given the TLBInsertEntry /
TLBDeleteEntry commands

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:32 -07:00
Vineet Gupta ad4c40e937 ARC: mm: tlb flush optim: Make TLBWriteNI fallback to TLBWrite if not available
TLBWriteNI was introduced in MMUv2 (to not invalidate uTLBs in Fast Path
TLB Refill Handler). To avoid #ifdef'ery make it fallback to TLBWrite availabel on all MMUs. This will also help with next change

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:32 -07:00
Vineet Gupta f4e2f7cc69 ARC: mm: TLB Miss optim: avoid re-reading ECR
For setting PTE Dirty bit, reuse the prior test for ST miss.

No need to reload ECR and test for ST cause code as the prev
condition code is still valid (uncloberred)

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:32 -07:00
Vineet Gupta 0fb1f35ed9 ARCv2: mm: TLB Miss optim: Use double world load/stores LDD/STD
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:32 -07:00
Vineet Gupta cfd9d70a85 ARCv2: mm: TLB Miss optim: SMP builds can cache pgd pointer in mmu scratch reg
ARC700 exception (and intr handling) didn't have auto stack switching
thus had to rely on stashing a reg temporarily (to free it up) at a
known place in memory, allowing to code up the low level stack switching.
This however was not re-entrant in SMP which thus had to repurpose the
per-cpu MMU SCRATCH DATA register otherwise used to "cache" the task pdg
pointer (vs. reading it from mm struct)

The newer HS cores do have auto-stack switching and thus even SMP builds
can use the MMU SCRATCH reg as originally intended.

This patch fixes the restriction to ARC700 SMP builds only

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 7b491c0b62 ARC: nSIM_700: remove unused network options
We have snps,arc-emac enabled in nSIM_700. It's obsolete and it's
not used anymore so remove its device tree node and disable
unused network options in defconfig.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 9c6375f77b ARC: nSIM_700: switch to DW UART usage
Switch nsim_700_defconfig to dwuart for consistent uart settings
for all nSIM configurations.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 1681baa713 ARC: merge HAPS-HS with nSIM-HS configs
Starting from nSIM 2019.06 is possible to use DW UART
instead of ARC UART. That allows us to merge
"nsim_hs" with "haps_hs" and "nsim_hs_smp" with "haps_hs_smp"
with minor changes which were done in previous commits.

We eliminate nsim_hs_defconfig and nsim_hs_smp_defconfig
and leave haps_hs_defconfig and haps_hs_smp_defconfig
which can be used on HAPS / nSIM / ZEBU / QEMU platforms
without additional changes in Linux kernel.

For nSIM we should now use UART property values
"-prop=nsim_mem-dev=uart0,kind=dwuart,base=0xf0000000"
instead of previously used
"-prop=nsim_mem-dev=uart0,base=0xc0fc1000"
"use_connect" and "irq" values of UART property remains untouched.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 8ae5bb05d7 ARC: HAPS: cleanup defconfigs from unused ETH drivers
We have multiple vendors ethernet drivers enabled in haps_hs and
haps_hs_smp defconfig. The only one we possibly require is
VIRTIO_NET. So disable unused ones via disabling entire
CONFIG_ETHERNET which controls all vendor-specific ethernet
drivers.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 14fa486f5a ARC: HAPS: add HIGHMEM memory zone to DTS
This is required as a preparation of merging nSIM and HASP
defonfig and device tree.

As we have HIGHMEM disabled in both HAPS and nSIM defconfigs
this doesn't lead to any functional change.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 3696fc9774 ARC: HAPS: use same UART configuration everywhere
For some reason we use ns8250 UART compatible on UP HAPS
configuration and ns16550a (which is ns8250 with FIFO support)
on SMP HAPS configuration.
Given that we have same UART IP with same IP configuration
on both HAPS configuration use ns16550a compatible everywhere.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev 4c36543e50 ARC: HAPS: cleanup defconfigs from unused IO-related options
We don't have any peripherals on HAPS which may require FB or
input_devices support. So get rid of them.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:31 -07:00
Eugeniy Paltsev ea25a153ee ARC: regenerate nSIM and HAPS defconfigs
No functional change intended.

Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2019-10-28 12:12:30 -07:00
Linus Torvalds 7d194c2100 Linux 5.4-rc4 2019-10-20 15:56:22 -04:00
Linus Torvalds e2ab4ef83f Kbuild fixes for v5.4 (2nd)
- fix a bashism of setlocalversion
 
  - do not use the too new --sort option of tar
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2sh9MeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGO90P/13dNgaibb3JcCvp
 +ugn+o26V+aqZ8uDLPPifUNFLLnDfGmgORAOe0Z2mGI2CqMDaWMYjvyZNF1KzOdD
 +WFZubiTUevYyM9AOG6O6EOGEDRumgvdNhnJIP+aCUNKGM50zzhuAa23y6Xip1NZ
 qJY5AYu9KiH9jTXtZAdGReuMdvNRsMZuvmI4Qnc21utItVxDFEQKGdTlRoJaGEdN
 Zc/f1HzWwlr7VWZiy0iMRSCAAtAA3TN3S1bw5QCYsuJb3LRl2EIOKMeSinRTYWRc
 zU5CHrkRSX4/M/o8VantTOZueuGMfbaOAl1KEDh8DUPDcrl+gmL4s8aFuDW+5d2+
 7pPlEe/FNZOAGVBDgiK59N3bl2Rn/Uyg9VGwFc8PU3VeY5VbS97ayBaLFcua2zwX
 88mtsZoq+fN3qsq8gfbb21qauTbEibyrk8lzjHoQl07TW82GFWG/GXSt9dQAm1sb
 w/M15WdaONKOzkYXGGCo0l6vdXx1+qqV/LZQcaWhfcoeU5YVVEmSRwCakCxxSWUE
 rvWtn0vljgH8kyb0gnUP6qLJC9clwMv58sl/mG65/D7zs9B74g16BkX0tU8vLSpr
 Ww1BpGtPwuFPp7+kEzWemtlulhDvow2xIX7bNIkAk1IIVpyCmU98AKpivR7G9Cvt
 4cZPSGRU+s2ou/hrtjryzvmwhGUG
 =KNWF
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild fixes from Masahiro Yamada:

 - fix a bashism of setlocalversion

 - do not use the too new --sort option of tar

* tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: substituting --sort in archive creation
  scripts: setlocalversion: fix a bashism
  kbuild: update comment about KBUILD_ALLDIRS
2019-10-20 12:36:57 -04:00
Linus Torvalds 4fe34d61a3 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A small set of x86 fixes:

   - Prevent a NULL pointer dereference in the X2APIC code in case of a
     CPU hotplug failure.

   - Prevent boot failures on HP superdome machines by invalidating the
     level2 kernel pagetable entries outside of the kernel area as
     invalid so BIOS reserved space won't be touched unintentionally.

     Also ensure that memory holes are rounded up to the next PMD
     boundary correctly.

   - Enable X2APIC support on Hyper-V to prevent boot failures.

   - Set the paravirt name when running on Hyper-V for consistency

   - Move a function under the appropriate ifdef guard to prevent build
     warnings"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/acpi: Move get_cmdline_acpi_rsdp() under #ifdef guard
  x86/hyperv: Set pv_info.name to "Hyper-V"
  x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu
  x86/hyperv: Make vapic support x2apic mode
  x86/boot/64: Round memory hole size up to next PMD page
  x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
2019-10-20 06:31:14 -04:00
Linus Torvalds 81c4bc31c4 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A small set of irq chip driver fixes and updates:

   - Update the SIFIVE PLIC interrupt driver to use the fasteoi handler
     to address the shortcomings of the existing flow handling which was
     prone to lose interrupts

   - Use the proper limit for GIC interrupt line numbers

   - Add retrigger support for the recently merged Anapurna Labs Fabric
     interrupt controller to make it complete

   - Enable the ATMEL AIC5 interrupt controller driver on the new
     SAM9X60 SoC"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/sifive-plic: Switch to fasteoi flow
  irqchip/gic-v3: Fix GIC_LINE_NR accessor
  irqchip/atmel-aic5: Add support for sam9x60 irqchip
  irqchip/al-fic: Add support for irq retrigger
2019-10-20 06:27:54 -04:00
Linus Torvalds 188768f3c0 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull hrtimer fixlet from Thomas Gleixner:
 "A single commit annotating the lockcless access to timer->base with
  READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Annotate lockless access to timer->base
2019-10-20 06:25:12 -04:00
Linus Torvalds 589f1222e0 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stop-machine fix from Thomas Gleixner:
 "A single fix, amending stop machine with WRITE/READ_ONCE() to address
  the fallout of KCSAN"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stop_machine: Avoid potential race behaviour
2019-10-20 06:22:25 -04:00
Linus Torvalds 531e93d114 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "I was battling a cold after some recent trips, so quite a bit piled up
  meanwhile, sorry about that.

  Highlights:

   1) Fix fd leak in various bpf selftests, from Brian Vazquez.

   2) Fix crash in xsk when device doesn't support some methods, from
      Magnus Karlsson.

   3) Fix various leaks and use-after-free in rxrpc, from David Howells.

   4) Fix several SKB leaks due to confusion of who owns an SKB and who
      should release it in the llc code. From Eric Biggers.

   5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet.

   6) Jumbo packets don't work after resume on r8169, as the BIOS resets
      the chip into non-jumbo mode during suspend. From Heiner Kallweit.

   7) Corrupt L2 header during MPLS push, from Davide Caratti.

   8) Prevent possible infinite loop in tc_ctl_action, from Eric
      Dumazet.

   9) Get register bits right in bcmgenet driver, based upon chip
      version. From Florian Fainelli.

  10) Fix mutex problems in microchip DSA driver, from Marek Vasut.

  11) Cure race between route lookup and invalidation in ipv4, from Wei
      Wang.

  12) Fix performance regression due to false sharing in 'net'
      structure, from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits)
  net: reorder 'struct net' fields to avoid false sharing
  net: dsa: fix switch tree list
  net: ethernet: dwmac-sun8i: show message only when switching to promisc
  net: aquantia: add an error handling in aq_nic_set_multicast_list
  net: netem: correct the parent's backlog when corrupted packet was dropped
  net: netem: fix error path for corrupted GSO frames
  macb: propagate errors when getting optional clocks
  xen/netback: fix error path of xenvif_connect_data()
  net: hns3: fix mis-counting IRQ vector numbers issue
  net: usb: lan78xx: Connect PHY before registering MAC
  vsock/virtio: discard packets if credit is not respected
  vsock/virtio: send a credit update when buffer size is changed
  mlxsw: spectrum_trap: Push Ethernet header before reporting trap
  net: ensure correct skb->tstamp in various fragmenters
  net: bcmgenet: reset 40nm EPHY on energy detect
  net: bcmgenet: soft reset 40nm EPHYs before MAC init
  net: phy: bcm7xxx: define soft_reset for 40nm EPHY
  net: bcmgenet: don't set phydev->link from MAC
  net: Update address for MediaTek ethernet driver in MAINTAINERS
  ipv4: fix race condition between route lookup and invalidation
  ...
2019-10-19 17:09:11 -04:00
Eric Dumazet 2a06b8982f net: reorder 'struct net' fields to avoid false sharing
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.

Indeed, every time a new TCP socket is created or deleted,
the atomic counter net->count is touched (via get_net(net)
and put_net(net) calls)

So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.

We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.

We move in the first cache line fields that can be
dirtied often.

We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.

Fixes: 355b985537 ("netns: provide pure entropy for net_hash_mix()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:21:53 -07:00
Vivien Didelot 50c7d2ba9d net: dsa: fix switch tree list
If there are multiple switch trees on the device, only the last one
will be listed, because the arguments of list_add_tail are swapped.

Fixes: 83c0afaec7 ("net: dsa: Add new binding implementation")
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:19:41 -07:00
Mans Rullgard 05908d72cc net: ethernet: dwmac-sun8i: show message only when switching to promisc
Printing the info message every time more than the max number of mac
addresses are requested generates unnecessary log spam.  Showing it only
when the hw is not already in promiscous mode is equally informative
without being annoying.

Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:18:10 -07:00
Chenwandun 3d00cf2fbb net: aquantia: add an error handling in aq_nic_set_multicast_list
add an error handling in aq_nic_set_multicast_list, it may not
work when hw_multicast_list_set error; and at the same time
it will remove gcc Wunused-but-set-variable warning.

Signed-off-by: Chenwandun <chenwandun@huawei.com>
Reviewed-by: Igor Russkikh <igor.russkikh@aquantia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:16:38 -07:00
David S. Miller 708738376c Merge branch 'netem-fix-further-issues-with-packet-corruption'
Jakub Kicinski says:

====================
net: netem: fix further issues with packet corruption

This set is fixing two more issues with the netem packet corruption.

First patch (which was previously posted) avoids NULL pointer dereference
if the first frame gets freed due to allocation or checksum failure.
v2 improves the clarity of the code a little as requested by Cong.

Second patch ensures we don't return SUCCESS if the frame was in fact
dropped. Thanks to this commit message for patch 1 no longer needs the
"this will still break with a single-frame failure" disclaimer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:12:36 -07:00
Jakub Kicinski e0ad032e14 net: netem: correct the parent's backlog when corrupted packet was dropped
If packet corruption failed we jump to finish_segs and return
NET_XMIT_SUCCESS. Seeing success will make the parent qdisc
increment its backlog, that's incorrect - we need to return
NET_XMIT_DROP.

Fixes: 6071bd1aa1 ("netem: Segment GSO packets on enqueue")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:12:36 -07:00
Jakub Kicinski a7fa12d158 net: netem: fix error path for corrupted GSO frames
To corrupt a GSO frame we first perform segmentation.  We then
proceed using the first segment instead of the full GSO skb and
requeue the rest of the segments as separate packets.

If there are any issues with processing the first segment we
still want to process the rest, therefore we jump to the
finish_segs label.

Commit 177b800746 ("net: netem: fix backlog accounting for
corrupted GSO frames") started using the pointer to the first
segment in the "rest of segments processing", but as mentioned
above the first segment may had already been freed at this point.

Backlog corrections for parent qdiscs have to be adjusted.

Fixes: 177b800746 ("net: netem: fix backlog accounting for corrupted GSO frames")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:12:35 -07:00
Michael Tretter bd310aca44 macb: propagate errors when getting optional clocks
The tx_clk, rx_clk, and tsu_clk are optional. Currently the macb driver
marks clock as not available if it receives an error when trying to get
a clock. This is wrong, because a clock controller might return
-EPROBE_DEFER if a clock is not available, but will eventually become
available.

In these cases, the driver would probe successfully but will never be
able to adjust the clocks, because the clocks were not available during
probe, but became available later.

For example, the clock controller for the ZynqMP is implemented in the
PMU firmware and the clocks are only available after the firmware driver
has been probed.

Use devm_clk_get_optional() in instead of devm_clk_get() to get the
optional clock and propagate all errors to the calling function.

Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 11:58:39 -07:00
Juergen Gross 3d5c1a037d xen/netback: fix error path of xenvif_connect_data()
xenvif_connect_data() calls module_put() in case of error. This is
wrong as there is no related module_get().

Remove the superfluous module_put().

Fixes: 279f438e36 ("xen-netback: Don't destroy the netdev until the vif is shut down")
Cc: <stable@vger.kernel.org> # 3.12
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 11:43:29 -07:00
Yonglong Liu 580a05f9d4 net: hns3: fix mis-counting IRQ vector numbers issue
Currently, the num_msi_left means the vector numbers of NIC,
but if the PF supported RoCE, it contains the vector numbers
of NIC and RoCE(Not expected).

This may cause interrupts lost in some case, because of the
NIC module used the vector resources which belongs to RoCE.

This patch adds a new variable num_nic_msi to store the vector
numbers of NIC, and adjust the default TQP numbers and rss_size
according to the value of num_nic_msi.

Fixes: 46a3df9f97 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 11:40:55 -07:00
Linus Torvalds 998d75510e Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Rather a lot of fixes, almost all affecting mm/"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  scripts/gdb: fix debugging modules on s390
  kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
  mm/thp: allow dropping THP from page cache
  mm/vmscan.c: support removing arbitrary sized pages from mapping
  mm/thp: fix node page state in split_huge_page_to_list()
  proc/meminfo: fix output alignment
  mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
  mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
  mm: include <linux/huge_mm.h> for is_vma_temporary_stack
  zram: fix race between backing_dev_show and backing_dev_store
  mm/memcontrol: update lruvec counters in mem_cgroup_move_account
  ocfs2: fix panic due to ocfs2_wq is null
  hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
  mm: memblock: do not enforce current limit for memblock_phys* family
  mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
  mm/gup: fix a misnamed "write" argument, and a related bug
  mm/gup_benchmark: add a missing "w" to getopt string
  ocfs2: fix error handling in ocfs2_setattr()
  mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
  mm/memunmap: don't access uninitialized memmap in memunmap_pages()
  ...
2019-10-19 06:53:59 -04:00
Ilya Leoshkevich 585d730d41 scripts/gdb: fix debugging modules on s390
Currently lx-symbols assumes that module text is always located at
module->core_layout->base, but s390 uses the following layout:

  +------+  <- module->core_layout->base
  | GOT  |
  +------+  <- module->core_layout->base + module->arch->plt_offset
  | PLT  |
  +------+  <- module->core_layout->base + module->arch->plt_offset +
  | TEXT |     module->arch->plt_size
  +------+

Therefore, when trying to debug modules on s390, all the symbol
addresses are skewed by plt_offset + plt_size.

Fix by adding plt_offset + plt_size to module_addr in
load_module_symbols().

Link: http://lkml.kernel.org/r/20191017085917.81791-1-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Song Liu aa5de305c9 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries.  On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe.  This
is because perf_event based trace_uprobe calls uprobe_unregister twice
on close: first in TRACE_REG_PERF_CLOSE, then in
TRACE_REG_PERF_UNREGISTER.  The second call will split the PMD mapped
page table entry, which is not the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
Fixes: 5a52c9df62 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Kirill A. Shutemov ef18a1ca84 mm/thp: allow dropping THP from page cache
Once a THP is added to the page cache, it cannot be dropped via
/proc/sys/vm/drop_caches.  Fix this issue with proper handling in
invalidate_mapping_pages().

Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
William Kucharski 906d278d75 mm/vmscan.c: support removing arbitrary sized pages from mapping
__remove_mapping() assumes that pages can only be either base pages or
HPAGE_PMD_SIZE.  Ask the page what size it is.

Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Kirill A. Shutemov 06d3eff62d mm/thp: fix node page state in split_huge_page_to_list()
Make sure split_huge_page_to_list() handles the state of shmem THP and
file THP properly.

Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Kirill A. Shutemov 2be5fbf9a9 proc/meminfo: fix output alignment
Patch series "Fixes for THP in page cache", v2.

This patch (of 5):

Add extra space for FileHugePages and FilePmdMapped, so the output is
aligned with other rows.

Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Ben Dooks (Codethink) a2ae8c0551 mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
mm_init.c needs to include <linux/mman.h> for the definition of
vm_committed_as_batch.  Fixes the following sparse warning:

  mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Ben Dooks d0e6a5821c mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
fix the following warning:

  mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Ben Dooks 444f84fd2a mm: include <linux/huge_mm.h> for is_vma_temporary_stack
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
to fix the following sparse warning:

  mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Chenwandun f7daefe423 zram: fix race between backing_dev_show and backing_dev_store
CPU0:				       CPU1:
backing_dev_show		       backing_dev_store
    ......				   ......
    file = zram->backing_dev;
    down_read(&zram->init_lock);	   down_read(&zram->init_init_lock)
    file_path(file, ...);		   zram->backing_dev = backing_dev;
    up_read(&zram->init_lock);		   up_read(&zram->init_lock);

gets the value of zram->backing_dev too early in backing_dev_show, which
resultin the value being NULL at the beginning, and not NULL later.

backtrace:
  d_path+0xcc/0x174
  file_path+0x10/0x18
  backing_dev_show+0x40/0xb4
  dev_attr_show+0x20/0x54
  sysfs_kf_seq_show+0x9c/0x10c
  kernfs_seq_show+0x28/0x30
  seq_read+0x184/0x488
  kernfs_fop_read+0x5c/0x1a4
  __vfs_read+0x44/0x128
  vfs_read+0xa0/0x138
  SyS_read+0x54/0xb4

Link: http://lkml.kernel.org/r/1571046839-16814-1-git-send-email-chenwandun@huawei.com
Signed-off-by: Chenwandun <chenwandun@huawei.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Konstantin Khlebnikov ae8af4388d mm/memcontrol: update lruvec counters in mem_cgroup_move_account
Mapped, dirty and writeback pages are also counted in per-lruvec stats.
These counters needs update when page is moved between cgroups.

Currently is nobody *consuming* the lruvec versions of these counters and
that there is no user-visible effect.

Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
Fixes: 00f3ca2c2d ("mm: memcontrol: per-lruvec stats infrastructure")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Yi Li b918c43021 ocfs2: fix panic due to ocfs2_wq is null
mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters
an error.  ocfs2_initialize_super() returns before allocating ocfs2_wq.
ocfs2_dismount_volume() triggers the following panic.

  Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
  Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
  ------------[ cut here ]------------
  Oops: 0002 [#1] SMP NOPTI
  CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G  E
        4.14.148-200.ckv.x86_64 #1
  Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
  task: ffff967af0520000 task.stack: ffffa5f05484000
  RIP: 0010:mutex_lock+0x19/0x20
  Call Trace:
    flush_workqueue+0x81/0x460
    ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
    ocfs2_dismount_volume+0x84/0x400 [ocfs2]
    ocfs2_fill_super+0xa4/0x1270 [ocfs2]
    ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
    mount_bdev+0x17f/0x1c0
    mount_fs+0x3a/0x160

Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
Signed-off-by: Yi Li <yilikernel@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
David Hildenbrand f231fe4235 hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
Uninitialized memmaps contain garbage and in the worst case trigger
kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
touched.

Let's make sure that we only consider online memory (managed by the
buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.

page_zone() will call page_to_nid(), which will trigger
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
can be the case when an offline memory block (e.g., never onlined) is
spanned by a zone.

Note: As explained by Michal in [1], alloc_contig_range() will verify
the range.  So it boils down to the wrong access in this function.

[1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz

Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: <stable@vger.kernel.org>	[4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Mike Rapoport f3057ad767 mm: memblock: do not enforce current limit for memblock_phys* family
Until commit 92d12f9544 ("memblock: refactor internal allocation
functions") the maximal address for memblock allocations was forced to
memblock.current_limit only for the allocation functions returning
virtual address.  The changes introduced by that commit moved the limit
enforcement into the allocation core and as a result the allocation
functions returning physical address also started to limit allocations
to memblock.current_limit.

This caused breakage of etnaviv GPU driver:

  etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
  etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
  etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
  etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
  etnaviv-gpu 130000.gpu: command buffer outside valid memory window
  etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
  etnaviv-gpu 134000.gpu: command buffer outside valid memory window
  etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
  etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0

Restore the behaviour of memblock_phys* family so that these functions
will not enforce memblock.current_limit.

Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
Fixes: 92d12f9544 ("memblock: refactor internal allocation functions")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Adam Ford <aford173@gmail.com>
Tested-by: Adam Ford <aford173@gmail.com>	[imx6q-logicpd]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Honglei Wang b11edebbc9 mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
Commit 1a61ab8038 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") has made lruvec_page_state to use per-cpu counters
instead of calculating it directly from lru_zone_size with an idea that
this would be more effective.

Tim has reported that this is not really the case for their database
benchmark which is showing an opposite results where lruvec_page_state
is taking up a huge chunk of CPU cycles (about 25% of the system time
which is roughly 7% of total cpu cycles) on 5.3 kernels.  The workload
is running on a larger machine (96cpus), it has many cgroups (500) and
it is heavily direct reclaim bound.

Tim Chen said:

: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
:
:
: 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
:             |
:             |--3.07%--lruvec_lru_size
:             |          |
:             |          |--2.11%--cpumask_next
:             |          |          |
:             |          |           --1.66%--find_next_bit
:             |          |
:             |           --0.57%--call_function_interrupt
:             |                     |
:             |                      --0.55%--smp_call_function_interrupt
:             |
:             |--1.59%--0x441f0fc3d009
:             |          _ops_rdtsc_init_base_freq
:             |          access_histogram
:             |          page_fault
:             |          __do_page_fault
:             |          handle_mm_fault
:             |          __handle_mm_fault
:             |          |
:             |           --1.54%--do_swap_page
:             |                     swapin_readahead
:             |                     swap_cluster_readahead
:             |                     |
:             |                      --1.53%--read_swap_cache_async
:             |                                __read_swap_cache_async
:             |                                alloc_pages_vma
:             |                                __alloc_pages_nodemask
:             |                                __alloc_pages_slowpath
:             |                                try_to_free_pages
:             |                                do_try_to_free_pages
:             |                                shrink_node
:             |                                shrink_node_memcg
:             |                                |
:             |                                |--0.77%--lruvec_lru_size
:             |                                |
:             |                                 --0.76%--inactive_list_is_low
:             |                                           |
:             |                                            --0.76%--lruvec_lru_size
:             |
:              --1.50%--measure_read
:                        page_fault
:                        __do_page_fault
:                        handle_mm_fault
:                        __handle_mm_fault
:                        do_swap_page
:                        swapin_readahead
:                        swap_cluster_readahead
:                        |
:                         --1.48%--read_swap_cache_async
:                                   __read_swap_cache_async
:                                   alloc_pages_vma
:                                   __alloc_pages_nodemask
:                                   __alloc_pages_slowpath
:                                   try_to_free_pages
:                                   do_try_to_free_pages
:                                   shrink_node
:                                   shrink_node_memcg
:                                   |
:                                   |--0.75%--inactive_list_is_low
:                                   |          |
:                                   |           --0.75%--lruvec_lru_size
:                                   |
:                                    --0.73%--lruvec_lru_size

The likely culprit is the cache traffic the lruvec_page_state_local
generates.  Dave Hansen says:

: I was thinking purely of the cache footprint.  If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles

Fix the regression by partially reverting the said commit and calculate
the lru size explicitly.

Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00