Commit Graph

841319 Commits

Author SHA1 Message Date
Linus Torvalds 29f785ff76 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "MS_MOVE regression fix + breakage in fsmount(2) (also introduced in
  this cycle, along with fsmount(2) itself).

  I'm still digging through the piles of mail, so there might be more
  fixes to follow, but these two are obvious and self-contained, so
  there's no point delaying those..."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namespace: fix unprivileged mount propagation
  vfs: fsmount: add missing mntget()
2019-06-17 16:28:28 -07:00
Linus Torvalds da0f382029 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...
2019-06-17 15:55:34 -07:00
Christian Brauner d728cf7916 fs/namespace: fix unprivileged mount propagation
When propagating mounts across mount namespaces owned by different user
namespaces it is not possible anymore to move or umount the mount in the
less privileged mount namespace.

Here is a reproducer:

  sudo mount -t tmpfs tmpfs /mnt
  sudo --make-rshared /mnt

  # create unprivileged user + mount namespace and preserve propagation
  unshare -U -m --map-root --propagation=unchanged

  # now change back to the original mount namespace in another terminal:
  sudo mkdir /mnt/aaa
  sudo mount -t tmpfs tmpfs /mnt/aaa

  # now in the unprivileged user + mount namespace
  mount --move /mnt/aaa /opt

Unfortunately, this is a pretty big deal for userspace since this is
e.g. used to inject mounts into running unprivileged containers.
So this regression really needs to go away rather quickly.

The problem is that a recent change falsely locked the root of the newly
added mounts by setting MNT_LOCKED. Fix this by only locking the mounts
on copy_mnt_ns() and not when adding a new mount.

Fixes: 3bd045cc9c ("separate copying and locking mount tree on cross-userns copies")
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Tested-by: Christian Brauner <christian@brauner.io>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-17 17:36:09 -04:00
Eric Biggers 1b0b9cc8d3 vfs: fsmount: add missing mntget()
sys_fsmount() needs to take a reference to the new mount when adding it
to the anonymous mount namespace.  Otherwise the filesystem can be
unmounted while it's still in use, as found by syzkaller.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: syzbot+99de05d099a170867f22@syzkaller.appspotmail.com
Reported-by: syzbot+7008b8b8ba7df475fdc8@syzkaller.appspotmail.com
Fixes: 93766fbd26 ("vfs: syscall: Add fsmount() to create a mount for a superblock")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-17 17:36:07 -04:00
David S. Miller 4fddbf8a99 Merge branch 'tcp-fixes'
Eric Dumazet says:

====================
tcp: make sack processing more robust

Jonathan Looney brought to our attention multiple problems
in TCP stack at the sender side.

SACK processing can be abused by malicious peers to either
cause overflows, or increase of memory usage.

First two patches fix the immediate problems.

Since the malicious peers abuse senders by advertizing a very
small MSS in their SYN or SYNACK packet, the last two
patches add a new sysctl so that admins can chose a higher
limit for MSS clamping.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 10:39:56 -07:00
Linus Torvalds eb7c825bf7 RISC-V patches for v5.2-rc6
This tag contains fixes, defconfig, and DT data changes for the v5.2-rc
 series.  The fixes are relatively straightforward:
 
 - Addition of a TLB fence in the vmalloc_fault path, so the CPU doesn't
   enter an infinite page fault loop;
 - Readdition of the pm_power_off export, so device drivers that
   reassign it can now be built as modules;
 - A udelay() fix for RV32, fixing a miscomputation of the delay time;
 - Removal of deprecated smp_mb__*() barriers.
 
 The tag also adds initial DT data infrastructure for arch/riscv, along
 with initial data for the SiFive FU540-C000 SoC and the corresponding
 HiFive Unleashed board.
 
 We also update the RV64 defconfig to include some core drivers for the
 FU540 in the build.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElRDoIDdEz9/svf2Kx4+xDQu9KksFAl0HtEkACgkQx4+xDQu9
 KkuRIw//f2vSrUyMh44sevr6euVD0K++hQ0AbteQ94cGHqYWWaNxfwMHFD91Gxbj
 wowTwgssq7H9nePsKANjiiLULnZNIkWXAlIncjzv3aXkH6JG3f9nEGR49yzvCbIZ
 yN8wgElJ8rcVWLd096E53Su84CzxuJJ2o3wOI1nQi8aI4h3LwkM2b/O4GxZFpnWb
 vIhWXqjvbUb8XL7Y+VPewtxnZItOUDHkuIkup4kP2bTgl2iDW93hzWwxNKbt6v+m
 9wTzAChjcepCAXSmEGeeZ/h2HNqw2crs+NWOe0drcKxL2vKPZ6gS8ZRX/NuIoDr4
 JgMILzYSO28z8N6w1cJJUdN4eGhCTvdxVTQXvkk/yZoT08X6M0xb5A1MbtizgOJ6
 mZK/vM9gtuoUSZG0SRNeNoqHbWu1tIm29z435Be8hWAtzXlEfewJm8ntgFO4dGmb
 E8TRSgjLzdHY0Nvwx/KVtvYmE/TMybVVRsxJJ525dqJlHT7f3VuRstvw7VQJQpz2
 +JfsZbYk1KjbUc25QpAqF1LUxrRQFn2JL0Cqw+L49J8eshY77rsTcAKP6ZZWiSFZ
 qodU0oPF4BkS1t0bnFuNwlqsAr/q9EiAnQO7+SvqQY/ZUnMNk9gCNn5k/rHMCfyD
 2Dyo6iAbj+Yyb1rrQxX6QnlbHgpFxsG3N4s9E5jOPgKyEQM4JQ4=
 =aotJ
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-v5.2/fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Paul Walmsley:
 "This contains fixes, defconfig, and DT data changes for the v5.2-rc
  series.

  The fixes are relatively straightforward:

   - Addition of a TLB fence in the vmalloc_fault path, so the CPU
     doesn't enter an infinite page fault loop

   - Readdition of the pm_power_off export, so device drivers that
     reassign it can now be built as modules

   - A udelay() fix for RV32, fixing a miscomputation of the delay time

   - Removal of deprecated smp_mb__*() barriers

  This also adds initial DT data infrastructure for arch/riscv, along
  with initial data for the SiFive FU540-C000 SoC and the corresponding
  HiFive Unleashed board.

  We also update the RV64 defconfig to include some core drivers for the
  FU540 in the build"

* tag 'riscv-for-v5.2/fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: remove unused barrier defines
  riscv: mm: synchronize MMU after pte change
  riscv: dts: add initial board data for the SiFive HiFive Unleashed
  riscv: dts: add initial support for the SiFive FU540-C000 SoC
  dt-bindings: riscv: convert cpu binding to json-schema
  dt-bindings: riscv: sifive: add YAML documentation for the SiFive FU540
  arch: riscv: add support for building DTB files from DT source data
  riscv: Fix udelay in RV32.
  riscv: export pm_power_off again
  RISC-V: defconfig: enable clocks, serial console
2019-06-17 10:34:03 -07:00
Rolf Eike Beer 259931fd3b riscv: remove unused barrier defines
They were introduced in commit fab957c11e ("RISC-V: Atomic and
Locking Code") long after commit 2e39465abc ("locking: Remove
deprecated smp_mb__() barriers") removed the remnants of all previous
instances from the tree.

Signed-off-by: Rolf Eike Beer <eb@emlix.com>
[paul.walmsley@sifive.com: stripped spurious mbox header from patch
 description; fixed commit references in patch header]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-06-17 07:09:43 -07:00
ShihPo Hung bf587caae3 riscv: mm: synchronize MMU after pte change
Because RISC-V compliant implementations can cache invalid entries
in TLB, an SFENCE.VMA is necessary after changes to the page table.
This patch adds an SFENCE.vma for the vmalloc_fault path.

Signed-off-by: ShihPo Hung <shihpo.hung@sifive.com>
[paul.walmsley@sifive.com: reversed tab->whitespace conversion,
 wrapped comment lines]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-riscv@lists.infradead.org
Cc: stable@vger.kernel.org
2019-06-17 03:44:44 -07:00
Paul Walmsley c35f1b87fc riscv: dts: add initial board data for the SiFive HiFive Unleashed
Add initial board data for the SiFive HiFive Unleashed A00.

Currently the data populated in this DT file describes the board
DRAM configuration and the external clock sources that supply the
PRCI.

Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Tested-by: Loys Ollivier <lollivier@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Antony Pavlov <antonynpavlov@gmail.com>
Cc: devicetree@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
2019-06-17 02:04:10 -07:00
Paul Walmsley 72296bde4f riscv: dts: add initial support for the SiFive FU540-C000 SoC
Add initial support for the SiFive FU540-C000 SoC.  This is a 28nm SoC
based around the SiFive U54-MC core complex and a TileLink
interconnect.

This file is expected to grow as more device drivers are added to the
kernel.

This patch includes a fix to the QSPI memory map due to a
documentation bug, found by ShihPo Hung <shihpo.hung@sifive.com>, adds
entries for the I2C controller, and merges all DT changes that
formerly were made dynamically by the riscv-pk BBL proxy kernel.

Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Tested-by: Loys Ollivier <lollivier@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: ShihPo Hung <shihpo.hung@sifive.com>
Cc: devicetree@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
2019-06-17 02:04:05 -07:00
Paul Walmsley 4fd669a8c4 dt-bindings: riscv: convert cpu binding to json-schema
At Rob's request, we're starting to migrate our DT binding
documentation to json-schema YAML format.  Start by converting our cpu
binding documentation.  While doing so, document more properties and
nodes.  This includes adding binding documentation support for the E51
and U54 CPU cores ("harts") that are present on this SoC.  These cores
are described in:

    https://static.dev.sifive.com/FU540-C000-v1.0.pdf

This cpus.yaml file is intended to be a starting point and to
evolve over time.  It passes dt-doc-validate as of the yaml-bindings
commit 4c79d42e9216.

This patch was originally based on the ARM json-schema binding
documentation as added by commit 672951cbd1 ("dt-bindings: arm: Convert
cpu binding to json-schema").

Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: devicetree@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
2019-06-17 02:03:58 -07:00
Paul Walmsley c7af559817 dt-bindings: riscv: sifive: add YAML documentation for the SiFive FU540
Add YAML DT binding documentation for the SiFive FU540 SoC.  This
SoC is documented at:

    https://static.dev.sifive.com/FU540-C000-v1.0.pdf

Passes dt-doc-validate, as of yaml-bindings commit 4c79d42e9216.

Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: devicetree@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
2019-06-17 02:03:52 -07:00
Paul Walmsley 8d4e048d60 arch: riscv: add support for building DTB files from DT source data
Similar to ARM64, add support for building DTB files from DT source
data for RISC-V boards.

This patch starts with the infrastructure needed for SiFive boards.
Boards from other vendors would add support here in a similar form.

Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Tested-by: Loys Ollivier <lollivier@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
2019-06-17 02:03:40 -07:00
Jeremy Sowden 6be8e297f9 lapb: fixed leak of control-blocks.
lapb_register calls lapb_create_cb, which initializes the control-
block's ref-count to one, and __lapb_insert_cb, which increments it when
adding the new block to the list of blocks.

lapb_unregister calls __lapb_remove_cb, which decrements the ref-count
when removing control-block from the list of blocks, and calls lapb_put
itself to decrement the ref-count before returning.

However, lapb_unregister also calls __lapb_devtostruct to look up the
right control-block for the given net_device, and __lapb_devtostruct
also bumps the ref-count, which means that when lapb_unregister returns
the ref-count is still 1 and the control-block is leaked.

Call lapb_put after __lapb_devtostruct to fix leak.

Reported-by: syzbot+afb980676c836b4a0afa@syzkaller.appspotmail.com
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 20:44:20 -07:00
Xin Long 5cf02612b3 tipc: purge deferredq list for each grp member in tipc_group_delete
Syzbot reported a memleak caused by grp members' deferredq list not
purged when the grp is be deleted.

The issue occurs when more(msg_grp_bc_seqno(hdr), m->bc_rcv_nxt) in
tipc_group_filter_msg() and the skb will stay in deferredq.

So fix it by calling __skb_queue_purge for each member's deferredq
in tipc_group_delete() when a tipc sk leaves the grp.

Fixes: b87a5ea31c ("tipc: guarantee group unicast doesn't bypass group broadcast")
Reported-by: syzbot+78fbe679c8ca8d264a8d@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 20:42:05 -07:00
Eric Dumazet d4d5d8e83c ax25: fix inconsistent lock state in ax25_destroy_timer
Before thread in process context uses bh_lock_sock()
we must disable bh.

sysbot reported :

WARNING: inconsistent lock state
5.2.0-rc3+ #32 Not tainted

inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
blkid/26581 [HC0[0]:SC1[1]:HE1:SE0] takes:
00000000e0da85ee (slock-AF_AX25){+.?.}, at: spin_lock include/linux/spinlock.h:338 [inline]
00000000e0da85ee (slock-AF_AX25){+.?.}, at: ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303
  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
  _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
  spin_lock include/linux/spinlock.h:338 [inline]
  ax25_rt_autobind+0x3ca/0x720 net/ax25/ax25_route.c:429
  ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1221
  __sys_connect+0x264/0x330 net/socket.c:1834
  __do_sys_connect net/socket.c:1845 [inline]
  __se_sys_connect net/socket.c:1842 [inline]
  __x64_sys_connect+0x73/0xb0 net/socket.c:1842
  do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
irq event stamp: 2272
hardirqs last  enabled at (2272): [<ffffffff810065f3>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (2271): [<ffffffff8100660f>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (1522): [<ffffffff87400654>] __do_softirq+0x654/0x94c kernel/softirq.c:320
softirqs last disabled at (2267): [<ffffffff81449010>] invoke_softirq kernel/softirq.c:374 [inline]
softirqs last disabled at (2267): [<ffffffff81449010>] irq_exit+0x180/0x1d0 kernel/softirq.c:414

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_AX25);
  <Interrupt>
    lock(slock-AF_AX25);

 *** DEADLOCK ***

1 lock held by blkid/26581:
 #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:175 [inline]
 #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: call_timer_fn+0xe0/0x720 kernel/time/timer.c:1312

stack backtrace:
CPU: 1 PID: 26581 Comm: blkid Not tainted 5.2.0-rc3+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_usage_bug.cold+0x393/0x4a2 kernel/locking/lockdep.c:2935
 valid_state kernel/locking/lockdep.c:2948 [inline]
 mark_lock_irq kernel/locking/lockdep.c:3138 [inline]
 mark_lock+0xd46/0x1370 kernel/locking/lockdep.c:3513
 mark_irqflags kernel/locking/lockdep.c:3391 [inline]
 __lock_acquire+0x159f/0x5490 kernel/locking/lockdep.c:3745
 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:338 [inline]
 ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275
 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
 expire_timers kernel/time/timer.c:1366 [inline]
 __run_timers kernel/time/timer.c:1685 [inline]
 __run_timers kernel/time/timer.c:1653 [inline]
 run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 </IRQ>
RIP: 0033:0x7f858d5c3232
Code: 8b 61 08 48 8b 84 24 d8 00 00 00 4c 89 44 24 28 48 8b ac 24 d0 00 00 00 4c 8b b4 24 e8 00 00 00 48 89 7c 24 68 48 89 4c 24 78 <48> 89 44 24 58 8b 84 24 e0 00 00 00 89 84 24 84 00 00 00 8b 84 24
RSP: 002b:00007ffcaf0cf5c0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: 00007f858d7d27a8 RBX: 00007f858d7d8820 RCX: 00007f858d3940d8
RDX: 00007ffcaf0cf798 RSI: 00000000f5e616f3 RDI: 00007f858d394fee
RBP: 0000000000000000 R08: 00007ffcaf0cf780 R09: 00007f858d7db480
R10: 0000000000000000 R11: 0000000009691a75 R12: 0000000000000005
R13: 00000000f5e616f3 R14: 0000000000000000 R15: 00007ffcaf0cf798

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:22:37 -07:00
Eric Dumazet f3e92cb8e2 neigh: fix use-after-free read in pneigh_get_next
Nine years ago, I added RCU handling to neighbours, not pneighbours.
(pneigh are not commonly used)

Unfortunately I missed that /proc dump operations would use a
common entry and exit point : neigh_seq_start() and neigh_seq_stop()

We need to read_lock(tbl->lock) or risk use-after-free while
iterating the pneigh structures.

We might later convert pneigh to RCU and revert this patch.

sysbot reported :

BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825

CPU: 1 PID: 9825 Comm: syz-executor.0 Not tainted 5.2.0-rc4+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
 neigh_seq_next+0xdb/0x210 net/core/neighbour.c:3240
 seq_read+0x9cf/0x1110 fs/seq_file.c:258
 proc_reg_read+0x1fc/0x2c0 fs/proc/inode.c:221
 do_loop_readv_writev fs/read_write.c:714 [inline]
 do_loop_readv_writev fs/read_write.c:701 [inline]
 do_iter_read+0x4a4/0x660 fs/read_write.c:935
 vfs_readv+0xf0/0x160 fs/read_write.c:997
 kernel_readv fs/splice.c:359 [inline]
 default_file_splice_read+0x475/0x890 fs/splice.c:414
 do_splice_to+0x127/0x180 fs/splice.c:877
 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:954
 do_splice_direct+0x1da/0x2a0 fs/splice.c:1063
 do_sendfile+0x597/0xd00 fs/read_write.c:1464
 __do_sys_sendfile64 fs/read_write.c:1525 [inline]
 __se_sys_sendfile64 fs/read_write.c:1511 [inline]
 __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4592c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f4aab51dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004592c9
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000080000000 R11: 0000000000000246 R12: 00007f4aab51e6d4
R13: 00000000004c689d R14: 00000000004db828 R15: 00000000ffffffff

Allocated by task 9827:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 __do_kmalloc mm/slab.c:3660 [inline]
 __kmalloc+0x15c/0x740 mm/slab.c:3669
 kmalloc include/linux/slab.h:552 [inline]
 pneigh_lookup+0x19c/0x4a0 net/core/neighbour.c:731
 arp_req_set_public net/ipv4/arp.c:1010 [inline]
 arp_req_set+0x613/0x720 net/ipv4/arp.c:1026
 arp_ioctl+0x652/0x7f0 net/ipv4/arp.c:1226
 inet_ioctl+0x2a0/0x340 net/ipv4/af_inet.c:926
 sock_do_ioctl+0xd8/0x2f0 net/socket.c:1043
 sock_ioctl+0x3ed/0x780 net/socket.c:1194
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9824:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 pneigh_ifdown_and_unlock net/core/neighbour.c:812 [inline]
 __neigh_ifdown+0x236/0x2f0 net/core/neighbour.c:356
 neigh_ifdown+0x20/0x30 net/core/neighbour.c:372
 arp_ifdown+0x1d/0x21 net/ipv4/arp.c:1274
 inetdev_destroy net/ipv4/devinet.c:319 [inline]
 inetdev_event+0xa14/0x11f0 net/ipv4/devinet.c:1544
 notifier_call_chain+0xc2/0x230 kernel/notifier.c:95
 __raw_notifier_call_chain kernel/notifier.c:396 [inline]
 raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:403
 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1749
 call_netdevice_notifiers_extack net/core/dev.c:1761 [inline]
 call_netdevice_notifiers net/core/dev.c:1775 [inline]
 rollback_registered_many+0x9b9/0xfc0 net/core/dev.c:8178
 rollback_registered+0x109/0x1d0 net/core/dev.c:8220
 unregister_netdevice_queue net/core/dev.c:9267 [inline]
 unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9260
 unregister_netdevice include/linux/netdevice.h:2631 [inline]
 __tun_detach+0xd8a/0x1040 drivers/net/tun.c:724
 tun_detach drivers/net/tun.c:741 [inline]
 tun_chr_close+0xe0/0x180 drivers/net/tun.c:3451
 __fput+0x2ff/0x890 fs/file_table.c:280
 ____fput+0x16/0x20 fs/file_table.c:313
 task_work_run+0x145/0x1c0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:168
 prepare_exit_to_usermode arch/x86/entry/common.c:199 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:279 [inline]
 do_syscall_64+0x58e/0x680 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888097f2a700
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 0 bytes inside of
 64-byte region [ffff888097f2a700, ffff888097f2a740)
The buggy address belongs to the page:
page:ffffea00025fca80 refcount:1 mapcount:0 mapping:ffff8880aa400340 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea000250d548 ffffea00025726c8 ffff8880aa400340
raw: 0000000000000000 ffff888097f2a000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
 ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                   ^
 ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc

Fixes: 767e97e1e0 ("neigh: RCU conversion of struct neighbour")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:15:58 -07:00
Eric Dumazet 2e05fcae83 tcp: fix compile error if !CONFIG_SYSCTL
tcp_tx_skb_cache_key and tcp_rx_skb_cache_key must be available
even if CONFIG_SYSCTL is not set.

Fixes: 0b7d7f6b22 ("tcp: add tcp_tx_skb_cache sysctl")
Fixes: ede61ca474 ("tcp: add tcp_rx_skb_cache sysctl")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:15:07 -07:00
Dexuan Cui d424a2afd7 hv_sock: Suppress bogus "may be used uninitialized" warnings
gcc 8.2.0 may report these bogus warnings under some condition:

warning: ‘vnew’ may be used uninitialized in this function
warning: ‘hvs_new’ may be used uninitialized in this function

Actually, the 2 pointers are only initialized and used if the variable
"conn_from_host" is true. The code is not buggy here.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:00:51 -07:00
Ivan Vecera 718f4a2537 be2net: Fix number of Rx queues used for flow hashing
Number of Rx queues used for flow hashing returned by the driver is
incorrect and this bug prevents user to use the last Rx queue in
indirection table.

Let's say we have a NIC with 6 combined queues:

[root@sm-03 ~]# ethtool -l enp4s0f0
Channel parameters for enp4s0f0:
Pre-set maximums:
RX:             5
TX:             5
Other:          0
Combined:       6
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       6

Default indirection table maps all (6) queues equally but the driver
reports only 5 rings available.

[root@sm-03 ~]# ethtool -x enp4s0f0
RX flow hash indirection table for enp4s0f0 with 5 RX ring(s):
    0:      0     1     2     3     4     5     0     1
    8:      2     3     4     5     0     1     2     3
   16:      4     5     0     1     2     3     4     5
   24:      0     1     2     3     4     5     0     1
...

Now change indirection table somehow:

[root@sm-03 ~]# ethtool -X enp4s0f0 weight 1 1
[root@sm-03 ~]# ethtool -x enp4s0f0
RX flow hash indirection table for enp4s0f0 with 6 RX ring(s):
    0:      0     0     0     0     0     0     0     0
...
   64:      1     1     1     1     1     1     1     1
...

Now it is not possible to change mapping back to equal (default) state:

[root@sm-03 ~]# ethtool -X enp4s0f0 equal 6
Cannot set RX flow hash configuration: Invalid argument

Fixes: 594ad54a2c ("be2net: Add support for setting and getting rx flow hash options")
Reported-by: Tianhao <tizhao@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 13:47:41 -07:00
Govindarajulu Varadarajan 36b2f61a42 net: handle 802.1P vlan 0 packets properly
When stack receives pkt: [802.1P vlan 0][802.1AD vlan 100][IPv4],
vlan_do_receive() returns false if it does not find vlan_dev. Later
__netif_receive_skb_core() fails to find packet type handler for
skb->protocol 801.1AD and drops the packet.

801.1P header with vlan id 0 should be handled as untagged packets.
This patch fixes it by checking if vlan_id is 0 and processes next vlan
header.

Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 13:45:14 -07:00
Linus Torvalds 9e0babf2c0 Linux 5.2-rc5 2019-06-16 08:49:45 -10:00
Linus Torvalds 963172d9c7 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "The accumulated fixes from this and last week:

   - Fix vmalloc TLB flush and map range calculations which lead to
     stale TLBs, spurious faults and other hard to diagnose issues.

   - Use fault_in_pages_writable() for prefaulting the user stack in the
     FPU code as it's less fragile than the current solution

   - Use the PF_KTHREAD flag when checking for a kernel thread instead
     of current->mm as the latter can give the wrong answer due to
     use_mm()

   - Compute the vmemmap size correctly for KASLR and 5-Level paging.
     Otherwise this can end up with a way too small vmemmap area.

   - Make KASAN and 5-level paging work again by making sure that all
     invalid bits are masked out when computing the P4D offset. This
     worked before but got broken recently when the LDT remap area was
     moved.

   - Prevent a NULL pointer dereference in the resource control code
     which can be triggered with certain mount options when the
     requested resource is not available.

   - Enforce ordering of microcode loading vs. perf initialization on
     secondary CPUs. Otherwise perf tries to access a non-existing MSR
     as the boot CPU marked it as available.

   - Don't stop the resource control group walk early otherwise the
     control bitmaps are not updated correctly and become inconsistent.

   - Unbreak kgdb by returning 0 on success from
     kgdb_arch_set_breakpoint() instead of an error code.

   - Add more Icelake CPU model defines so depending changes can be
     queued in other trees"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
  x86/kasan: Fix boot with 5-level paging and KASAN
  x86/fpu: Don't use current->mm to check for a kthread
  x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
  x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
  x86/resctrl: Don't stop walking closids when a locksetup group is found
  x86/fpu: Update kernel's FPU state before using for the fsave header
  x86/mm/KASLR: Compute the size of the vmemmap section properly
  x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  x86/CPU: Add more Icelake model numbers
  mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
  mm/vmalloc: Fix calculation of direct map addr range
2019-06-16 07:28:14 -10:00
Linus Torvalds efba92d58f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A set of small fixes:

   - Repair the ktime_get_coarse() functions so they actually deliver
     what they are supposed to: tick granular time stamps. The current
     code missed to add the accumulated nanoseconds part of the
     timekeeper so the resulting granularity was 1 second.

   - Prevent the tracer from infinitely recursing into time getter
     functions in the arm architectured timer by marking these functions
     notrace

   - Fix a trivial compiler warning caused by wrong qualifier ordering"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Repair ktime_get_coarse*() granularity
  clocksource/drivers/arm_arch_timer: Don't trace count reader functions
  clocksource/drivers/timer-ti-dm: Change to new style declaration
2019-06-16 07:22:56 -10:00
Linus Torvalds f763cf8e47 Merge branch 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
 "Two small fixes for RAS:

   - Use a proper search algorithm to find the correct element in the
     CEC array. The replacement was a better choice than fixing the
     crash causes by the original search function with horrible duct
     tape.

   - Move the timer based decay function into thread context so it can
     actually acquire the mutex which protects the CEC array to prevent
     corruption"

* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  RAS/CEC: Convert the timer callback to a workqueue
  RAS/CEC: Fix binary search function
2019-06-16 07:19:15 -10:00
Eric Dumazet 967c05aee4 tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
If mtu probing is enabled tcp_mtu_probing() could very well end up
with a too small MSS.

Use the new sysctl tcp_min_snd_mss to make sure MSS search
is performed in an acceptable range.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet 5f3e2bf008 tcp: add tcp_min_snd_mss sysctl
Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.

This forces the stack to send packets with a very high network/cpu
overhead.

Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.

In some cases, it can be useful to increase the minimal value
to a saner value.

We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.

Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f1 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.

We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet f070ef2ac6 tcp: tcp_fragment() should apply sane memory limits
Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflow 32bit counters.

TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.

A new SNMP counter is added to monitor how many times TCP
did not allow to split an skb if the allowance was exceeded.

Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.

CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
	socket is already using more than half the allowed space

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet 3b4929f65b tcp: limit payload size of sacked skbs
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
David S. Miller 1eb4169c1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-06-15

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix stack layout of JITed x64 bpf code, from Alexei.

2) fix out of bounds memory access in bpf_sk_storage, from Arthur.

3) fix lpm trie walk, from Jonathan.

4) fix nested bpf_perf_event_output, from Matt.

5) and several other fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:19:47 -07:00
David S. Miller 5db2e7c791 Revert "net: phylink: set the autoneg state in phylink_phy_change"
This reverts commit ef7bfa8472.

Russell King espressed some strong opposition to this
change, explaining that this is trying to make phylink
behave outside of how it has been designed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:10:30 -07:00
Matt Mullins 9594dc3c7e bpf: fix nested bpf tracepoints with per-cpu data
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This enables three levels of nesting, to support
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-15 16:33:35 -07:00
Arthur Fabre 85749218e3 bpf: Fix out of bounds memory access in bpf_sk_storage
bpf_sk_storage maps use multiple spin locks to reduce contention.
The number of locks to use is determined by the number of possible CPUs.
With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.

When updating elements, the correct lock is determined with hash_ptr().
Calling hash_ptr() with 0 bits is undefined behavior, as it does:

x >> (64 - bits)

Using the value results in an out of bounds memory access.
In my case, this manifested itself as a page fault when raw_spin_lock_bh()
is called later, when running the self tests:

./tools/testing/selftests/bpf/test_verifier 773 775
[   16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8

Force the minimum number of locks to two.

Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-15 14:37:56 -07:00
Stephen Barber 42f5cda5ea vsock/virtio: set SOCK_DONE on peer shutdown
Set the SOCK_DONE flag to match the TCP_CLOSING state when a peer has
shut down and there is nothing left to read.

This fixes the following bug:
1) Peer sends SHUTDOWN(RDWR).
2) Socket enters TCP_CLOSING but SOCK_DONE is not set.
3) read() returns -ENOTCONN until close() is called, then returns 0.

Signed-off-by: Stephen Barber <smbarber@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 14:01:09 -07:00
Linus Walleij 760c80b70b net: dsa: rtl8366: Fix up VLAN filtering
We get this regression when using RTL8366RB as part of a bridge
with OpenWrt:

WARNING: CPU: 0 PID: 1347 at net/switchdev/switchdev.c:291
	 switchdev_port_attr_set_now+0x80/0xa4
lan0: Commit of attribute (id=7) failed.
(...)
realtek-smi switch lan0: failed to initialize vlan filtering on this port

This is because it is trying to disable VLAN filtering
on VLAN0, as we have forgot to add 1 to the port number
to get the right VLAN in rtl8366_vlan_filtering(): when
we initialize the VLAN we associate VLAN1 with port 0,
VLAN2 with port 1 etc, so we need to add 1 to the port
offset.

Fixes: d8652956cf ("net: dsa: realtek-smi: Add Realtek SMI driver")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 13:40:30 -07:00
Ioana Ciornei ef7bfa8472 net: phylink: set the autoneg state in phylink_phy_change
The phy_state field of phylink should carry only valid information
especially when this can be passed to the .mac_config callback.
Update the an_enabled field with the autoneg state in the
phylink_phy_change function.

Fixes: 9525ae8395 ("phylink: add phylink infrastructure")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 13:29:32 -07:00
Linus Torvalds e01e060fe0 platform-drivers-x86 for v5.2-3
Couple of driver enumeration issues have been fixed for Mellanox.
 ASUS laptops got a regression with backlight, which is fixed now.
 Dell computers got a wrong mode (tablet versus laptop) after resume,
 that is fixed now.
 
 The following is an automated git shortlog grouped by driver:
 
 asus-wmi:
  -  Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
 
 intel-vbtn:
  -  Report switch events when event wakes device
 
 mlx-platform:
  -  Fix parent device in i2c-mux-reg device registration
 
 platform/mellanox:
  -  mlxreg-hotplug: Add devm_free_irq call to remove flow
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEqaflIX74DDDzMJJtb7wzTHR8rCgFAl0FJxcACgkQb7wzTHR8
 rCirlhAAqtceFp9MwqM3EaHkgurK/zMdq3dw62/TKFQfK6n0MaxYeE6G43BI7vBX
 XOoqNzOH3vDELGrZaP1DfmXpd6xASngd57p2qElHUYVqqVHgkqORcuIG5VJpzKsY
 1+GRey4eWJ114G16tpGi/MH+i44ro+Y+WJV/F0qbZHRDANenW/VCl1EDWYYf4SYX
 9QGKtQWkmGKMzklorhIHiTmrg3HoH08tddR5zStTr88Db8/hErnCtp8Xfm48cJ2J
 4mLEXuta0bh2W9WbG2prFL14FE12iawLjDQH67r1XpYtXb+LN1n/f7nq+AgdSTbH
 raEYkmhARGbxUjwuW46hDUBHFMkgRY4VG+VyBeGgAZAFtSTbyV5XtQIoV1541sW5
 7R4hjta8Gb/6r40L6izAyy23KVjI9wygt++fpSwRtf3/77yICMWHq3Uel4RuX/gX
 AWiyrGoOPb3BVwk+Hx1ME6HNMupV0OLTdS+eS0nUH1MTdidawAZ+xM+Y7D7BPJKd
 116lHuGlQcMzvP6r+cGAQiptzQvRP+NzuSfbY7H7YY+11dQGm4XULWbFC6q28EgY
 42REhawu0PXpMWUQRPK7ZcqFKiqE+nJiKmAuRpzpC9nlr3aaPKP6KCU0eBbYbeyZ
 N4rOB8y6MBqm2HgFL9yh9l0tcnMF7o9P5NU/aztZEndL6/LMxQA=
 =/H6h
 -----END PGP SIGNATURE-----

Merge tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86

Pull x86 platform driver fixes from Andy Shevchenko:

 - fix a couple of Mellanox driver enumeration issues

 - fix ASUS laptop regression with backlight

 - fix Dell computers that got a wrong mode (tablet versus laptop) after
   resume

* tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86:
  platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow
  platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration
  platform/x86: intel-vbtn: Report switch events when event wakes device
  platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
2019-06-15 07:38:54 -10:00
Linus Torvalds ff39074b1d USB fixes for 5.2-rc5
Here are some small USB driver fixes for 5.2-rc5
 
 Nothing major, just some small gadget fixes, usb-serial new device ids,
 a few new quirks, and some small fixes for some regressions that have
 been found after the big 5.2-rc1 merge.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQUTIg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykPJQCfZdiElw3N35bKVf7bhruxSb5DevEAni96uDM7
 jSOSEB/EWqxZEy0ZjU7J
 =cKx4
 -----END PGP SIGNATURE-----

Merge tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes for 5.2-rc5

  Nothing major, just some small gadget fixes, usb-serial new device
  ids, a few new quirks, and some small fixes for some regressions that
  have been found after the big 5.2-rc1 merge.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: Make sure an alt mode exist before getting its partner
  usb: gadget: udc: lpc32xx: fix return value check in lpc32xx_udc_probe()
  usb: gadget: dwc2: fix zlp handling
  usb: dwc2: Set actual frame number for completed ISOC transfer for none DDMA
  usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC
  usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]
  usb: phy: mxs: Disable external charger detect in mxs_phy_hw_init()
  usb: dwc2: Fix DMA cache alignment issues
  usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)
  USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
  USB: usb-storage: Add new ID to ums-realtek
  usb: typec: ucsi: ccg: fix memory leak in do_flash
  USB: serial: option: add Telit 0x1260 and 0x1261 compositions
  USB: serial: pl2303: add Allied Telesis VT-Kit3
  USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode
2019-06-15 07:34:23 -10:00
Linus Torvalds fa1827d773 powerpc fixes for 5.2 #4
One fix for a regression introduced by our 32-bit KASAN support, which broke
 booting on machines with "bootx" early debugging enabled.
 
 A fix for a bug which broke kexec on 32-bit, introduced by changes to the 32-bit
 STRICT_KERNEL_RWX support in v5.1.
 
 Finally two fixes going to stable for our THP split/collapse handling,
 discovered by Nick. The first fixes random crashes and/or corruption in guests
 under sufficient load.
 
 Thanks to:
   Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu Malaterre.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdBOivAAoJEFHr6jzI4aWA/T0P/1pmr4JLWzXOPQOGOwS2mmu5
 HhXBFuDflpGiWS31syOhfKhiE2qlwdIcGaclSo1wAgUnMKp+sxVEagF6DEt484r3
 DXs3eRyrGu5vQT7Q6yReuT3Kw2ZR474a5ob00WGAQosBKyJF4ZHWz16ETVWMdAMQ
 TknEEU3hOUnMWWIEvLnZOKT7eJcmzj5IYy1OtLHBjWiHHizGC8PSxdVhiRcD/O6R
 F6C7XrFb7RRj5ran6gxwMbcTvjgu922TSQPOCw93qnXYWLfvWDUXC4yCqY21oHnr
 b3zgJNgIdSoMYxE8pfOH7Y+eaJrbzgnhlS5OJNEz/4NOfGQnJXYcSF8QO6eeVoKM
 L2SkT1Ov+QMmZQjMC5e9OAe7DFHfM59RYFg11eaUqfiaObRsmwu8rqjngITV5Ede
 Ydq3W39XQkjB3aQ4qb0MnEBbVgyQ/y6/T5hoHRlvnb5byFk5Pd3jpub56sq87UGM
 M4GD3YdmT5eBqzGxApddyIiS839PZpdw7g/Ivtp2GYCtiNNZcJqaXvN7IwfcVF3c
 YJrCfhNTUJPjICIL9k9v2RQovGuGZqtM3BHc8PmyVvDxTqtEbh8B9GEg+QC58xp/
 s+UI2sEd/Vg6NVKluIJHau7aEmJlaxYIshqsIhYvPgJRN6KrFMhdfPNx7D+zMlu8
 nbM6Z1n2VFk9Cxb0vA9K
 =MXJI
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for a regression introduced by our 32-bit KASAN support, which
  broke booting on machines with "bootx" early debugging enabled.

  A fix for a bug which broke kexec on 32-bit, introduced by changes to
  the 32-bit STRICT_KERNEL_RWX support in v5.1.

  Finally two fixes going to stable for our THP split/collapse handling,
  discovered by Nick. The first fixes random crashes and/or corruption
  in guests under sufficient load.

  Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
  Malaterre"

* tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
  powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
  powerpc/64s: Fix THP PMD collapse serialisation
  powerpc: Fix kexec failure on book3s/32
2019-06-15 07:29:32 -10:00
Linus Torvalds 6a71398c6a This includes the following fixes:
- Out of range read of stack trace output
  - Fix for NULL pointer dereference in trace_uprobe_create()
  - Fix to a livepatching / ftrace permission race in the module code
  - Fix for NULL pointer dereference in free_ftrace_func_mapper()
  - A couple of build warning clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXQToxhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qusmAP4/mmJPgsDchnu5ui0wB8BByzJlsPn8
 luXFDuqI4f34zgD+JCmeYbj5LLh98D9XkaaEgP4yz3yKsWeSdwWPCU0vTgo=
 =M/+E
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Out of range read of stack trace output

 - Fix for NULL pointer dereference in trace_uprobe_create()

 - Fix to a livepatching / ftrace permission race in the module code

 - Fix for NULL pointer dereference in free_ftrace_func_mapper()

 - A couple of build warning clean ups

* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
  module: Fix livepatch/ftrace module text permissions race
  tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
  tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
  tracing: Make two symbols static
  tracing: avoid build warning with HAVE_NOP_MCOUNT
  tracing: Fix out-of-range read in trace_stack_print()
2019-06-15 07:24:11 -10:00
Borislav Petkov 78f4e932f7 x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
Adric Blake reported the following warning during suspend-resume:

  Enabling non-boot CPUs ...
  x86: Booting SMP configuration:
  smpboot: Booting Node 0 Processor 1 APIC 0x2
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
   at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
  Call Trace:
   intel_set_tfa
   intel_pmu_cpu_starting
   ? x86_pmu_dead_cpu
   x86_pmu_starting_cpu
   cpuhp_invoke_callback
   ? _raw_spin_lock_irqsave
   notify_cpu_starting
   start_secondary
   secondary_startup_64
  microcode: sig=0x806ea, pf=0x80, revision=0x96
  microcode: updated to revision 0xb4, date = 2019-04-01
  CPU1 is up

The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
by microcode. The log above shows that the microcode loader callback
happens after the PMU restoration, leading to the conjecture that
because the microcode hasn't been updated yet, that MSR is not present
yet, leading to the #GP.

Add a microcode loader-specific hotplug vector which comes before
the PERF vectors and thus executes earlier and makes sure the MSR is
present.

Fixes: 400816f60c ("perf/x86/intel: Implement support for TSX Force Abort")
Reported-by: Adric Blake <promarbler14@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: x86@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637
2019-06-15 10:00:29 +02:00
Linus Torvalds 0011572c88 Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This has an unusually high density of tricky fixes:

   - task_get_css() could deadlock when it races against a dying cgroup.

   - cgroup.procs didn't list thread group leaders with live threads.

     This could mislead readers to think that a cgroup is empty when
     it's not. Fixed by making PROCS iterator include dead tasks. I made
     a couple mistakes making this change and this pull request contains
     a couple follow-up patches.

   - When cpusets run out of online cpus, it updates cpusmasks of member
     tasks in bizarre ways. Joel improved the behavior significantly"

* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: restore sanity to cpuset_cpus_allowed_fallback()
  cgroup: Fix css_task_iter_advance_css_set() cset skip condition
  cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
  cgroup: Include dying leaders with live threads in PROCS iterations
  cgroup: Implement css_task_iter_skip()
  cgroup: Call cgroup_release() before __exit_signal()
  docs cgroups: add another example size for hugetlb
  cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
2019-06-14 17:46:14 -10:00
Linus Torvalds 6aa7a22b97 drm-fixes for 5.2-rc5:
- fix regression on amdgpu on SI
 - fix edid override regression
 - driver fixes: amdgpu, i915, mediatek, meson, panfrost
 - fix writecombine for vmap in gem-shmem helper (used by panfrost)
 - add more panel quirks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAl0D4LAACgkQTA9ye/CY
 qnFePw/8DymJIm8hNIJ1hF65be0XVdQiKhggw76llBDAWv/Zc5LHZxK/4QJgwbrF
 ounfcJOnLWHyeMaoxO8AtLFZQc+4hlm6PMV3tkou8UwYPszcTK++wbvWiYmYn5/t
 q7MoLUf/kd1sngzFYJHEBDmNHFo34uPInc9YJX4Ufvz5fl6uIr+Yc2gLPOZsUhig
 PvcKSm7sJwNgUGJq7g0qPgLFaIwRL0FMVrPltTg533OkTBcQwAmC+cJVrx5AKxmo
 qJYyPy0iK3YNAz2KvMf4q+c8TGk4EpkExpqLD/6bHHiglD8PKMh9C5ER7iyhlzqo
 vgog+dD3cjFaTn+Fo1mX5UYo/rRl6DYantySKZcdfDVJ8VUo8HALeHs6vNE47WzD
 jDziMVRwgXs8uYFpQJM+E+V1XmuxmsJ+H21p+Lzv0BFxHDakTnPNY7Yibclgoz+v
 zMJP7uQbrMthGmM7VsVsHbiMARSak9UasbHoikUuMBhYG3pJ59jFpCwKlqTGfJ3+
 hKWvL5jnXpOOpLP1beJg4bpFn2seBkN1uNDllCz0gv+WBw7B8vJMBOQR4c29Oy4v
 fDD/scs/eG29btdLUMeA4cJQNbUNY2MC2hfde3gWJHO3V1GCFdGMfAlQh09b+qhg
 Q9RqiRmw4oZqAZLCmx5oa+7vf19SWjUqh36zfXOu3aTy/9nDa0Q=
 =4gYP
 -----END PGP SIGNATURE-----

Merge tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:
 "Nothing unsettling here, also not aware of anything serious still
  pending.

  The edid override regression fix took a bit longer since this seems to
  be an area with an overabundance of bad options. But the fix we have
  now seems like a good path forward.

  Next week it should be back to Dave.

  Summary:

   - fix regression on amdgpu on SI

   - fix edid override regression

   - driver fixes: amdgpu, i915, mediatek, meson, panfrost

   - fix writecombine for vmap in gem-shmem helper (used by panfrost)

   - add more panel quirks"

* tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm: (25 commits)
  drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware
  drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()
  drm: add fallback override/firmware EDID modes workaround
  drm/edid: abstract override/firmware EDID retrieval
  drm/i915/perf: fix whitelist on Gen10+
  drm/i915/sdvo: Implement proper HDMI audio support for SDVO
  drm/i915: Fix per-pixel alpha with CCS
  drm/i915/dmc: protect against reading random memory
  drm/i915/dsi: Use a fuzzy check for burst mode clock check
  drm/amdgpu/{uvd,vcn}: fetch ring's read_ptr after alloc
  drm/panfrost: Require the simple_ondemand governor
  drm/panfrost: make devfreq optional again
  drm/gem_shmem: Use a writecombine mapping for ->vaddr
  drm: panel-orientation-quirks: Add quirk for GPD MicroPC
  drm: panel-orientation-quirks: Add quirk for GPD pocket2
  drm/meson: fix G12A primary plane disabling
  drm/meson: fix primary plane disabling
  drm/meson: fix G12A HDMI PLL settings for 4K60 1000/1001 variations
  drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable()
  drm/mediatek: clear num_pipes when unbind driver
  ...
2019-06-14 17:34:45 -10:00
Linus Torvalds 4066524401 Fix rounding error in gfs2_iomap_page_prepare
-----BEGIN PGP SIGNATURE-----
 
 iQIbBAABAgAGBQJdA9AdAAoJENW/n+sDE2U67xgP+PnBqsdJpFVfodQyu+AEHdxJ
 et2nZmcsQGRXj5iha7lWI6Mu0x/7q9FOCS1I0aZDwcHn1A0R5R/xulKcnl6ySjrp
 bd6GnehP1KicufykAPr5lGHmiIIp/SdIDe7z6XUkV1Eksl6ls8WTWi+JEsP7uDjw
 TGQ6GTOziRUKHshuEkvvYfA3OhbebNzP1FjgPr6keg7PFeLAaiUWKxlTYj9Vov8F
 4YnOsdfFEFpiJy8w5N/Ggfmp1fbY+uyzoS70Xe+y7XTdkX57fPnzlCxLsji7PgAp
 AN38cpUHib46Ng8YyGfdB2wzJ5BAauYLy7zYGLiNdHXFvwkm4bP2CyeALyhG/Jg7
 HOBJ1owekaBV/o96NgyuJTfv65zdV0su7s24YBVJlY+aUOkZWJb2N4quyv/WyiTY
 Ds1xR4Uql6WOBt6fb4Ady+4lD0j01k5YMk56fcmHlykldVYXXeJXksS/0MxKBR5T
 qGWPHpldVXK6TZhRo8SkS/H2h6CdqevwxECa2YMrnWFVaioTo5jBtO6l5Nn7z0Ni
 qaELyzFdhDlnq8ywvmo1zBPT2NjF4m0XQkEiy5P1RvkxSq500zvOEVVCx8vXICY7
 TwUpCSRP5Tu9P9LurvgTf/Pltsf0pApKyYRIC4hu8/6jNxum1Rq0JZW2JepNDSJH
 qnQPowGLnhAJDWIfIPU=
 =Z7P5
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
 "Fix rounding error in gfs2_iomap_page_prepare"

* tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix rounding error in gfs2_iomap_page_prepare
2019-06-14 17:27:12 -10:00
David S. Miller 35fc07aee8 Merge branch 'tcp-add-three-static-keys'
Eric Dumazet says:

====================
tcp: add three static keys

Recent addition of per TCP socket rx/tx cache brought
regressions for some workloads, as reported by Feng Tang.

It seems better to make them opt-in, before we adopt better
heuristics.

The last patch adds high_order_alloc_disable sysctl
to ask TCP sendmsg() to exclusively use order-0 allocations,
as mm layer has specific optimizations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet ce27ec6064 net: add high_order_alloc_disable sysctl/static key
>From linux-3.7, (commit 5640f76858 "net: use a per task frag
allocator") TCP sendmsg() has preferred using order-3 allocations.

While it gives good results for most cases, we had reports
that heavy uses of TCP over loopback were hitting a spinlock
contention in page allocations/freeing.

This commits adds a sysctl so that admins can opt-in
for order-0 allocations. Hopefully mm layer might optimize
order-3 allocations in the future since it could give us
a nice boost  (see 8 lines of following benchmark)

The following benchmark shows a win when more than 8 TCP_STREAM
threads are running (56 x86 cores server in my tests)

for thr in {1..30}
do
 sysctl -wq net.core.high_order_alloc_disable=0
 T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
 sysctl -wq net.core.high_order_alloc_disable=1
 T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
 echo $thr:$T0:$T1
done

1: 49979: 37267
2: 98745: 76286
3: 141088: 110051
4: 177414: 144772
5: 197587: 173563
6: 215377: 208448
7: 241061: 234087
8: 267155: 263373
9: 295069: 297402
10: 312393: 335213
11: 340462: 368778
12: 371366: 403954
13: 412344: 443713
14: 426617: 473580
15: 474418: 507861
16: 503261: 538539
17: 522331: 563096
18: 532409: 567084
19: 550824: 605240
20: 525493: 641988
21: 564574: 665843
22: 567349: 690868
23: 583846: 710917
24: 588715: 736306
25: 603212: 763494
26: 604083: 792654
27: 602241: 796450
28: 604291: 797993
29: 611610: 833249
30: 577356: 841062

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet 0b7d7f6b22 tcp: add tcp_tx_skb_cache sysctl
Feng Tang reported a performance regression after introduction
of per TCP socket tx/rx caches, for TCP over loopback (netperf)

There is high chance the regression is caused by a change on
how well the 32 KB per-thread page (current->task_frag) can
be recycled, and lack of pcp caches for order-3 pages.

I could not reproduce the regression myself, cpus all being
spinning on the mm spinlocks for page allocs/freeing, regardless
of enabling or disabling the per tcp socket caches.

It seems best to disable the feature by default, and let
admins enabling it.

MM layer either needs to provide scalable order-3 pages
allocations, or could attempt a trylock on zone->lock if
the caller only attempts to get a high-order page and is
able to fallback to order-0 ones in case of pressure.

Tests run on a 56 cores host (112 hyper threads)

-	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
   - 35.49% queued_spin_lock_slowpath
	  - 18.18% get_page_from_freelist
		 - __alloc_pages_nodemask
			- 18.18% alloc_pages_current
				 skb_page_frag_refill
				 sk_page_frag_refill
				 tcp_sendmsg_locked
				 tcp_sendmsg
				 inet_sendmsg
				 sock_sendmsg
				 __sys_sendto
				 __x64_sys_sendto
				 do_syscall_64
				 entry_SYSCALL_64_after_hwframe
				 __libc_send
	  + 17.31% __free_pages_ok
+	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
+	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
+	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
+	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
+	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
+	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet ede61ca474 tcp: add tcp_rx_skb_cache sysctl
Instead of relying on rps_needed, it is safer to use a separate
static key, since we do not want to enable TCP rx_skb_cache
by default. This feature can cause huge increase of memory
usage on hosts with millions of sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet a8e11e5c56 sysctl: define proc_do_static_key()
Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.

Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:27 -07:00
Haiyang Zhang 9a33629ba6 hv_netvsc: Set probe mode to sync
For better consistency of synthetic NIC names, we set the probe mode to
PROBE_FORCE_SYNCHRONOUS. So the names can be aligned with the vmbus
channel offer sequence.

Fixes: af0a5646cb ("use the new async probing feature for the hyperv drivers")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:47:05 -07:00