Commit Graph

506 Commits

Author SHA1 Message Date
Jason Wang fe8dd45bb7 tun: switch to new type of msg_control
This patch introduces to a new tun/tap specific msg_control:

#define TUN_MSG_UBUF 1
#define TUN_MSG_PTR  2
struct tun_msg_ctl {
       int type;
       void *ptr;
};

This allows us to pass different kinds of msg_control through
sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF) which will
be used by the existed vhost_net zerocopy code. The second is XDP
buff, which allows vhost_net to pass XDP buff to TUN. This could be
used to implement accepting an array of XDP buffs from vhost_net in
the following patches.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang 1a097910ad tuntap: move XDP flushing out of tun_do_xdp()
This will allow adding batch flushing on top.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang 8ae1aff0b3 tuntap: split out XDP logic
This patch split out XDP logic into a single function. This make it to
be reused by XDP batching path in the following patch.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang ac1f1f6c5a tuntap: tweak on the path of skb XDP case in tun_build_skb()
If we're sure not to go native XDP, there's no need for several things
like bh and rcu stuffs. So this patch introduces a helper to build skb
and hold page refcnt. When we found we will go through skb path, build
skb directly.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang f7053b6ccb tuntap: simplify error handling in tun_build_skb()
There's no need to duplicate page get logic in each action. So this
patch tries to get page and calculate the offset before processing XDP
actions (except for XDP_DROP), and undo them when meet errors (we
don't care the performance on errors). This will be used for factoring
out XDP logic.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang 291aeb2b1d tuntap: enable bh early during processing XDP
This patch move the bh enabling a little bit earlier, this will be
used for factoring out the core XDP logic of tuntap.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang 4f23aff871 tuntap: switch to use XDP_PACKET_HEADROOM
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Jason Wang e4a2a3048e net: sock: introduce SOCK_XDP
This patch introduces a new sock flag - SOCK_XDP. This will be used
for notifying the upper layer that XDP program is attached on the
lower socket, and requires for extra headroom.

TUN will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:25:40 -07:00
Linus Torvalds 0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Li RongQing f13b546847 tun: not use hardcoded mask value
0x3ff in tun_hashfn is mask of TUN_NUM_FLOW_ENTRIES, instead
of hardcode, define a macro to setup the relationship with
TUN_NUM_FLOW_ENTRIES

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 13:15:11 -07:00
Eric W. Biederman 019191342f signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
When f_setown is called a pid and a pid type are stored.  Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group.  Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.

Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.

For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.

Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
David S. Miller c4c5551df1 Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux
All conflicts were trivial overlapping changes, so reasonably
easy to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 21:17:12 -07:00
Toshiaki Makita 6e8cfd6d9d tun: Fix use-after-free on XDP_TX
On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a
negative value. A positive value indicates that the packet is
successfully enqueued to the ptr_ring, so freeing the page causes
use-after-free.

Fixes: 735fc4054b ("xdp: change ndo_xdp_xmit API to support bulking")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 13:38:29 -07:00
David S. Miller 2aa4a3378a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-07-15

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various different arm32 JIT improvements in order to optimize code emission
   and make the JIT code itself more robust, from Russell.

2) Support simultaneous driver and offloaded XDP in order to allow for advanced
   use-cases where some work is offloaded to the NIC and some to the host. Also
   add ability for bpftool to load programs and maps beyond just the cgroup case,
   from Jakub.

3) Add BPF JIT support in nfp for multiplication as well as division. For the
   latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.

4) Add BTF pretty print functionality to bpftool in plain and JSON output
   format, from Okash.

5) Add build and installation to the BPF helper man page into bpftool, from Quentin.

6) Add a TCP BPF callback for listening sockets which is triggered right after
   the socket transitions to TCP_LISTEN state, from Andrey.

7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
   tree and prints all attached programs, from Roman.

8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
   packets, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14 18:47:44 -07:00
Jakub Kicinski 6b86758973 xdp: don't make drivers report attachment mode
prog_attached of struct netdev_bpf should have been superseded
by simply setting prog_id long time ago, but we kept it around
to allow offloading drivers to communicate attachment mode (drv
vs hw).  Subsequently drivers were also allowed to report back
attachment flags (prog_flags), and since nowadays only programs
attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
the attachment mode from the flags driver reports.  Remove
prog_attached member.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13 20:26:35 +02:00
Alexander Duyck 4f49dec907 net: allow ndo_select_queue to pass netdev
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused on possible on the exception cases.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09 13:41:34 -07:00
Willem de Bruijn fd3a886258 net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan
Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.

For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer.  Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.

Commit f09e2249c4 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f19 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.

Virtio, packet and uml do not insert the vlan header in the user
buffer.

When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.

Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.

Fixes: e858fae2b0 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eee ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07 16:15:38 -04:00
David S. Miller fd129f8941 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-06-05

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
   connect: "This allows to override source IP (including the case when it's
   set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
   TCP and connected UDP (fast path) are not affected. This makes UDP support
   complete, that is, connected UDP is handled by connect hooks, unconnected
   by sendmsg ones.", from Andrey.

2) Rework of the AF_XDP API to allow extending it in future for type writer
   model if necessary. In this mode a memory window is passed to hardware
   and multiple frames might be filled into that window instead of just one
   that is the case in the current fixed frame-size model. With the new
   changes made this can be supported without having to add a new descriptor
   format. Also, core bits for the zero-copy support for AF_XDP have been
   merged as agreed upon, where i40e bits will be routed via Jeff later on.
   Various improvements to documentation and sample programs included as
   well, all from Björn and Magnus.

3) Given BPF's flexibility, a new program type has been added to implement
   infrared decoders. Quote: "The kernel IR decoders support the most
   widely used IR protocols, but there are many protocols which are not
   supported. [...] There is a 'long tail' of unsupported IR protocols,
   for which lircd is need to decode the IR. IR encoding is done in such
   a way that some simple circuit can decode it; therefore, BPF is ideal.
   [...] user-space can define a decoder in BPF, attach it to the rc
   device through the lirc chardev.", from Sean.

4) Several improvements and fixes to BPF core, among others, dumping map
   and prog IDs into fdinfo which is a straight forward way to correlate
   BPF objects used by applications, removing an indirect call and therefore
   retpoline in all map lookup/update/delete calls by invoking the callback
   directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
   for tc BPF programs to have an efficient way of looking up cgroup v2 id
   for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
   state that hasn't been filled, to allow context access wrt pt_regs in
   32 bit archs for tracing, and last but not least various test cases
   for fixes that landed in bpf earlier, from Daniel.

5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
   a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
   call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.

6) Add a new bpf_get_current_cgroup_id() helper that can be used in
   tracing to retrieve the cgroup id from the current process in order
   to allow for e.g. aggregation of container-level events, from Yonghong.

7) Two follow-up fixes for BTF to reject invalid input values and
   related to that also two test cases for BPF kselftests, from Martin.

8) Various API improvements to the bpf_fib_lookup() helper, that is,
   dropping MPLS bits which are not fully hashed out yet, rejecting
   invalid helper flags, returning error for unsupported address
   families as well as renaming flowlabel to flowinfo, from David.

9) Various fixes and improvements to sockmap BPF kselftests in particular
   in proper error detection and data verification, from Prashant.

10) Two arm32 BPF JIT improvements. One is to fix imm range check with
    regards to whether immediate fits into 24 bits, and a naming cleanup
    to get functions related to rsh handling consistent to those handling
    lsh, from Wang.

11) Two compile warning fixes in BPF, one for BTF and a false positive
    to silent gcc in stack_map_get_build_id_offset(), from Arnd.

12) Add missing seg6.h header into tools include infrastructure in order
    to fix compilation of BPF kselftests, from Mathieu.

13) Several formatting cleanups in the BPF UAPI helper description that
    also fix an error during rst2man compilation, from Quentin.

14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
    not built into the kernel, from Yue.

15) Remove a useless double assignment in dev_map_enqueue(), from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05 12:42:19 -04:00
Jesper Dangaard Brouer 42421a5654 tun: remove ndo_xdp_flush call tun_xdp_flush
Remove the ndo_xdp_flush call implementation tun_xdp_flush
as no callers of ndo_xdp_flush are left.

The tun drivers XDP_TX implementation also used tun_xdp_flush (and
tun_xdp_xmit).  This is easily solved by passing the XDP_XMIT_FLUSH
flag to tun_xdp_xmit in tun_xdp_tx.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-05 14:03:16 +02:00
Chas Williams 4e24f2dd51 Allow ethtool to change tun link settings
Let user space set whatever it would like to advertise for the
tun interface.  Preserve the existing defaults.

Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 16:05:16 -04:00
Jesper Dangaard Brouer 0c9d917b7d tun: implement flush flag for ndo_xdp_xmit
When passed the XDP_XMIT_FLUSH flag tun_xdp_xmit now performs the same
kind of socket wake up as in tun_xdp_flush(). The wake up code from
tun_xdp_flush is generalized and shared with tun_xdp_xmit.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 08:11:34 -07:00
Jesper Dangaard Brouer 42b3346898 xdp: add flags argument to ndo_xdp_xmit API
This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.

The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 08:11:34 -07:00
David S. Miller 9c54aeb03a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-03 09:31:58 -04:00
Toshiaki Makita 6547e387d7 tun: Fix NULL pointer dereference in XDP redirect
Calling XDP redirection requires bh disabled. Softirq can call another
XDP function and redirection functions, then the percpu static variable
ri->map can be overwritten to NULL.

This is a generic XDP case called from tun.

[ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 3535.743974] PGD 0 P4D 0
[ 3535.746530] Oops: 0000 [#1] SMP PTI
[ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel scsi_transport_sas tg3 i2c_core ptp pps_core
[ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2
[ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
[ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
[ 3535.833740] RSP: 0018:ffffb4bc47bf7c58 EFLAGS: 00010246
[ 3535.839009] RAX: ffff9fdfcfea1c40 RBX: 0000000000000000 RCX: ffff9fdf27fe3100
[ 3535.846205] RDX: ffff9fdfca769200 RSI: 0000000000000000 RDI: 0000000000000000
[ 3535.853402] RBP: ffffb4bc491d9000 R08: 00000000000045ad R09: 0000000000000ec0
[ 3535.860597] R10: 0000000000000001 R11: ffff9fdf26c3ce4e R12: ffff9fdf9e72c000
[ 3535.867794] R13: 0000000000000000 R14: fffffffffffffff2 R15: ffff9fdfc82cdd00
[ 3535.874990] FS:  0000000000000000(0000) GS:ffff9fdfcfe80000(0000) knlGS:0000000000000000
[ 3535.883152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3535.888948] CR2: 0000000000000018 CR3: 0000000bde724004 CR4: 00000000007626e0
[ 3535.896145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3535.903342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3535.910538] PKRU: 55555554
[ 3535.913267] Call Trace:
[ 3535.915736]  xdp_do_generic_redirect+0x7a/0x310
[ 3535.920310]  do_xdp_generic.part.117+0x285/0x370
[ 3535.924970]  tun_get_user+0x5b9/0x1260 [tun]
[ 3535.929279]  tun_sendmsg+0x52/0x70 [tun]
[ 3535.933237]  handle_tx+0x2ad/0x5f0 [vhost_net]
[ 3535.937721]  vhost_worker+0xa5/0x100 [vhost]
[ 3535.942030]  kthread+0xf5/0x130
[ 3535.945198]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
[ 3535.950031]  ? kthread_bind+0x10/0x10
[ 3535.953727]  ret_from_fork+0x35/0x40
[ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3
[ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffffb4bc47bf7c58
[ 3535.982883] CR2: 0000000000000018
[ 3535.987096] ---[ end trace 383b299dd1430240 ]---
[ 3536.131325] Kernel panic - not syncing: Fatal exception
[ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]---

And a kernel with generic case fixed still panics in tun driver XDP
redirect, because it disabled only preemption, but not bh.

[ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 2055.136662] PGD 0 P4D 0
[ 2055.139219] Oops: 0000 [#1] SMP PTI
[ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas crc32c_intel i2c_core ptp pps_core
[ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G        W         4.17.0-rc5-fix-tun+ #1
[ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
[ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
[ 2055.228624] RSP: 0018:ffff998b07607cc0 EFLAGS: 00010246
[ 2055.233892] RAX: ffff8dbd8e235700 RBX: ffff8dbd8ff21c40 RCX: 0000000000000004
[ 2055.241089] RDX: ffff998b097a9000 RSI: 0000000000000000 RDI: 0000000000000000
[ 2055.248286] RBP: 0000000000000000 R08: 00000000000065a8 R09: 0000000000005d80
[ 2055.255483] R10: 0000000000000040 R11: ffff8dbcf0100000 R12: ffff998b097a9000
[ 2055.262681] R13: ffff8dbd8c98c000 R14: 0000000000000000 R15: ffff998b07607d78
[ 2055.269879] FS:  0000000000000000(0000) GS:ffff8dbd8ff00000(0000) knlGS:0000000000000000
[ 2055.278039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2055.283834] CR2: 0000000000000018 CR3: 0000000c0c8cc005 CR4: 00000000007626e0
[ 2055.291030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2055.298227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2055.305424] PKRU: 55555554
[ 2055.308153] Call Trace:
[ 2055.310624]  xdp_do_redirect+0x7b/0x380
[ 2055.314499]  tun_get_user+0x10fe/0x12a0 [tun]
[ 2055.318895]  tun_sendmsg+0x52/0x70 [tun]
[ 2055.322852]  handle_tx+0x2ad/0x5f0 [vhost_net]
[ 2055.327337]  vhost_worker+0xa5/0x100 [vhost]
[ 2055.331646]  kthread+0xf5/0x130
[ 2055.334813]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
[ 2055.339646]  ? kthread_bind+0x10/0x10
[ 2055.343343]  ret_from_fork+0x35/0x40
[ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3
[ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffff998b07607cc0
[ 2055.372500] CR2: 0000000000000018
[ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]---
[ 2055.523626] Kernel panic - not syncing: Fatal exception
[ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]---

v2:
 - Removed preempt_disable/enable since local_bh_disable will prevent
   preemption as well, feedback from Jason Wang.

Fixes: 761876c857 ("tap: XDP support")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 11:00:14 -04:00
David S. Miller 5b79c2af66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of easy overlapping changes in the confict
resolutions here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26 19:46:15 -04:00
David S. Miller 90fed9c946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-05-24

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).

2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.

3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.

4) Jiong Wang adds support for indirect and arithmetic shifts to NFP

5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.

6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
   to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.

7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.

8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.

9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 22:20:51 -04:00
Jesper Dangaard Brouer 735fc4054b xdp: change ndo_xdp_xmit API to support bulking
This patch change the API for ndo_xdp_xmit to support bulking
xdp_frames.

When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by DMA API indirect function calls, but
also the net_device->ndo_xdp_xmit() call.

Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
performance improved:
 for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
 for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

With frames avail as a bulk inside the driver ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.

Testing without CONFIG_RETPOLINE show the same performance for
physical NIC drivers.

The virtual NIC driver tun sees a huge performance boost, as it can
avoid doing per frame producer locking, but instead amortize the
locking cost over the bulk.

V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24 18:36:15 -07:00
Jason Wang 2f3ab6221e tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
When link is down, writes to the device might fail with
-EIO. Userspace needs an indication when the status is resolved.  As a
fix, tun_net_open() attempts to wake up writers - but that is only
effective if SOCKWQ_ASYNC_NOSPACE has been set in the past. This is
not the case of vhost_net which only poll for EPOLLOUT after it meets
errors during sendmsg().

This patch fixes this by making sure SOCKWQ_ASYNC_NOSPACE is set when
socket is not writable or device is down to guarantee EPOLLOUT will be
raised in either tun_chr_poll() or tun_sock_write_space() after device
is up.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Fixes: 1bd4978a88 ("tun: honor IFF_UP in tun_get_user()")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 14:32:12 -04:00
David S. Miller 6f6e434aa2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.

TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.

The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.

Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21 16:01:54 -04:00
Jason Wang 7063efd33b tuntap: fix use after free during release
After commit b196d88aba ("tun: fix use after free for ptr_ring") we
need clean up tx ring during release(). But unfortunately, it tries to
do the cleanup blindly after socket were destroyed which will lead
another use-after-free. Fix this by doing the cleanup before dropping
the last reference of the socket in __tun_detach().

Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Fixes: b196d88aba ("tun: fix use after free for ptr_ring")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 14:53:10 -04:00
Jason Wang b196d88aba tun: fix use after free for ptr_ring
We used to initialize ptr_ring during TUNSETIFF, this is because its
size depends on the tx_queue_len of netdevice. And we try to clean it
up when socket were detached from netdevice. A race were spotted when
trying to do uninit during a read which will lead a use after free for
pointer ring. Solving this by always initialize a zero size ptr_ring
in open() and do resizing during TUNSETIFF, and then we can safely do
cleanup during close(). With this, there's no need for the workaround
that was introduced by commit 4df0bfc799 ("tun: fix a memory leak
for tfile->tx_array").

Reported-by: syzbot+e8b902c3c3fadf0a9dba@syzkaller.appspotmail.com
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Fixes: 1576d98605 ("tun: switch to use skb array for tx")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-14 14:47:08 -04:00
Kirill Tkhai f663706a33 tun: Do SIOCGSKNS out of rtnl_lock()
Since net ns of tun device is assigned on the device creation,
and it never changes, we do not need to use any lock to get it
from alive tun.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 15:16:49 -04:00
David S. Miller e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
David S. Miller 1b80f86ed6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 15:56:15 -04:00
Paolo Abeni cf1a1e07fc tun: do not compute the rxhash, if not needed
Currently, the tun driver, in absence of an eBPF steering program,
always compute the rxhash in its rx path, even when such value
is later unused due to additional checks (

This changeset moves the all the related checks just before the
__skb_get_hash_symmetric(), so that the latter is no more computed
when unneeded.

Also replace an unneeded RCU section with rcu_access_pointer().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:51:28 -04:00
Nikita V. Shirokov 8fb58f1ecf bpf: make tun compatible w/ bpf_xdp_adjust_tail
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for tun driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
(in case if after xdp's prog run data_end pointer was adjusted)

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Bjørn Mork 81c895072d tun: fix vlan packet truncation
Bogus trimming in tun_net_xmit() causes truncated vlan packets.

skb->len is correct whether or not skb_vlan_tag_present() is true. There
is no more reason to adjust the skb length on xmit in this driver than
any other driver. tun_put_user() adds 4 bytes to the total for tagged
packets because it transmits the tag inline to userspace.  This is
similar to a nic transmitting the tag inline on the wire.

Reproducing the bug by sending any tagged packet through back-to-back
connected tap interfaces:

 socat TUN,tun-type=tap,iff-up,tun-name=in TUN,tun-type=tap,iff-up,tun-name=out &
 ip link add link in name in.20 type vlan id 20
 ip addr add 10.9.9.9/24 dev in.20
 ip link set in.20 up
 tshark -nxxi in -f arp -c1 2>/dev/null &
 tshark -nxxi out -f arp -c1 2>/dev/null &
 ping -c 1 10.9.9.5 >/dev/null 2>&1

The output from the 'in' and 'out' interfaces are different when the
bug is present:

 Capturing on 'in'
 0000  ff ff ff ff ff ff 76 cf 76 37 d5 0a 81 00 00 14   ......v.v7......
 0010  08 06 00 01 08 00 06 04 00 01 76 cf 76 37 d5 0a   ..........v.v7..
 0020  0a 09 09 09 00 00 00 00 00 00 0a 09 09 05         ..............

 Capturing on 'out'
 0000  ff ff ff ff ff ff 76 cf 76 37 d5 0a 81 00 00 14   ......v.v7......
 0010  08 06 00 01 08 00 06 04 00 01 76 cf 76 37 d5 0a   ..........v.v7..
 0020  0a 09 09 09 00 00 00 00 00 00                     ..........

Fixes: aff3d70a07 ("tun: allow to attach ebpf socket filter")
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 13:50:29 -04:00
Jesper Dangaard Brouer 44fa2dbd47 xdp: transition into using xdp_frame for ndo_xdp_xmit
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:30 -04:00
Jesper Dangaard Brouer 039930945a xdp: transition into using xdp_frame for return API
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame.  The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer 8d5d885275 xdp: rhashtable with allocator ID to pointer mapping
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.

The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even-though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse should be static warnings (from kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer 1ffcbc8537 tun: convert to use generic xdp_frame and xdp_return_frame API
The tuntap driver invented it's own driver specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.

Convert it over to use the more generic xdp_frame structure.  The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trused/used when dequeueing the frame.

V3: Remove check based on feedback from Jason

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:28 -04:00
Sabrina Dubroca 83c1f36f98 tun: send netlink notification when the device is modified
I added dumping of link information about tun devices over netlink in
commit 1ec010e705 ("tun: export flags, uid, gid, queue information
over netlink"), but didn't add the missing netlink notifications when
the device's exported properties change.

This patch adds notifications when owner/group or flags are modified,
when queues are attached/detached, and when a tun fd is closed.

Reported-by: Thomas Haller <thaller@redhat.com>
Fixes: 1ec010e705 ("tun: export flags, uid, gid, queue information over netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11 14:48:05 -04:00
Sabrina Dubroca 9fffc5c6dd tun: set the flags before registering the netdevice
Otherwise, register_netdevice advertises the creation of the device with
the default flags, instead of what the user requested.

Reported-by: Thomas Haller <thaller@redhat.com>
Fixes: 1ec010e705 ("tun: export flags, uid, gid, queue information over netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11 14:48:05 -04:00
David S. Miller 03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Jason Wang 59655a5b6c tuntap: XDP_TX can use native XDP
Now we have ndo_xdp_xmit, switch to use it instead of the slow generic
XDP TX routine. XDP_TX on TAP gets ~20% improvements from ~1.5Mpps to
~1.8Mpps on 2.60GHz Core(TM) i7-5600U.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-14 13:20:48 -04:00
Jason Wang 3a4030761e vhost_net: examine pointer types during un-producing
After commit fc72d1d54d ("tuntap: XDP transmission"), we can
actually queueing XDP pointers in the pointer ring, so we should
examine the pointer type before freeing the pointer.

Fixes: fc72d1d54d ("tuntap: XDP transmission")
Reported-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 12:02:59 -05:00
David S. Miller 0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Jason Wang 1bb4f2e868 tuntap: correctly add the missing XDP flush
We don't flush batched XDP packets through xdp_do_flush_map(), this
will cause packets stall at TX queue. Consider we don't do XDP on NAPI
poll(), the only possible fix is to call xdp_do_flush_map()
immediately after xdp_do_redirect().

Note, this in fact won't try to batch packets through devmap, we could
address in the future.

Reported-by: Christoffer Dall <christoffer.dall@linaro.org>
Fixes: 761876c857 ("tap: XDP support")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26 13:49:49 -05:00
Jason Wang 23e43f07f8 tuntap: disable preemption during XDP processing
Except for tuntap, all other drivers' XDP was implemented at NAPI
poll() routine in a bh. This guarantees all XDP operation were done at
the same CPU which is required by e.g BFP_MAP_TYPE_PERCPU_ARRAY. But
for tuntap, we do it in process context and we try to protect XDP
processing by RCU reader lock. This is insufficient since
CONFIG_PREEMPT_RCU can preempt the RCU reader critical section which
breaks the assumption that all XDP were processed in the same CPU.

Fixing this by simply disabling preemption during XDP processing.

Fixes: 761876c857 ("tap: XDP support")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26 13:49:48 -05:00
Jason Wang f249be4d2c Revert "tuntap: add missing xdp flush"
This reverts commit 762c330d67. The
reason is we try to batch packets for devmap which causes calling
xdp_do_flush() in the process context. Simply disabling preemption
may not work since process may move among processors which lead
xdp_do_flush() to miss some flushes on some processors.

So simply revert the patch, a follow-up patch will add the xdp flush
correctly.

Reported-by: Christoffer Dall <christoffer.dall@linaro.org>
Fixes: 762c330d67 ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26 13:49:48 -05:00
David S. Miller f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Sabrina Dubroca 1ec010e705 tun: export flags, uid, gid, queue information over netlink
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:58:22 -05:00
Eric Dumazet 43a08e0f58 tun: fix tun_napi_alloc_frags() frag allocator
<Mark Rutland reported>
    While fuzzing arm64 v4.16-rc1 with Syzkaller, I've been hitting a
    misaligned atomic in __skb_clone:

        atomic_inc(&(skb_shinfo(skb)->dataref));

   where dataref doesn't have the required natural alignment, and the
   atomic operation faults. e.g. i often see it aligned to a single
   byte boundary rather than a four byte boundary.

   AFAICT, the skb_shared_info is misaligned at the instant it's
   allocated in __napi_alloc_skb()  __napi_alloc_skb()
</end of report>

Problem is caused by tun_napi_alloc_frags() using
napi_alloc_frag() with user provided seg sizes,
leading to other users of this API getting unaligned
page fragments.

Since we would like to not necessarily add paddings or alignments to
the frags that tun_napi_alloc_frags() attaches to the skb, switch to
another page frag allocator.

As a bonus skb_page_frag_refill() can use GFP_KERNEL allocations,
meaning that we can not deplete memory reserves as easily.

Fixes: 90e33d4594 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:20:46 -05:00
Kirill Tkhai f2780d6d74 tun: Add ioctl() SIOCGSKNS cmd to allow obtaining net ns of tun device
This patch adds possibility to get tun device's net namespace fd
in the same way we allow to do that for sockets.

Socket ioctl numbers do not intersect with tun-specific, and there
is already SIOCSIFHWADDR used in tun code. So, SIOCGSKNS number
is choosen instead of custom-made for this functionality.

Note, that open_related_ns() uses plain get_net_ns() and it's safe
(net can't be already dead at this moment):

  tun socket is allocated via sk_alloc() with zero last arg (kern = 0).
  So, each alive socket increments net::count, and the socket is definitely
  alive during ioctl syscall.

Also, common variable net is introduced, so small cleanup in TUNSETIFF
is made.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-15 15:34:42 -05:00
Linus Torvalds a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Jason Wang 762c330d67 tuntap: add missing xdp flush
When using devmap to redirect packets between interfaces,
xdp_do_flush() is usually a must to flush any batched
packets. Unfortunately this is missed in current tuntap
implementation.

Unlike most hardware driver which did XDP inside NAPI loop and call
xdp_do_flush() at then end of each round of poll. TAP did it in the
context of process e.g tun_get_user(). So fix this by count the
pending redirected packets and flush when it exceeds NAPI_POLL_WEIGHT
or MSG_MORE was cleared by sendmsg() caller.

With this fix, xdp_redirect_map works again between two TAPs.

Fixes: 761876c857 ("tap: XDP support")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08 14:10:30 -05:00
Linus Torvalds b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Linus Torvalds 168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Cong Wang c13da21cdb tun: avoid calling xdp_rxq_info_unreg() twice
Similarly to tx ring, xdp_rxq_info is only registered
when !tfile->detached, so we need to avoid calling
xdp_rxq_info_unreg() twice too. The helper tun_cleanup_tx_ring()
already checks for this properly, so it is correct to put
xdp_rxq_info_unreg() just inside there.

Reported-by: syzbot+1c788d7ce0f0888f1d7f@syzkaller.appspotmail.com
Fixes: 8565d26bcb ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:55:18 -05:00
Jason Wang 9d6474e458 tun: add missing rcu annotation
This patch fixes the following sparse warnings:

drivers/net/tun.c:2241:15: error: incompatible types in comparison expression (different address spaces)

Fixes: cd5681d7d8 ("tuntap: rename struct tun_steering_prog to struct tun_prog")
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 09:16:18 -05:00
David S. Miller 8565d26bcb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The BPF verifier conflict was some minor contextual issue.

The TUN conflict was less trivial.  Cong Wang fixed a memory leak of
tfile->tx_array in 'net'.  This is an skb_array.  But meanwhile in
net-next tun changed tfile->tx_arry into tfile->tx_ring which is a
ptr_ring.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 22:59:33 -05:00
Jason Wang aff3d70a07 tun: allow to attach ebpf socket filter
This patch allows userspace to attach eBPF filter to tun. This will
allow to implement VM dataplane filtering in a more efficient way
compared to cBPF filter by allowing either qemu or libvirt to
attach eBPF filter to tun.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:32:10 -05:00
Jason Wang cd5681d7d8 tuntap: rename struct tun_steering_prog to struct tun_prog
To be reused by other eBPF program other than queue selection.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:32:10 -05:00
Cong Wang 4df0bfc799 tun: fix a memory leak for tfile->tx_array
tfile->tun could be detached before we close the tun fd,
via tun_detach_all(), so it should not be used to check for
tfile->tx_array.

As Jason suggested, we probably have to clean it up
unconditionally both in __tun_deatch() and tun_detach_all(),
but this requires to check if it is initialized or not.
Currently skb_array_cleanup() doesn't have such a check,
so I check it in the caller and introduce a helper function,
it is a bit ugly but we can always improve it in net-next.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 1576d98605 ("tun: switch to use skb array for tx")
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:08:28 -05:00
Jason Wang fc72d1d54d tuntap: XDP transmission
This patch implements XDP transmission for TAP. Since we can't create
new queues for TAP during XDP set, exist ptr_ring was reused for
queuing XDP buffers. To differ xdp_buff from sk_buff, TUN_XDP_FLAG
(0x1UL) was encoded into lowest bit of xpd_buff pointer during
ptr_ring_produce, and was decoded during consuming. XDP metadata was
stored in the headroom of the packet which should work in most of
cases since driver usually reserve enough headroom. Very minor changes
were done for vhost_net: it just need to peek the length depends on
the type of pointer.

Tests were done on two Intel E5-2630 2.40GHz machines connected back
to back through two 82599ES. Traffic were generated/received through
MoonGen/testpmd(rxonly). It reports ~20% improvements when
xdp_redirect_map is doing redirection from ixgbe to TAP (from 2.50Mpps
to 3.05Mpps)

Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 10:57:08 -05:00
Jason Wang 5990a30510 tun/tap: use ptr_ring instead of skb_array
This patch switches to use ptr_ring instead of skb_array. This will be
used to enqueue different types of pointers by encoding type into
lower bits.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 10:56:10 -05:00
Jesper Dangaard Brouer 8bf5c4ee18 tun: setup xdp_rxq_info
Driver hook points for xdp_rxq_info:
 * reg  : tun_attach
 * unreg: __tun_detach

I've done some manual testing of this tun driver, but I would
appriciate good review and someone else running their use-case tests,
as I'm not 100% sure I understand the tfile->detached semantics.

V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang.

Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05 15:21:22 -08:00
Jason Wang 124da8f611 tuntap: fix possible deadlock when fail to register netdev
Private destructor could be called when register_netdev() fail with
rtnl lock held. This will lead deadlock in tun_free_netdev() who tries
to hold rtnl_lock. Fixing this by switching to use spinlock to
synchronize.

Fixes: 96f8406162 ("tun: add eBPF based queue selection method")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 10:44:27 -05:00
Willem de Bruijn cc166427dc tun: avoid unnecessary READ_ONCE in tun_net_xmit
The statement no longer serves a purpose.

Commit fa35864e0b ("tuntap: Fix for a race in accessing numqueues")
added the ACCESS_ONCE to avoid a race condition with skb_queue_len.

Commit 436accebb5 ("tuntap: remove unnecessary sk_receive_queue
length check during xmit") removed the affected skb_queue_len check.

Commit 96f8406162 ("tun: add eBPF based queue selection method")
split the function, reading the field a second time in the callee.
The temp variable is now only read once, so just remove it.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:09:13 -05:00
Jason Wang 96f8406162 tun: add eBPF based queue selection method
This patch introduces an eBPF based queue selection method. With this,
the policy could be offloaded to userspace completely through a new
ioctl TUNSETSTEERINGEBPF.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 12:01:49 -05:00
David S. Miller 7cda4cee13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 10:44:19 -05:00
Wei Xu c33ee15b38 tun: free skb in early errors
tun_recvmsg() supports accepting skb by msg_control after
commit ac77cfd425 ("tun: support receiving skb through msg_control"),
the skb if presented should be freed no matter how far it can go
along, otherwise it would be leaked.

This patch fixes several missed cases.

Signed-off-by: Wei Xu <wexu@redhat.com>
Reported-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-02 21:31:03 -05:00
Jakub Kicinski bd0b2e7fe6 net: xdp: make the stack take care of the tear down
Since day one of XDP drivers had to remember to free the program
on the remove path.  This leads to code duplication and is error
prone.  Make the stack query the installed programs on unregister
and if something is installed, remove the program.  Freeing of
program attached to XDP generic is moved from free_netdev() as well.

Because the remove will now be called before notifiers are
invoked, BPF offload state of the program will not get destroyed
before uninstall.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-03 00:27:57 +01:00
Al Viro afc9a42b74 the rest of drivers/*: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-28 11:06:58 -05:00
Linus Torvalds 844056fd74 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:

 - The final conversion of timer wheel timers to timer_setup().

   A few manual conversions and a large coccinelle assisted sweep and
   the removal of the old initialization mechanisms and the related
   code.

 - Remove the now unused VSYSCALL update code

 - Fix permissions of /proc/timer_list. I still need to get rid of that
   file completely

 - Rename a misnomed clocksource function and remove a stale declaration

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  m68k/macboing: Fix missed timer callback assignment
  treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts
  timer: Remove redundant __setup_timer*() macros
  timer: Pass function down to initialization routines
  timer: Remove unused data arguments from macros
  timer: Switch callback prototype to take struct timer_list * argument
  timer: Pass timer_list pointer to callbacks unconditionally
  Coccinelle: Remove setup_timer.cocci
  timer: Remove setup_*timer() interface
  timer: Remove init_timer() interface
  treewide: setup_timer() -> timer_setup() (2 field)
  treewide: setup_timer() -> timer_setup()
  treewide: init_timer() -> setup_timer()
  treewide: Switch DEFINE_TIMER callbacks to struct timer_list *
  s390: cmm: Convert timers to use timer_setup()
  lightnvm: Convert timers to use timer_setup()
  drivers/net: cris: Convert timers to use timer_setup()
  drm/vc4: Convert timers to use timer_setup()
  block/laptop_mode: Convert timers to use timer_setup()
  net/atm/mpc: Avoid open-coded assignment of timer callback function
  ...
2017-11-25 08:37:16 -10:00
Willem de Bruijn 0c19f846d5 net: accept UFO datagrams from tuntap and packet
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
  Booted a v4.13 guest kernel with QEMU. On a host kernel before this
  patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
  enabled, same as on a v4.13 host kernel.

  A UFO packet sent from the guest appears on the tap device:
    host:
      nc -l -p -u 8000 &
      tcpdump -n -i tap0

    guest:
      dd if=/dev/zero of=payload.txt bs=1 count=2000
      nc -u 192.16.1.1 8000 < payload.txt

  Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
  packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
  v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 01:37:35 +09:00
Kees Cook e99e88a9d2 treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
 _E->_timer@_stl.function = _callback;
|
 _E->_timer@_stl.function = &_callback;
|
 _E->_timer@_stl.function = (_cast_func)_callback;
|
 _E->_timer@_stl.function = (_cast_func)&_callback;
|
 _E._timer@_stl.function = _callback;
|
 _E._timer@_stl.function = &_callback;
|
 _E._timer@_stl.function = (_cast_func)_callback;
|
 _E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:07 -08:00
Xin Long 654d573845 tun: fix rcu_read_lock imbalance in tun_build_skb
rcu_read_lock in tun_build_skb is used to rcu_dereference tun->xdp_prog
safely, rcu_read_unlock should be done in every return path.

Now I could see one place missing it, where it returns NULL in switch-case
XDP_REDIRECT,  another palce using rcu_read_lock wrongly, where it returns
NULL in if (xdp_xmit) chunk.

So fix both in this patch.

Fixes: 761876c857 ("tap: XDP support")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-19 21:22:31 +09:00
Linus Torvalds 5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Ingo Molnar 8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Jakub Kicinski f4e63525ee net: bpf: rename ndo_xdp to ndo_bpf
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
David S. Miller ed29668d1a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Smooth Cong Wang's bug fix into 'net-next'.  Basically put
the bulk of the tcf_block_put() logic from 'net' into
tcf_block_put_ext(), but after the offload unbind.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 15:23:39 +09:00
Craig Gallek 93161922c6 tun/tap: sanitize TUNSETSNDBUF input
Syzkaller found several variants of the lockup below by setting negative
values with the TUNSETSNDBUF ioctl.  This patch adds a sanity check
to both the tun and tap versions of this ioctl.

  watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [repro:2389]
  Modules linked in:
  irq event stamp: 329692056
  hardirqs last  enabled at (329692055): [<ffffffff824b8381>] _raw_spin_unlock_irqrestore+0x31/0x75
  hardirqs last disabled at (329692056): [<ffffffff824b9e58>] apic_timer_interrupt+0x98/0xb0
  softirqs last  enabled at (35659740): [<ffffffff824bc958>] __do_softirq+0x328/0x48c
  softirqs last disabled at (35659731): [<ffffffff811c796c>] irq_exit+0xbc/0xd0
  CPU: 0 PID: 2389 Comm: repro Not tainted 4.14.0-rc7 #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  task: ffff880009452140 task.stack: ffff880006a20000
  RIP: 0010:_raw_spin_lock_irqsave+0x11/0x80
  RSP: 0018:ffff880006a27c50 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
  RAX: ffff880009ac68d0 RBX: ffff880006a27ce0 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff880006a27ce0 RDI: ffff880009ac6900
  RBP: ffff880006a27c60 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000001 R11: 000000000063ff00 R12: ffff880009ac6900
  R13: ffff880006a27cf8 R14: 0000000000000001 R15: ffff880006a27cf8
  FS:  00007f4be4838700(0000) GS:ffff88000cc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000020101000 CR3: 0000000009616000 CR4: 00000000000006f0
  Call Trace:
   prepare_to_wait+0x26/0xc0
   sock_alloc_send_pskb+0x14e/0x270
   ? remove_wait_queue+0x60/0x60
   tun_get_user+0x2cc/0x19d0
   ? __tun_get+0x60/0x1b0
   tun_chr_write_iter+0x57/0x86
   __vfs_write+0x156/0x1e0
   vfs_write+0xf7/0x230
   SyS_write+0x57/0xd0
   entry_SYSCALL_64_fastpath+0x1f/0xbe
  RIP: 0033:0x7f4be4356df9
  RSP: 002b:00007ffc18101c08 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4be4356df9
  RDX: 0000000000000046 RSI: 0000000020101000 RDI: 0000000000000005
  RBP: 00007ffc18101c40 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000001 R11: 0000000000000293 R12: 0000559c75f64780
  R13: 00007ffc18101d30 R14: 0000000000000000 R15: 0000000000000000

Fixes: 33dccbb050 ("tun: Limit amount of queued packets per device")
Fixes: 20d29d7a91 ("net: macvtap driver")
Signed-off-by: Craig Gallek <kraig@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 20:52:36 +09:00
David S. Miller e1ea2f9856 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts here.

NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.

Parallel additions to net/pkt_cls.h and net/sch_generic.h

A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.

The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30 21:09:24 +09:00
Jason Wang 63b9ab65bd tuntap: properly align skb->head before building skb
An unaligned alloc_frag->offset caused by previous allocation will
result an unaligned skb->head. This will lead unaligned
skb_shared_info and then unaligned dataref which requires to be
aligned for accessing on some architecture. Fix this by aligning
alloc_frag->offset before the frag refilling.

Fixes: 0bbd7dad34 ("tun: make tun_build_skb() thread safe")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Wei Wei <dotweiba@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Wei Wei <dotweiba@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:05:28 +09:00
Julien Gomes 5c25f65fd1 tun: allow positive return values on dev_get_valid_name() call
If the name argument of dev_get_valid_name() contains "%d", it will try
to assign it a unit number in __dev__alloc_name() and return either the
unit number (>= 0) or an error code (< 0).
Considering positive values as error values prevent tun device creations
relying this mechanism, therefor we should only consider negative values
as errors here.

Signed-off-by: Julien Gomes <julien@arista.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 10:10:03 +09:00
Mark Rutland 6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
David S. Miller f8ddadc4db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
There were quite a few overlapping sets of changes here.

Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.

Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly.  If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.

In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().

Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.

The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 13:39:14 +01:00
Eric Dumazet ee74d9967b tun: do not arm flow_gc_timer in tun_flow_init()
Timer is properly armed on demand from tun_flow_update(),
so there is no need to arm it at tun init.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 03:13:19 +01:00
Eric Dumazet 81d98fa4df tun: avoid extra timer schedule in tun_flow_cleanup()
If tun_flow_cleanup() deleted all flows, no need to
arm the timer again. It will be armed next time
tun_flow_update() is called.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 03:13:19 +01:00
Eric Dumazet 7dbfb4ef77 tun: do not block BH again in tun_flow_cleanup()
tun_flow_cleanup() being a timer callback, it is already
running in BH context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 03:13:19 +01:00
Eric Dumazet aec72f3392 net-tun: fix panics at dismantle time
syzkaller got crashes at dismantle time [1]

It is not correct to test (tun->flags & IFF_NAPI) in tun_napi_disable()
and tun_napi_del() : Each tun_file can have different mode, depending
on how they were created.

Similarly I have changed tun_get_user() and tun_poll_controller()
to use the new tfile->napi_enabled boolean.

[  154.331360] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  154.339220] IP: [<ffffffff9634cad6>] hrtimer_active+0x26/0x60
[  154.344983] PGD 0
[  154.347009] Oops: 0000 [#1] SMP
[  154.350680] gsmi: Log Shutdown Reason 0x03
[  154.379572] task: ffff994719150dc0 ti: ffff99475c0ae000 task.ti: ffff99475c0ae000
[  154.387043] RIP: 0010:[<ffffffff9634cad6>]  [<ffffffff9634cad6>] hrtimer_active+0x26/0x60
[  154.395232] RSP: 0018:ffff99475c0afce8  EFLAGS: 00010246
[  154.400542] RAX: ffff994754850ac0 RBX: ffff994753e65408 RCX: ffff994753e65388
[  154.407666] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff994753e65408
[  154.414790] RBP: ffff99475c0afce8 R08: 0000000000000000 R09: 0000000000000000
[  154.421921] R10: ffff99475f6f5910 R11: 0000000000000001 R12: 0000000000000000
[  154.429044] R13: ffff99417deab668 R14: ffff99417deaa780 R15: ffff99475f45dde0
[  154.436174] FS:  0000000000000000(0000) GS:ffff994767a00000(0000) knlGS:0000000000000000
[  154.444249] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  154.449986] CR2: 0000000000000000 CR3: 00000005a8a0e000 CR4: 0000000000022670
[  154.457110] Stack:
[  154.459120]  ffff99475c0afd28 ffffffff9634d614 1000000000000000 0000000000000000
[  154.466598]  ffffe54240000000 ffff994753e65408 ffff994753e653a8 ffff99417deab668
[  154.474067]  ffff99475c0afd48 ffffffff9634d6fd ffff99474c2be678 ffff994753e65398
[  154.481537] Call Trace:
[  154.483985]  [<ffffffff9634d614>] hrtimer_try_to_cancel+0x24/0xf0
[  154.490074]  [<ffffffff9634d6fd>] hrtimer_cancel+0x1d/0x30
[  154.495563]  [<ffffffff96860b3c>] napi_disable+0x3c/0x70
[  154.500875]  [<ffffffff9678ae62>] __tun_detach+0xd2/0x360
[  154.506272]  [<ffffffff9678b117>] tun_chr_close+0x27/0x40
[  154.511669]  [<ffffffff9646ebe6>] __fput+0xd6/0x1e0
[  154.516548]  [<ffffffff9646ed3e>] ____fput+0xe/0x10
[  154.521429]  [<ffffffff963035a2>] task_work_run+0x72/0x90
[  154.526827]  [<ffffffff962e9407>] do_exit+0x317/0xb60
[  154.531879]  [<ffffffff962e9c8f>] do_group_exit+0x3f/0xa0
[  154.537275]  [<ffffffff962e9d07>] SyS_exit_group+0x17/0x20
[  154.542769]  [<ffffffff969784be>] entry_SYSCALL_64_fastpath+0x12/0x17

Fixes: 943170998b ("net-tun: enable NAPI for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:31:26 +01:00
Eric Dumazet 010f245b9d tun: relax check on eth_get_headlen() return value
syzkaller hit the WARN() in tun_get_user(), providing skb
with payload in fragments only, and nothing in skb->head

GRO layer is fine with this, so relax the check.

Fixes: 90e33d4594 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:17:39 +01:00
Cong Wang 0ad646c81b tun: call dev_get_valid_name() before register_netdevice()
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.

We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().

Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.

Fixes: 96442e4242 ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:02:54 +01:00
David S. Miller 53954cf8c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 18:19:22 -07:00
Alexander Potapenko 2580c4c17a tun: bail out from tun_get_user() if the skb is empty
KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
skb->data[0] in the case the skb is empty (i.e. skb->len is 0):

================================================
BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
...
 __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
 tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
 tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
 call_write_iter ./include/linux/fs.h:1743
 new_sync_write fs/read_write.c:457
 __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
 vfs_write+0x3e4/0x770 fs/read_write.c:518
 SYSC_write+0x12f/0x2b0 fs/read_write.c:565
 SyS_write+0x55/0x80 fs/read_write.c:557
 do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
...
origin:
...
 kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
 slab_alloc_node mm/slub.c:2732
 __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
 __kmalloc_reserve net/core/skbuff.c:138
 __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
 alloc_skb ./include/linux/skbuff.h:903
 alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
 sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
 tun_alloc_skb drivers/net/tun.c:1144
 tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
 tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
 call_write_iter ./include/linux/fs.h:1743
 new_sync_write fs/read_write.c:457
 __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
 vfs_write+0x3e4/0x770 fs/read_write.c:518
 SYSC_write+0x12f/0x2b0 fs/read_write.c:565
 SyS_write+0x55/0x80 fs/read_write.c:557
 do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
================================================

Make sure tun_get_user() doesn't touch skb->data[0] unless there is
actual data.

C reproducer below:
==========================
    // autogenerated by syzkaller (http://github.com/google/syzkaller)

    #define _GNU_SOURCE

    #include <fcntl.h>
    #include <linux/if_tun.h>
    #include <netinet/ip.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>

    int main()
    {
      int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
      int tun_fd = open("/dev/net/tun", O_RDWR);
      struct ifreq req;
      memset(&req, 0, sizeof(struct ifreq));
      strcpy((char*)&req.ifr_name, "gre0");
      req.ifr_flags = IFF_UP | IFF_MULTICAST;
      ioctl(tun_fd, TUNSETIFF, &req);
      ioctl(sock, SIOCSIFFLAGS, "gre0");
      write(tun_fd, "hi", 0);
      return 0;
    }
==========================

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 08:37:28 -07:00
Daniel Borkmann de8f3a83b0 bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
yuan linyu 9484dc74fc tun: delete original tun_get() and rename __tun_get() to tun_get()
it seems no need to keep tun_get() and __tun_get() at same time.

Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 20:30:08 -07:00
Petar Penkov 90e33d4594 tun: enable napi_gro_frags() for TUN/TAP driver
Add a TUN/TAP receive mode that exercises the napi_gro_frags()
interface. This mode is available only in TAP mode, as the interface
expects packets with Ethernet headers.

Furthermore, packets follow the layout of the iovec_iter that was
received. The first iovec is the linear data, and every one after the
first is a fragment. If there are more fragments than the max number,
drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
dissector code and to verify that the header resides in the linear data.

The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
This is imposed because this mode is intended for testing via tools like
syzkaller and packetdrill, and the increased flexibility it provides can
introduce security vulnerabilities. This flag is accepted only if the
device is in TAP mode and has the IFF_NAPI flag set as well. This is
done because both of these are explicit requirements for correct
operation in this mode.

Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: davem@davemloft.net
Cc: ppenkov@stanford.edu
Acked-by: Mahesh Bandewar <maheshb@google,com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 20:16:13 -07:00
Petar Penkov 943170998b tun: enable NAPI for TUN/TAP driver
Changes TUN driver to use napi_gro_receive() upon receiving packets
rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
changes and operation is not affected if the flag is disabled.  SKBs
are constructed upon packet arrival and are queued to be processed
later.

The new path was evaluated with a benchmark with the following setup:
Open two tap devices and a receiver thread that reads in a loop for
each device. Start one sender thread and pin all threads to different
CPUs. Send 1M minimum UDP packets to each device and measure sending
time for each of the sending methods:
	napi_gro_receive():	4.90s
	netif_rx_ni():		4.90s
	netif_receive_skb():	7.20s

Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: davem@davemloft.net
Cc: ppenkov@stanford.edu
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 20:16:13 -07:00