Commit Graph

823 Commits

Author SHA1 Message Date
Willem de Bruijn 2dfd184abd flow_dissector: fix build failure without CONFIG_NET
If boolean CONFIG_BPF_SYSCALL is enabled, kernel/bpf/syscall.c will
call flow_dissector functions from net/core/flow_dissector.c.

This causes this build failure if CONFIG_NET is disabled:

    kernel/bpf/syscall.o: In function `__x64_sys_bpf':
    syscall.c:(.text+0x3278): undefined reference to
    `skb_flow_dissector_bpf_prog_attach'
    syscall.c:(.text+0x3310): undefined reference to
    `skb_flow_dissector_bpf_prog_detach'
    kernel/bpf/syscall.o:(.rodata+0x3f0): undefined reference to
    `flow_dissector_prog_ops'
    kernel/bpf/verifier.o:(.rodata+0x250): undefined reference to
    `flow_dissector_verifier_ops'

Analogous to other optional BPF program types in syscall.c, add stubs
if the relevant functions are not compiled and move the BPF_PROG_TYPE
definition in the #ifdef CONFIG_NET block.

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-19 23:46:44 +02:00
Petar Penkov d58e468b11 flow_dissector: implements flow dissector BPF hook
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-14 12:04:33 -07:00
David S. Miller 992cba7e27 net: Add and use skb_list_del_init().
It documents what is happening, and eliminates the spurious list
pointer poisoning.

In the long term, in order to get proper list head debugging, we
might want to use the list poison value as the indicator that
an SKB is a singleton and not on a list.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:54 -07:00
David S. Miller a8305bff68 net: Add and use skb_mark_not_on_list().
An SKB is not on a list if skb->next is NULL.

Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:54 -07:00
David S. Miller 8b69bd7d8a ppp: Remove direct skb_queue_head list pointer access.
Add a helper, __skb_peek(), and use it in ppp_mp_reconstruct().

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:53 -07:00
David S. Miller c1617fb4c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-08-13

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add driver XDP support for veth. This can be used in conjunction with
   redirect of another XDP program e.g. sitting on NIC so the xdp_frame
   can be forwarded to the peer veth directly without modification,
   from Toshiaki.

2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
   in order to provide more control and visibility on where a SO_REUSEPORT
   sk should be located, and the latter enables to directly select a sk
   from the bpf map. This also enables map-in-map for application migration
   use cases, from Martin.

3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
   of cgroup v2 that is the ancestor of the cgroup associated with the
   skb at the ancestor_level, from Andrey.

4) Implement BPF fs map pretty-print support based on BTF data for regular
   hash table and LRU map, from Yonghong.

5) Decouple the ability to attach BTF for a map from the key and value
   pretty-printer in BPF fs, and enable further support of BTF for maps for
   percpu and LPM trie, from Daniel.

6) Implement a better BPF sample of using XDP's CPU redirect feature for
   load balancing SKB processing to remote CPU. The sample implements the
   same XDP load balancing as Suricata does which is symmetric hash based
   on IP and L4 protocol, from Jesper.

7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
   critical path as it is ensured that the allocator is present, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 10:07:23 -07:00
Toshiaki Makita b0768a8658 net: Export skb_headers_offset_update
This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:20 +02:00
YueHaibing 15693fd37f net: skbuff.h: fix using plain integer as NULL warning
Fixes the following sparse warning:
./include/linux/skbuff.h:2365:58: warning: Using plain integer as NULL pointer

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 14:10:01 -07:00
Peter Oskolkov fa0f527358 ip: use rb trees for IP frag queue.
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
        282078937 total packets received
        0 forwarded
        0 incoming packets discarded
        946760 incoming packets delivered
        18743456 requests sent out
        101 fragments dropped after timeout
        282077129 reassemblies required
        944952 packets reassembled ok
        262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).

Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:16:46 -07:00
Peter Oskolkov 385114dec8 net: modify skb_rbtree_purge to return the truesize of all purged skbs.
Tested: see the next patch is the series.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:16:46 -07:00
David S. Miller c4c5551df1 Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux
All conflicts were trivial overlapping changes, so reasonably
easy to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 21:17:12 -07:00
Stefano Brivio a48d189ef5 net: Move skb decrypted field, avoid explicity copy
Commit 784abe24c9 ("net: Add decrypted field to skb")
introduced a 'decrypted' field that is explicitly copied on skb
copy and clone.

Move it between headers_start[0] and headers_end[0], so that we
don't need to copy it explicitly as it's copied by the memcpy()
in __copy_skb_header().

While at it, drop the assignment in __skb_clone(), it was
already redundant.

This doesn't change the size of sk_buff or cacheline boundaries.

The 15-bits hole before tc_index becomes a 14-bits hole, and
will be again a 15-bits hole when this change is merged with
commit 8b7008620b ("net: Don't copy pfmemalloc flag in
__copy_skb_header()").

v2: as reported by kbuild test robot (oops, I forgot to build
    with CONFIG_TLS_DEVICE it seems), we can't use
    CHECK_SKB_FIELD() on a bit-field member. Just drop the
    check for the moment being, perhaps we could think of some
    magic to also check bit-field members one day.

Fixes: 784abe24c9 ("net: Add decrypted field to skb")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18 13:42:08 -07:00
Boris Pismenny 784abe24c9 net: Add decrypted field to skb
The decrypted bit is propogated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
Stefano Brivio 8b7008620b net: Don't copy pfmemalloc flag in __copy_skb_header()
The pfmemalloc flag indicates that the skb was allocated from
the PFMEMALLOC reserves, and the flag is currently copied on skb
copy and clone.

However, an skb copied from an skb flagged with pfmemalloc
wasn't necessarily allocated from PFMEMALLOC reserves, and on
the other hand an skb allocated that way might be copied from an
skb that wasn't.

So we should not copy the flag on skb copy, and rather decide
whether to allow an skb to be associated with sockets unrelated
to page reclaim depending only on how it was allocated.

Move the pfmemalloc flag before headers_start[0] using an
existing 1-bit hole, so that __copy_skb_header() doesn't copy
it.

When cloning, we'll now take care of this flag explicitly,
contravening to the warning comment of __skb_clone().

While at it, restore the newline usage introduced by commit
b193722731 ("net: reorganize sk_buff for faster
__copy_skb_header()") to visually separate bytes used in
bitfields after headers_start[0], that was gone after commit
a9e419dc7b ("netfilter: merge ctinfo into nfct pointer storage
area"), and describe the pfmemalloc flag in the kernel-doc
structure comment.

This doesn't change the size of sk_buff or cacheline boundaries,
but consolidates the 15 bits hole before tc_index into a 2 bytes
hole before csum, that could now be filled more easily.

Reported-by: Patrick Talbert <ptalbert@redhat.com>
Fixes: c93bdd0e03 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 15:15:16 -07:00
David S. Miller 5cd3da4ba2 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Simple overlapping changes in stmmac driver.

Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-03 10:29:26 +09:00
Linus Torvalds a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
David Miller d4546c2509 net: Convert GRO SKB handling to list_head.
Manage pending per-NAPI GRO packets via list_head.

Return an SKB pointer from the GRO receive handlers.  When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.

Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 11:33:04 +09:00
Linus Torvalds 1c8c5a9d38 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add Maglev hashing scheduler to IPVS, from Inju Song.

 2) Lots of new TC subsystem tests from Roman Mashak.

 3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

 5) Add ttl inherit support to vxlan, from Hangbin Liu.

 6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

 8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

11) Plumb extack down into fib_rules, from Roopa Prabhu.

12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

13) Add UDP GSO support, from Willem de Bruijn.

14) Add documentation for eBPF helpers, from Quentin Monnet.

15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

24) Add TCP SACK compression, from Eric Dumazet.

25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

28) Add lots of forwarding selftests, from Petr Machata.

29) Add generic network device failover driver, from Sridhar Samudrala.

* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
  strparser: Add __strp_unpause and use it in ktls.
  rxrpc: Fix terminal retransmission connection ID to include the channel
  net: hns3: Optimize PF CMDQ interrupt switching process
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Fix for VF mailbox cannot receiving PF response
  bnx2x: use the right constant
  Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
  net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
  enic: fix UDP rss bits
  netdev-FAQ: clarify DaveM's position for stable backports
  rtnetlink: validate attributes in do_setlink()
  mlxsw: Add extack messages for port_{un, }split failures
  netdevsim: Add extack error message for devlink reload
  devlink: Add extack to reload and port_{un, }split operations
  net: metrics: add proper netlink validation
  ipmr: fix error path when ipmr_new_table fails
  ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
  net: hns3: remove unused hclgevf_cfg_func_mta_filter
  netfilter: provide udp*_lib_lookup for nf_tproxy
  qed*: Utilize FW 8.37.2.0
  ...
2018-06-06 18:39:49 -07:00
Randy Dunlap 1f4c741321 net: skbuff.h: drop unneeded <linux/slab.h>
<linux/skbuff.h> does not use nor need <linux/slab.h>, so drop this
header file from skbuff.h.

<linux/skbuff.h> is currently #included in around 1200 C source and
header files, making it the 31st most-used header file.

Build tested [allmodconfig] on 20 arch-es.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 17:02:06 -04:00
Christoph Hellwig db5051ead6 net: convert datagram_poll users tp ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-26 09:16:44 +02:00
Paolo Abeni 72a338bcc6 net: core: rework basic flow dissection helper
When the core networking needs to detect the transport offset in a given
packet and parse it explicitly, a full-blown flow_keys struct is used for
storage.
This patch introduces a smaller keys store, rework the basic flow dissect
helper to use it, and apply this new helper where possible - namely in
skb_probe_transport_header(). The used flow dissector data structures
are renamed to match more closely the new role.

The above gives ~50% performance improvement in micro benchmarking around
skb_probe_transport_header() and ~30% around eth_get_headlen(), mostly due
to the smaller memset. Small, but measurable improvement is measured also
in macro benchmarking.

v1 -> v2: use the new helper in eth_get_headlen() and skb_get_poff(),
  as per DaveM suggestion

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08 00:02:36 -04:00
Ilya Lesokhin 08303c1895 net: Rename and export copy_skb_header
copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 09:42:46 -04:00
Willem de Bruijn ee80d1ebe5 udp: add udp gso
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.

UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.

The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).

IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.

[ Export __udp_gso_segment for ipv6. -DaveM ]

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:07:42 -04:00
Eric Dumazet 88078d98d1 net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:44:11 -04:00
Eric Dumazet bf66337140 inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c ("fragment: add fast path for in-order fragments")
this change wont help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:40 -04:00
David S. Miller 03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
David S. Miller cfda06d736 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-03-08

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix various BPF helpers which adjust the skb and its GSO information
   with regards to SCTP GSO. The latter is a special case where gso_size
   is of value GSO_BY_FRAGS, so mangling that will end up corrupting
   the skb, thus bail out when seeing SCTP GSO packets, from Daniel(s).

2) Fix a compilation error in bpftool where BPF_FS_MAGIC is not defined
   due to too old kernel headers in the system, from Jiri.

3) Increase the number of x64 JIT passes in order to allow larger images
   to converge instead of punting them to interpreter or having them
   rejected when the interpreter is not built into the kernel, from Daniel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 20:27:51 -05:00
David S. Miller 0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Daniel Axtens a4a77718ee net: make skb_gso_*_seglen functions private
They're very hard to use properly as they do not consider the
GSO_BY_FRAGS case. Code should use skb_gso_validate_network_len
and skb_gso_validate_mac_len as they do consider this case.

Make the seglen functions static, which stops people using them
outside of skbuff.c

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Daniel Axtens 779b7931b2 net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?

skb_gso_validate_mtu gives you the answer to that question. However,
we recently added to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Daniel Axtens d02f51cbcf bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat to deal with gso sctp skbs
SCTP GSO skbs have a gso_size of GSO_BY_FRAGS, so any sort of
unconditionally mangling of that will result in nonsense value
and would corrupt the skb later on.

Therefore, i) add two helpers skb_increase_gso_size() and
skb_decrease_gso_size() that would throw a one time warning and
bail out for such skbs and ii) refuse and return early with an
error in those BPF helpers that are affected. We do need to bail
out as early as possible from there before any changes on the
skb have been performed.

Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Co-authored-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-03 13:01:11 -08:00
David S. Miller f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Sowmini Varadhan 6f89dbce8e skbuff: export mm_[un]account_pinned_pages for other modules
RDS would like to use the helper functions for managing pinned pages
added by Commit a91dbff551 ("sock: ulimit on MSG_ZEROCOPY pages")

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:04:16 -05:00
David S. Miller da27988766 skbuff: Fix comment mis-spelling.
'peform' --> 'perform'

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:52:42 -05:00
Daniel Axtens 2b16f04872 net: create skb_gso_validate_mac_len()
If you take a GSO skb, and split it into packets, will the MAC
length (L2 + L3 + L4 headers + payload) of those packets be small
enough to fit within a given length?

Move skb_gso_mac_seglen() to skbuff.h with other related functions
like skb_gso_network_seglen() so we can use it, and then create
skb_gso_validate_mac_len to do the full calculation.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01 09:36:03 -05:00
Linus Torvalds b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Linus Torvalds 168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Simon Horman 62b32379fd flow_dissector: dissect tunnel info outside __skb_flow_dissect()
Move dissection of tunnel info to outside of the main flow dissection
function, __skb_flow_dissect(). The sole user of this feature, the flower
classifier, is updated to call tunnel info dissection directly, using
skb_flow_dissect_tunnel_info().

This results in a slightly less complex implementation of
__skb_flow_dissect(), in particular removing logic from that call path
which is not used by the majority of users. The expense of this is borne by
the flower classifier which now has to make an extra call for tunnel info
dissection.

This patch should not result in any behavioural change.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 12:09:18 -05:00
Geert Uytterhoeven f8821f96ae skbuff: Grammar s/are can/can/, s/change/changes/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-30 09:15:53 -05:00
Al Viro ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Willem de Bruijn 0c19f846d5 net: accept UFO datagrams from tuntap and packet
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
  Booted a v4.13 guest kernel with QEMU. On a host kernel before this
  patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
  enabled, same as on a v4.13 host kernel.

  A UFO packet sent from the guest appears on the tap device:
    host:
      nc -l -p -u 8000 &
      tcpdump -n -i tap0

    guest:
      dd if=/dev/zero of=payload.txt bs=1 count=2000
      nc -u 192.16.1.1 8000 < payload.txt

  Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
  packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
  v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 01:37:35 +09:00
Linus Torvalds 7c225c69f8 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2 updates

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
  memory hotplug: fix comments when adding section
  mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
  mm: simplify nodemask printing
  mm,oom_reaper: remove pointless kthread_run() error check
  mm/page_ext.c: check if page_ext is not prepared
  writeback: remove unused function parameter
  mm: do not rely on preempt_count in print_vma_addr
  mm, sparse: do not swamp log with huge vmemmap allocation failures
  mm/hmm: remove redundant variable align_end
  mm/list_lru.c: mark expected switch fall-through
  mm/shmem.c: mark expected switch fall-through
  mm/page_alloc.c: broken deferred calculation
  mm: don't warn about allocations which stall for too long
  fs: fuse: account fuse_inode slab memory as reclaimable
  mm, page_alloc: fix potential false positive in __zone_watermark_ok
  mm: mlock: remove lru_add_drain_all()
  mm, sysctl: make NUMA stats configurable
  shmem: convert shmem_init_inodecache() to void
  Unify migrate_pages and move_pages access checks
  mm, pagevec: rename pagevec drained field
  ...
2017-11-15 19:42:40 -08:00
Mel Gorman 453f85d43f mm: remove __GFP_COLD
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact.  Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache.  Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Levin, Alexander (Sasha Levin) 4950276672 kmemcheck: remove annotations
Patch series "kmemcheck: kill kmemcheck", v2.

As discussed at LSF/MM, kill kmemcheck.

KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow).  KASan is already upstream.

We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).

The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.

Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.

This patch (of 4):

Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
  Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Mat Martineau 39b1752110 net: Remove unused skb_shared_info member
ip6_frag_id was only used by UFO, which has been removed.
ipv6_proxy_select_ident() only existed to set ip6_frag_id and has no
in-tree callers.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 22:09:40 +09:00
David S. Miller 4dc6758d78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Simple cases of overlapping changes in the packet scheduler.

Must easier to resolve this time.

Which probably means that I screwed it up somehow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 10:00:18 +09:00
Ye Yin 2b5ec1a5f9 netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
When run ipvs in two different network namespace at the same host, and one
ipvs transport network traffic to the other network namespace ipvs.
'ipvs_property' flag will make the second ipvs take no effect. So we should
clear 'ipvs_property' when SKB network namespace changed.

Fixes: 621e84d6f3 ("dev: introduce skb_scrub_packet()")
Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Wei Zhou <chouryzhou@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 22:37:42 +09:00
Eric Dumazet 18a4c0eab2 net: add rb_to_skb() and other rb tree helpers
Geeralize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
Eric Dumazet e2080072ed tcp: new list for sent but unacked skbs for RACK recovery
This patch adds a new queue (list) that tracks the sent but not yet
acked or SACKed skbs for a TCP connection. The list is chronologically
ordered by skb->skb_mstamp (the head is the oldest sent skb).

This list will be used to optimize TCP Rack recovery, which checks
an skb's timestamp to judge if it has been lost and needs to be
retransmitted. Since TCP write queue is ordered by sequence instead
of sent time, RACK has to scan over the write queue to catch all
eligible packets to detect lost retransmission, and iterates through
SACKed skbs repeatedly.

Special cares for rare events:
1. TCP repair fakes skb transmission so the send queue needs adjusted
2. SACK reneging would require re-inserting SACKed skbs into the
   send queue. For now I believe it's not worth the complexity to
   make RACK work perfectly on SACK reneging, so we do nothing here.
3. Fast Open: currently for non-TFO, send-queue correctly queues
   the pure SYN packet. For TFO which queues a pure SYN and
   then a data packet, send-queue only queues the data packet but
   not the pure SYN due to the structure of TFO code. This is okay
   because the SYN receiver would never respond with a SACK on a
   missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).

In order to not grow sk_buff, we use an union for the new list and
_skb_refdst/destructor fields. This is a bit complicated because
we need to make sure _skb_refdst and destructor are properly zeroed
before skb is cloned/copied at transmit, and before being freed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 21:24:47 -07:00
Yotam Gigi abf4bb6b63 skbuff: Add the offload_mr_fwd_mark field
Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
used to allow partial offloading of MFC multicast routes.

Switchdev drivers can offload MFC multicast routes to the hardware by
registering to the FIB notification chain. When one of the route output
interfaces is not offload-able, i.e. has different parent ID, the route
cannot be fully offloaded by the hardware. Examples to non-offload-able
devices are a management NIC, dummy device, pimreg device, etc.

Similar problem exists in the bridge module, as one bridge can hold
interfaces with different parent IDs. At the bridge, the problem is solved
by the offload_fwd_mark skb field.

Currently, when a route cannot go through full offload, the only solution
for a switchdev driver is not to offload it at all and let the packet go
through slow path.

Using the offload_mr_fwd_mark field, a driver can indicate that a packet
was already forwarded by hardware to all the devices with the same parent
ID as the input device. Further patches in this patch-set are going to
enhance ipmr to skip multicast forwarding to devices with the same parent
ID if a packets is marked with that field.

The reason why the already existing "offload_fwd_mark" bit cannot be used
is that a switchdev driver would want to make the distinction between a
packet that has already gone through L2 forwarding but did not go through
multicast forwarding, and a packet that has already gone through both L2
and multicast forwarding.

For example: when a packet is ingressing from a switchport enslaved to a
bridge, which is configured with multicast forwarding, the following
scenarios are possible:
 - The packet can be trapped to the CPU due to exception while multicast
   forwarding (for example, MTU error). In that case, it had already gone
   through L2 forwarding in the hardware, thus A switchdev driver would
   want to set the skb->offload_fwd_mark and not the
   skb->offload_mr_fwd_mark.
 - The packet can also be trapped due to a pimreg/dummy device used as one
   of the output interfaces. In that case, it can go through both L2 and
   (partial) multicast forwarding inside the hardware, thus a switchdev
   driver would want to set both the skb->offload_fwd_mark and
   skb->offload_mr_fwd_mark.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellaox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03 10:06:30 -07:00
Daniel Borkmann de8f3a83b0 bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
Gao Feng 242c1a28eb net: Remove useless function skb_header_release
There is no one which would invokes the function skb_header_release.
So just remove it now.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-22 20:43:13 -07:00
Eric Dumazet bffa72cf7f net: sk_buff rbnode reorg
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19 15:20:22 -07:00
Paolo Abeni ca2c1418ef udp: drop head states only when all skb references are gone
After commit 0ddf3fb2c4 ("udp: preserve skb->dst if required
for IP options processing") we clear the skb head state as soon
as the skb carrying them is first processed.

Since the same skb can be processed several times when MSG_PEEK
is used, we can end up lacking the required head states, and
eventually oopsing.

Fix this clearing the skb head state only when processing the
last skb reference.

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 0ddf3fb2c4 ("udp: preserve skb->dst if required for IP options processing")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-07 20:02:39 -07:00
Eric Dumazet c1d1b43781 net: convert (struct ubuf_info)->refcnt to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

v2: added the change in drivers/vhost/net.c as spotted
by Willem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 20:22:03 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Florian Fainelli cd0a137acb net: core: Specify skb_pad()/skb_put_padto() SKB freeing
Rename skb_pad() into __skb_pad() and make it take a third argument:
free_on_error which controls whether kfree_skb() should be called or
not, skb_pad() directly makes use of it and passes true to preserve its
existing behavior. Do exactly the same thing with __skb_put_padto() and
skb_put_padto().

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 20:33:49 -07:00
Willem de Bruijn 0a4a060bb2 sock: fix zerocopy_success regression with msg_zerocopy
Do not use uarg->zerocopy outside msg_zerocopy. In other paths the
field is not explicitly initialized and aliases another field.

Those paths have only one reference so do not need this intermediate
variable. Call uarg->callback directly.

Fixes: 1f8b977ab3 ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:49:17 -07:00
Willem de Bruijn a91dbff551 sock: ulimit on MSG_ZEROCOPY pages
Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn 4ab6c99d99 sock: MSG_ZEROCOPY notification coalescing
In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause send() calls to append new data to an
existing sk_buff and, thus, ubuf_info. In that case the notification
must hold a range. odify ubuf_info to store a inclusive range [N..N+m]
and add skb_zerocopy_realloc() to optionally extend an existing range.

Also coalesce notifications in this common case: if a notification
[1, 1] is about to be queued while [0, 0] is the queue tail, just modify
the head of the queue to read [0, 1].

Coalescing is limited to a few TSO frames worth of data to bound
notification latency.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn 1f8b977ab3 sock: enable MSG_ZEROCOPY
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn 52267790ef sock: add MSG_ZEROCOPY
The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
Willem de Bruijn 3ece782693 sock: skb_copy_ubufs support for compound pages
Refine skb_copy_ubufs to support compound pages. With upcoming TCP
zerocopy sendmsg, such fragments may appear.

The existing code replaces each page one for one. Splitting each
compound page into an independent number of regular pages can result
in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.

Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages:
1. compute bytelength and minimum number of pages to store this.
2. allocate
3. copy, filling each page except the last to PAGE_SIZE bytes
4. update skb frag array

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
WANG Cong b2f9d432de flow_dissector: remove unused functions
They are introduced by commit f70ea018da
("net: Add functions to get skb->hash based on flow structures")
but never gets used in tree.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:50:03 -07:00
Willem de Bruijn c613c209c3 net: add skb_frag_foreach_page and use with kmap_atomic
Skb frags may contain compound pages. Various operations map frags
temporarily using kmap_atomic, but this function works on single
pages, not whole compound pages. The distinction is only relevant
for high mem pages that require temporary mappings.

Introduce a looping mechanism that for compound highmem pages maps
one page at a time, does not change behavior on other pages.
Use the loop in the kmap_atomic callers in net/core/skbuff.c.

Verified by triggering skb_copy_bits with

    tcpdump -n -c 100 -i ${DEV} -w /dev/null &
    netperf -t TCP_STREAM -H ${HOST}

  and by triggering __skb_checksum with

    ethtool -K ${DEV} tx off

  repeated the tests with looping on a non-highmem platform
  (x86_64) by making skb_frag_must_loop always return true.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 16:07:10 -07:00
Tom Herbert 20bf50de30 skbuff: Function to send an skbuf on a socket
Add skb_send_sock to send an skbuff on a socket within the kernel.
Arguments include an offset so that an skbuf might be sent in mulitple
calls (e.g. send buffer limit is hit).

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:26:18 -07:00
Florian Westphal 06dc75ab06 net: Revert "net: add function to allocate sk_buff head without data area"
It was added for netlink mmap tx, there are no callers in the tree.
The commit also added a check for skb->head != NULL in kfree_skb path,
remove that too -- all skbs ought to have skb->head set.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 10:34:21 -07:00
David S. Miller d9d30adf56 net: Kill NETIF_F_UFO and SKB_GSO_UDP.
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 09:52:58 -07:00
Linus Torvalds 5518b69b76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
  merge window:

   1) Several optimizations for UDP processing under high load from
      Paolo Abeni.

   2) Support pacing internally in TCP when using the sch_fq packet
      scheduler for this is not practical. From Eric Dumazet.

   3) Support mutliple filter chains per qdisc, from Jiri Pirko.

   4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

   5) Add batch dequeueing to vhost_net, from Jason Wang.

   6) Flesh out more completely SCTP checksum offload support, from
      Davide Caratti.

   7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
      Neira Ayuso, and Matthias Schiffer.

   8) Add devlink support to nfp driver, from Simon Horman.

   9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
      Prabhu.

  10) Add stack depth tracking to BPF verifier and use this information
      in the various eBPF JITs. From Alexei Starovoitov.

  11) Support XDP on qed device VFs, from Yuval Mintz.

  12) Introduce BPF PROG ID for better introspection of installed BPF
      programs. From Martin KaFai Lau.

  13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

  14) For loads, allow narrower accesses in bpf verifier checking, from
      Yonghong Song.

  15) Support MIPS in the BPF selftests and samples infrastructure, the
      MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
      Daney.

  16) Support kernel based TLS, from Dave Watson and others.

  17) Remove completely DST garbage collection, from Wei Wang.

  18) Allow installing TCP MD5 rules using prefixes, from Ivan
      Delalande.

  19) Add XDP support to Intel i40e driver, from Björn Töpel

  20) Add support for TC flower offload in nfp driver, from Simon
      Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
      Kicinski, and Bert van Leeuwen.

  21) IPSEC offloading support in mlx5, from Ilan Tayari.

  22) Add HW PTP support to macb driver, from Rafal Ozieblo.

  23) Networking refcount_t conversions, From Elena Reshetova.

  24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
      for tuning the TCP sockopt settings of a group of applications,
      currently via CGROUPs"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
  net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
  dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
  cxgb4: Support for get_ts_info ethtool method
  cxgb4: Add PTP Hardware Clock (PHC) support
  cxgb4: time stamping interface for PTP
  nfp: default to chained metadata prepend format
  nfp: remove legacy MAC address lookup
  nfp: improve order of interfaces in breakout mode
  net: macb: remove extraneous return when MACB_EXT_DESC is defined
  bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
  bpf: fix return in load_bpf_file
  mpls: fix rtm policy in mpls_getroute
  net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
  net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
  ...
2017-07-05 12:31:59 -07:00
Daniel Borkmann 0daf434940 bpf, net: add skb_mac_header_len helper
Add a small skb_mac_header_len() helper similarly as the
skb_network_header_len() we have and replace open coded
places in BPF's bpf_skb_change_proto() helper. Will also
be used in upcoming work.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:51 -07:00
Reshetova, Elena 2638595afc net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena 633547973f net: convert sk_buff.users from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
Reshetova, Elena 53869cebce net: convert nf_bridge_info.use from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
yuan linyu de77b966ce net: introduce __skb_put_[zero, data, u8]
follow Johannes Berg, semantic patch file as below,
@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = __skb_put(skb, len);
+p = __skb_put_zero(skb, len);
|
-p = (t)__skb_put(skb, len);
+p = __skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)

@@
identifier p;
expression len;
expression skb;
type t;
@@
(
-t p = __skb_put(skb, len);
+t p = __skb_put_zero(skb, len);
)
... when != p
(
-memset(p, 0, len);
)

@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = __skb_put(skb, sizeof(t));
+p = __skb_put_zero(skb, sizeof(t));
|
-p = (t *)__skb_put(skb, sizeof(t));
+p = __skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)

@@
expression skb, len;
@@
-memset(__skb_put(skb, len), 0, len);
+__skb_put_zero(skb, len);

@@
expression skb, len, data;
@@
-memcpy(__skb_put(skb, len), data, len);
+__skb_put_data(skb, data, len);

@@
expression SKB, C, S;
typedef u8;
identifier fn = {__skb_put};
fresh identifier fn2 = fn ## "_u8";
@@
- *(u8 *)fn(SKB, S) = C;
+ fn2(SKB, C);

Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:30:14 -04:00
Johannes Berg 634fef6107 networking: add and use skb_put_u8()
Joe and Bjørn suggested that it'd be nicer to not have the
cast in the fairly common case of doing
	*(u8 *)skb_put(skb, 1) = c;

Add skb_put_u8() for this case, and use it across the code,
using the following spatch:

    @@
    expression SKB, C, S;
    typedef u8;
    identifier fn = {skb_put};
    fresh identifier fn2 = fn ## "_u8";
    @@
    - *(u8 *)fn(SKB, S) = C;
    + fn2(SKB, C);

Note that due to the "S", the spatch isn't perfect, it should
have checked that S is 1, but there's also places that use a
sizeof expression like sizeof(var) or sizeof(u8) etc. Turns
out that nobody ever did something like
	*(u8 *)skb_put(skb, 2) = c;

which would be wrong anyway since the second byte wouldn't be
initialized.

Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Johannes Berg d58ff35122 networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Johannes Berg af72868b90 networking: make skb_pull & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = {
            skb_pull,
            __skb_pull,
            skb_pull_inline,
            __pskb_pull_tail,
            __pskb_pull,
            pskb_pull
    };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = {
            skb_pull,
            __skb_pull,
            skb_pull_inline,
            __pskb_pull_tail,
            __pskb_pull,
            pskb_pull
    };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Johannes Berg 4df864c1d9 networking: make skb_put & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Johannes Berg 59ae1d127a networking: introduce and use skb_put_data()
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.

An spatch similar to the one for skb_put_zero() converts many
of the places using it:

    @@
    identifier p, p2;
    expression len, skb, data;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, len);
    |
    -memcpy(p, data, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb, data;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, sizeof(*p));
    |
    -memcpy(p, data, sizeof(*p));
    )

    @@
    expression skb, len, data;
    @@
    -memcpy(skb_put(skb, len), data, len);
    +skb_put_data(skb, data, len);

(again, manually post-processed to retain some comments)

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:37 -04:00
Johannes Berg 83ad357dee skbuff: make skb_put_zero() return void
It's nicer to return void, since then there's no need to
cast to any structures. Currently none of the users have
a cast, but a number of future conversions do.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 12:17:06 -04:00
David S. Miller 0e74008b66 A couple of weeks worth of updates - looks like things are quiet:
* merged net-next back to get a patch from net that another patch
    here depends on
  * various small improvements/cleanups across the board
  * 4-way handshake offload (many thanks to Arend for shepherding that)
  * mesh CSA/DFS support in mac80211
  * the skb_put_zero() we discussed previously
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlk/2HoACgkQa3t4Rpy0
 AB3psA/8CVT+cJHH6fQoP2ev17LMB5CF/bBaRh8jeYRg/RslofwptLaG6CVi/Eri
 RSf036q1pUqpS7BlBguCUwqtNGIKvhr3AUIuN0nQsrH4iPJMl8DaCHM4a7BigdtG
 Cq4N7GTS5gJcUcjpxcOIoCsrpdkp8Lvnz6z7nBIxemYAyGuxrW2Z9ES38fh4TTlS
 k+8h+c8+K0Q3WsT5BB3i7zTTBLLhpR9r1YcbNf4Y984vF/Blc4M1ggbWMPZZG/y8
 CdOMH3dM9FHrzyHeyRC2ppVah6GBUgeccSlJP5KcF2vsMi2fVRwfxWTFXaQzgJy7
 lS2bKuqAiLopaYAmq/fSMBygxm2GPSsKtc2lz+TLXXTL18fqpIq7ZTjZLE+gYTCv
 DB0GamoaFciEKJ+jOvy95y2WRMnYia2whBrzsUzQ4Uful6vXbr5Q5ue5xCj4t4Qe
 bbveAdVl7n7m1pqtq9A3YP0m/lX2f7BIv2DF5bM1XoHohZHDdvETDF7NE2BIsT/I
 QFem5ffcBQRZPmdg7Tkh3K79tA4JA/ML4cx8W7Te9k+aOtaFR+ojA4pnH/8fI9d/
 6hIPuLwxI+OWGYNglxyIbuzZ4KiQr5JnZe4OFk4+/Y2g01ALY3DAbXnCVIXJIh8e
 bqUf+1Bai6EnxLDWx4qehB+bPVHzHYmvlZeJud+KJPUU/NZ9YSw=
 =x2vs
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2017-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A couple of weeks worth of updates - looks like things are quiet:
 * merged net-next back to get a patch from net that another patch
   here depends on
 * various small improvements/cleanups across the board
 * 4-way handshake offload (many thanks to Arend for shepherding that)
 * mesh CSA/DFS support in mac80211
 * the skb_put_zero() we discussed previously
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-13 13:52:37 -04:00
Paolo Abeni 0a463c78d2 udp: avoid a cache miss on dequeue
Since UDP no more uses sk->destructor, we can clear completely
the skb head state before enqueuing. Amend and use
skb_release_head_state() for that.

All head states share a single cacheline, which is not
normally used/accesses on dequeue. We can avoid entirely accessing
such cacheline implementing and using in the UDP code a specialized
skb free helper which ignores the skb head state.

This saves a cacheline miss at skb deallocation time.

v1 -> v2:
  replaced secpath_reset() with skb_release_head_state()

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12 10:01:29 -04:00
Paolo Abeni 3889a803e1 net: factor out a helper to decrement the skb refcount
The same code is replicated in 3 different places; move it to a
common helper.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12 10:01:29 -04:00
Johannes Berg a43e61842e Merge remote-tracking branch 'net-next/master' into mac80211-next
This brings in commit 7a7c0a6438 ("mac80211: fix TX aggregation
start/stop callback race") to allow the follow-up cleanup.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-06-08 14:14:45 +02:00
Jason A. Donenfeld 48a1df6533 skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
This is a defense-in-depth measure in response to bugs like
4d6fa57b4d ("macsec: avoid heap overflow in skb_to_sgvec"). There's
not only a potential overflow of sglist items, but also a stack overflow
potential, so we fix this by limiting the amount of recursion this function
is allowed to do. Not actually providing a bounded base case is a future
disaster that we can easily avoid here.

As a small matter of house keeping, we take this opportunity to move the
documentation comment over the actual function the documentation is for.

While this could be implemented by using an explicit stack of skbuffs,
when implementing this, the function complexity increased considerably,
and I don't think such complexity and bloat is actually worth it. So,
instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
and measured the stack usage there. I also reverted the recent MIPS
changes that give it a separate IRQ stack, so that I could experience
some worst-case situations. I found that limiting it to 24 layers deep
yielded a good stack usage with room for safety, as well as being much
deeper than any driver actually ever creates.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 23:01:47 -04:00
Johannes Berg e45a79da86 skbuff/mac80211: introduce and use skb_put_zero()
This pattern was introduced a number of times in mac80211 just now,
and since it's present in a number of other places it makes sense
to add a little helper for it.

This just adds the helper and transforms the mac80211 code, a later
patch will transform other places.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-05-30 09:14:30 +02:00
Miroslav Lichvar b50a5c70ff net: allow simultaneous SW and HW transmit timestamping
Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to
be looped to the socket's error queue with a software timestamp even
when a hardware transmit timestamp is expected to be provided by the
driver.

Applications using this option will receive two separate messages from
the error queue, one with a software timestamp and the other with a
hardware timestamp. As the hardware timestamp is saved to the shared skb
info, which may happen before the first message with software timestamp
is received by the application, the hardware timestamp is copied to the
SCM_TIMESTAMPING control message only when the skb has no software
timestamp or it is an incoming packet.

While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as
there are no other users.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:37:32 -04:00
Miroslav Lichvar 90b602f803 net: add function to retrieve original skb device using NAPI ID
Since commit b68581778c ("net: Make skb->skb_iif always track
skb->dev") skbs don't have the original index of the interface which
received the packet. This information is now needed for a new control
message related to hardware timestamping.

Instead of adding a new field to skb, we can find the device by the NAPI
ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the
driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to
hide the CONFIG_NET_RX_BUSY_POLL ifdef.

CC: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:37:32 -04:00
Davide Caratti b4759dcdcd sk_buff.h: improve description of CHECKSUM_{COMPLETE, UNNECESSARY}
Add FCoE to the list of protocols that can set CHECKSUM_UNNECESSARY; add a
note to CHECKSUM_COMPLETE section to specify that it does not apply to SCTP
and FCoE protocols.

Suggested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
Davide Caratti 43c26a1a45 net: more accurate checksumming in validate_xmit_skb()
skb_csum_hwoffload_help() uses netdev features and skb->csum_not_inet to
determine if skb needs software computation of Internet Checksum or crc32c
(or nothing, if this computation can be done by the hardware). Use it in
place of skb_checksum_help() in validate_xmit_skb() to avoid corruption
of non-GSO SCTP packets having skb->ip_summed equal to CHECKSUM_PARTIAL.

While at it, remove references to skb_csum_off_chk* functions, since they
are not present anymore in Linux  _ see commit cf53b1da73 ("Revert
 "net: Add driver helper functions to determine checksum offloadability"").

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
Davide Caratti dba003067a net: use skb->csum_not_inet to identify packets needing crc32c
skb->csum_not_inet carries the indication on which algorithm is needed to
compute checksum on skb in the transmit path, when skb->ip_summed is equal
to CHECKSUM_PARTIAL. If skb carries a SCTP packet and crc32c hasn't been
yet written in L4 header, skb->csum_not_inet is assigned to 1; otherwise,
assume Internet Checksum is needed and thus set skb->csum_not_inet to 0.

Suggested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
Davide Caratti 219f1d7987 sk_buff: remove support for csum_bad in sk_buff
This bit was introduced with commit 5a21232983 ("net: Support for
csum_bad in skbuff") to reduce the stack workload when processing RX
packets carrying a wrong Internet Checksum. Up to now, only one driver and
GRO core are setting it.

Suggested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
Davide Caratti b72b5bf6a8 net: introduce skb_crc32c_csum_help
skb_crc32c_csum_help is like skb_checksum_help, but it is designed for
checksumming SCTP packets using crc32c (see RFC3309), provided that
libcrc32c.ko has been loaded before. In case libcrc32c is not loaded,
invoking skb_crc32c_csum_help on a skb results in one the following
printouts:

warn_crc32c_csum_update: attempt to compute crc32c without libcrc32c.ko
warn_crc32c_csum_combine: attempt to compute crc32c without libcrc32c.ko

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
Davide Caratti 9617813dba skbuff: add stub to help computing crc32c on SCTP packets
sctp_compute_checksum requires crc32c symbol (provided by libcrc32c), so
it can't be used in net core. Like it has been done previously with other
symbols (e.g. ipv6_dst_lookup), introduce a stub struct skb_checksum_ops
to allow computation of crc32c checksum in net core after sctp.ko (and thus
libcrc32c) has been loaded.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
Eric Dumazet 9a568de481 tcp: switch TCP TS option (RFC 7323) to 1ms clock
TCP Timestamps option is defined in RFC 7323

Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Paolo Abeni 65101aeca5 net/sock: factor out dequeue/peek with offset code
And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:41:29 -04:00
Mauro Carvalho Chehab 771b00a84b net: skbuff.h: properly escape a macro name on kernel-doc
The "%" escape code of kernel-doc only handle letters. It
doesn't handle special chars. So, use the ``literal``
notation. That fixes this warning:

./include/linux/skbuff.h:2695: WARNING: Inline literal start-string without end-string.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:13 -03:00
Linus Torvalds 8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Al Viro 3073f070a1 switch memcpy_from_msg() to copy_from_iter_full()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 13:57:16 -04:00
Steffen Klassert c7ef8f0c02 net: Add ESP offload features
This patch adds netdev features to configure IPsec offloads.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:05:36 +02:00
Steffen Klassert 7f564528a4 skbuff: Extend gso_type to unsigned int.
All available gso_type flags are currently in use, so
extend gso_type from 'unsigned short' to 'unsigned int'
to be able to add further flags.

We reorder the struct skb_shared_info to use
two bytes of the four byte hole before dataref.
All fields before dataref are cleared, i.e.
four bytes more than before the change.

The remaining two byte hole is moved to the
beginning of the structure, this protects us
from immediate overwites on out of bound writes
to the sk_buff head.

Structure layout on x86-64 before the change:

struct skb_shared_info {
	unsigned char              nr_frags;             /*     0     1 */
	__u8                       tx_flags;             /*     1     1 */
	short unsigned int         gso_size;             /*     2     2 */
	short unsigned int         gso_segs;             /*     4     2 */
	short unsigned int         gso_type;             /*     6     2 */
	struct sk_buff *           frag_list;            /*     8     8 */
	struct skb_shared_hwtstamps hwtstamps;           /*    16     8 */
	u32                        tskey;                /*    24     4 */
	__be32                     ip6_frag_id;          /*    28     4 */
	atomic_t                   dataref;              /*    32     4 */

	/* XXX 4 bytes hole, try to pack */

	void *                     destructor_arg;       /*    40     8 */
	skb_frag_t                 frags[17];            /*    48   272 */
	/* --- cacheline 5 boundary (320 bytes) --- */

	/* size: 320, cachelines: 5, members: 12 */
	/* sum members: 316, holes: 1, sum holes: 4 */
};

Structure layout on x86-64 after the change:

struct skb_shared_info {
	short unsigned int         _unused;              /*     0     2 */
	unsigned char              nr_frags;             /*     2     1 */
	__u8                       tx_flags;             /*     3     1 */
	short unsigned int         gso_size;             /*     4     2 */
	short unsigned int         gso_segs;             /*     6     2 */
	struct sk_buff *           frag_list;            /*     8     8 */
	struct skb_shared_hwtstamps hwtstamps;           /*    16     8 */
	unsigned int               gso_type;             /*    24     4 */
	u32                        tskey;                /*    28     4 */
	__be32                     ip6_frag_id;          /*    32     4 */
	atomic_t                   dataref;              /*    36     4 */
	void *                     destructor_arg;       /*    40     8 */
	skb_frag_t                 frags[17];            /*    48   272 */
	/* --- cacheline 5 boundary (320 bytes) --- */

	/* size: 320, cachelines: 5, members: 13 */
};

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 12:32:07 -07:00
Ingo Molnar e601757102 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Amir Vadai ea6da4fd38 net/skbuff: Introduce skb_mac_offset()
Introduce skb_mac_offset() that could be used to get mac header offset.

Signed-off-by: Amir Vadai <amir@vadai.me>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:18:33 -05:00
Julian Anastasov 4ff0620354 net: add dst_pending_confirm flag to skbuff
Add new skbuff flag to allow protocols to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.

Add sock_confirm_neigh() helper to confirm the neighbour and
use it for IPv4, IPv6 and VRF before dst_neigh_output.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 13:07:46 -05:00
David S. Miller 52e01b84a2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, they are:

1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
   sk_buff so we only access one single cacheline in the conntrack
   hotpath. Patchset from Florian Westphal.

2) Don't leak pointer to internal structures when exporting x_tables
   ruleset back to userspace, from Willem DeBruijn. This includes new
   helper functions to copy data to userspace such as xt_data_to_user()
   as well as conversions of our ip_tables, ip6_tables and arp_tables
   clients to use it. Not surprinsingly, ebtables requires an ad-hoc
   update. There is also a new field in x_tables extensions to indicate
   the amount of bytes that we copy to userspace.

3) Add nf_log_all_netns sysctl: This new knob allows you to enable
   logging via nf_log infrastructure for all existing netnamespaces.
   Given the effort to provide pernet syslog has been discontinued,
   let's provide a way to restore logging using netfilter kernel logging
   facilities in trusted environments. Patch from Michal Kubecek.

4) Validate SCTP checksum from conntrack helper, from Davide Caratti.

5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
   a copy&paste from the original helper, from Florian Westphal.

6) Reset netfilter state when duplicating packets, also from Florian.

7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
   nft_meta, from Liping Zhang.

8) Add missing code to deal with loopback packets from nft_meta when
   used by the netdev family, also from Liping.

9) Several cleanups on nf_tables, one to remove unnecessary check from
   the netlink control plane path to add table, set and stateful objects
   and code consolidation when unregister chain hooks, from Gao Feng.

10) Fix harmless reference counter underflow in IPVS that, however,
    results in problems with the introduction of the new refcount_t
    type, from David Windsor.

11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
    from Davide Caratti.

12) Missing documentation on nf_tables uapi header, from Liping Zhang.

13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 16:58:20 -05:00
Florian Westphal a9e419dc7b netfilter: merge ctinfo into nfct pointer storage area
After this change conntrack operations (lookup, creation, matching from
ruleset) only access one instead of two sk_buff cache lines.

This works for normal conntracks because those are allocated from a slab
that guarantees hw cacheline or 8byte alignment (whatever is larger)
so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

Template allocation now does manual address alignment (see previous change)
on arches that don't have sufficent kmalloc min alignment.

Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
this is to avoid undoing the skb_nfct() use when we remove untracked
conntrack object in the future.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:56 +01:00
Florian Westphal cb9c68363e skbuff: add and use skb_nfct helper
Followup patch renames skb->nfct and changes its type so add a helper to
avoid intrusive rename change later.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:53 +01:00
David S. Miller 02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Alexander Duyck 8c2dd3e4a4 mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
Patch series "Page fragment updates", v4.

This patch series takes care of a few cleanups for the page fragments
API.

First we do some renames so that things are much more consistent.  First
we move the page_frag_ portion of the name to the front of the functions
names.  Secondly we split out the cache specific functions from the
other page fragment functions by adding the word "cache" to the name.

Finally I added a bit of documentation that will hopefully help to
explain some of this.  I plan to revisit this later as we get things
more ironed out in the near future with the changes planned for the DMA
setup to support eXpress Data Path.

This patch (of 3):

This patch renames the page frag functions to be more consistent with
other APIs.  Specifically we place the name page_frag first in the name
and then have either an alloc or free call name that we append as the
suffix.  This makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are
technically no longer a backing interface and instead the front end that
is called from the networking APIs.

Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00
Willem de Bruijn bc31c905e9 net-tc: convert tc_from to tc_from_ingress and tc_redirected
The tc_from field fulfills two roles. It encodes whether a packet was
redirected by an act_mirred device and, if so, whether act_mirred was
called on ingress or egress. Split it into separate fields.

The information is needed by the special IFB loop, where packets are
taken out of the normal path by act_mirred, forwarded to IFB, then
reinjected at their original location (ingress or egress) by IFB.

The IFB device cannot use skb->tc_at_ingress, because that may have
been overwritten as the packet travels from act_mirred to ifb_xmit,
when it passes through tc_classify on the IFB egress path. Cache this
value in skb->tc_from_ingress.

That field is valid only if a packet arriving at ifb_xmit came from
act_mirred. Other packets can be crafted to reach ifb_xmit. These
must be dropped. Set tc_redirected on redirection and drop all packets
that do not have this bit set.

Both fields are set only on cloned skbs in tc actions, so original
packet sources do not have to clear the bit when reusing packets
(notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Willem de Bruijn 8dc07fdbf2 net-tc: convert tc_at to tc_at_ingress
Field tc_at is used only within tc actions to distinguish ingress from
egress processing. A single bit is sufficient for this purpose.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Willem de Bruijn a5135bcfba net-tc: convert tc_verd to integer bitfields
Extract the remaining two fields from tc_verd and remove the __u16
completely. TC_AT and TC_FROM are converted to equivalent two-bit
integer fields tc_at and tc_from. Where possible, use existing
helper skb_at_tc_ingress when reading tc_at. Introduce helper
skb_reset_tc to clear fields.

Not documenting tc_from and tc_at, because they will be replaced
with single bit fields in follow-on patches.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Willem de Bruijn e7246e122a net-tc: extract skip classify bit from tc_verd
Packets sent by the IFB device skip subsequent tc classification.
A single bit governs this state. Move it out of tc_verd in
anticipation of removing that __u16 completely.

The new bitfield tc_skip_classify temporarily uses one bit of a
hole, until tc_verd is removed completely in a follow-up patch.

Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long.
With that many options, little value in documenting it.

Introduce a helper function to deduplicate the logic in the two
sites that check this bit.

The field tc_skip_classify is set only in IFB on skbs cloned in
act_mirred, so original packet sources do not have to clear the
bit when reusing packets (notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Thomas Gleixner 8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Linus Torvalds 9a19a6db37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - more ->d_init() stuff (work.dcache)

 - pathname resolution cleanups (work.namei)

 - a few missing iov_iter primitives - copy_from_iter_full() and
   friends. Either copy the full requested amount, advance the iterator
   and return true, or fail, return false and do _not_ advance the
   iterator. Quite a few open-coded callers converted (and became more
   readable and harder to fuck up that way) (work.iov_iter)

 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 10:24:44 -08:00
Eric Dumazet c84d949057 udp: copy skb->truesize in the first cache line
In UDP RX handler, we currently clear skb->dev before skb
is added to receive queue, because device pointer is no longer
available once we exit from RCU section.

Since this first cache line is always hot, lets reuse this space
to store skb->truesize and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor time while receive queue
spinlock is held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 22:12:21 -05:00
Eric Dumazet c8c8b12709 udp: under rx pressure, try to condense skbs
Under UDP flood, many softirq producers try to add packets to
UDP receive queue, and one user thread is burning one cpu trying
to dequeue packets as fast as possible.

Two parts of the per packet cost are :
- copying payload from kernel space to user space,
- freeing memory pieces associated with skb.

If socket is under pressure, softirq handler(s) can try to pull in
skb->head the payload of the packet if it fits.

Meaning the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another node.

Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.

This comes at the cost of a copy, bounded to available tail room, which
is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
than necessary)

This patch gave me about 5 % increase in throughput in my tests.

skb_condense() helper could probably used in other contexts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 13:25:07 -05:00
Al Viro 15e6cb46c9 make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:34:30 -05:00
Alexey Dobriyan c72d8cdaa5 net: fix bogus cast in skb_pagelen() and use unsigned variables
1) cast to "int" is unnecessary:
   u8 will be promoted to int before decrementing,
   small positive numbers fit into "int", so their values won't be changed
   during promotion.

   Once everything is int including loop counters, signedness doesn't
   matter: 32-bit operations will stay 32-bit operations.

   But! Someone tried to make this loop smart by making everything of
   the same type apparently in an attempt to optimise it.
   Do the optimization, just differently.
   Do the cast where it matters. :^)

2) frag size is unsigned entity and sum of fragments sizes is also
   unsigned.

Make everything unsigned, leave no MOVSX instruction behind.

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
	function                                     old     new   delta
	skb_cow_data                                 835     834      -1
	ip_do_fragment                              2549    2548      -1
	ip6_fragment                                3130    3128      -2
	Total: Before=154865032, After=154865028, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:11:25 -05:00
Paolo Abeni 7c13f97ffd udp: do fwd memory scheduling on dequeue
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks	vanilla		patched
1		440		560
3		2150		2300
6		3650		3800
9		4450		4600
12		6250		6450

v1 -> v2:
 - do rmem and allocated memory scheduling under the receive lock
 - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
 - avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Florian Westphal b917783c7b flow_dissector: __skb_get_hash_symmetric arg can be const
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:10:21 -04:00
Stephen Hemminger 293de7dee4 doc: update docbook annotations for socket and skb
The skbuff and sock structure both had missing parameter annotation
values.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:31:23 -04:00
Linus Torvalds d1f5323370 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other folks
  and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle...  Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write
2016-10-07 15:36:58 -07:00
Al Viro 25869262ef skb_splice_bits(): get rid of callback
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:56 -04:00
Shmulik Ladkani bfca4c520f net: skbuff: Export __skb_vlan_pop
This exports the functionality of extracting the tag from the payload,
without moving next vlan tag into hw accel tag.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:34:20 -04:00
Daniel Borkmann 36bbef52c7 bpf: direct packet write and access for helpers for clsact progs
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a220 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. bpf_csum_update() helper accommodates
for the fact that we need to fixup checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:32:11 -04:00
Yaogong Wang 9f5afeae51 tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:25:58 -07:00
Ido Schimmel 6bc506b4fb bridge: switchdev: Add forward mark support for stacked devices
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.

It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.

This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.

The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.

Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).

Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 13:13:36 -07:00
Daniel Borkmann 5293efe62d bpf: add bpf_skb_change_tail helper
This work adds a bpf_skb_change_tail() helper for tc BPF programs. The
basic idea is to expand or shrink the skb in a controlled manner. The
eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(),
bpf_lX_csum_replace() and others rather than passing a raw buffer for
writing here.

bpf_skb_change_tail() is really a slow path helper and intended for
replies with f.e. ICMP control messages. Concept is similar to other
helpers like bpf_skb_change_proto() helper to keep the helper without
protocol specifics and let the BPF program mangle the remaining parts.
A flags field has been added and is reserved for now should we extend
the helper in future.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:38:16 -07:00
Daniel Borkmann 479ffcccef bpf: fix checksum fixups on bpf_skb_store_bytes
bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
Where we ran into an issue with bpf_skb_store_bytes() is when we did a
single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
underlying issue has been tracked down to a buffer alignment issue.

Meaning, that csum_partial() computations via skb_postpull_rcsum() and
skb_postpush_rcsum() pair invoked had a wrong result since they operated on
an odd address for the hoplimit, while other computations were done on an
even address. This mix doesn't work as-is with skb_postpull_rcsum(),
skb_postpush_rcsum() pair as it always expects at least half-word alignment
of input buffers, which is normally the case. Thus, instead of these helpers
using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(),
csum_block_add(), respectively. For unaligned offsets, they rotate the sum
to align it to a half-word boundary again, otherwise they work the same as
csum_sub() and csum_add().

Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
use a 0 constant for offset so that the compiler optimizes the offset & 1
test away and generates the same code as with csum_sub()/_add().

Fixes: 608cd71a9c ("tc: bpf: generalize pedit action")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 13:11:43 -07:00
David S. Miller 30d0844bdc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx5/core/en.h
	drivers/net/ethernet/mellanox/mlx5/core/en_main.c
	drivers/net/usb/r8152.c

All three conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 10:35:22 -07:00
Jamal Hadi Salim 8b10cab64c net: simplify and make pkt_type_ok() available for other users
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 15:11:13 -07:00
WANG Cong 82a31b9231 net_sched: fix mirrored packets checksum
Similar to commit 9b368814b3 ("net: fix bridge multicast packet checksum validation")
we need to fixup the checksum for CHECKSUM_COMPLETE when
pushing skb on RX path. Otherwise we get similar splats.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:19:34 -04:00
David S. Miller eb70db8756 packet: Use symmetric hash for PACKET_FANOUT_HASH.
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.

The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.

But there are no users of PACKET_FANOUT_HASH who want an assymetric
hash, they all want a symmetric one.

Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports.  This hash does not get
installed into and override the normal skb hash, so this change has
no effect whatsoever on the rest of the stack.

Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:07:50 -04:00
Marcelo Ricardo Leitner 90017accff sctp: Add GSO support
SCTP has this pecualiarity that its packets cannot be just segmented to
(P)MTU. Its chunks must be contained in IP segments, padding respected.
So we can't just generate a big skb, set gso_size to the fragmentation
point and deliver it to IP layer.

This patch takes a different approach. SCTP will now build a skb as it
would be if it was received using GRO. That is, there will be a cover
skb with protocol headers and children ones containing the actual
segments, already segmented to a way that respects SCTP RFCs.

With that, we can tell skb_segment() to just split based on frag_list,
trusting its sizes are already in accordance.

This way SCTP can benefit from GSO and instead of passing several
packets through the stack, it can pass a single large packet.

v2:
- Added support for receiving GSO frames, as requested by Dave Miller.
- Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
- Added heuristics similar to what we have in TCP for not generating
  single GSO packets that fills cwnd.
v3:
- consider sctphdr size in skb_gso_transport_seglen()
- rebased due to 5c7cdf339a ("gso: Remove arbitrary checks for
  unsupported GSO")

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03 19:37:21 -04:00
Marcelo Ricardo Leitner ae7ef81ef0 skbuff: introduce skb_gso_validate_mtu
skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.

This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given TMU, and improve the check.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03 19:37:21 -04:00
Marcelo Ricardo Leitner 3953c46c3a sk_buff: allow segmenting based on frag sizes
This patch allows segmenting a skb based on its frags sizes instead of
based on a fixed value.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03 19:37:21 -04:00
Neil Horman 95829b3a9c net: suppress warnings on dev_alloc_skb
Noticed an allocation failure in a network driver the other day on a 32 bit
system:

DMA-API: debugging out of memory - disabling
bnx2fc: adapter_lookup: hba NULL
lldpad: page allocation failure. order:0, mode:0x4120
Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
Call Trace:
 [<c08a4086>] ? printk+0x19/0x23
 [<c05166a4>] ? __alloc_pages_nodemask+0x664/0x830
 [<c0649d02>] ? free_object+0x82/0xa0
 [<fb4e2c9b>] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
 [<fb4e2fff>] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
 [<fb4e228c>] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
 [<fb4e3709>] ? ixgbe_configure+0x589/0xc00 [ixgbe]
 [<fb4e7be7>] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
 [<fb503ce6>] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
 [<fb4e8e54>] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
 [<fb505a9f>] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
 [<c088d80d>] ? dcb_doit+0x10ed/0x16d0
...

Thought that perhaps the big splat in the logs wasn't really necessecary, as
all call sites for dev_alloc_skb:

a) check the return code for the function

and

b) either print their own error message or have a recovery path that makes the
warning moot.

Fix it by modifying dev_alloc_pages to pass __GFP_NOWARN as a gfp flag to
suppress the warning

applies to the net tree

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 19:58:32 -04:00
Tom Herbert 7e13318daa net: define gso types for IPx over IPv4 and IPv6
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to described IP in IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:15 -04:00
Eric Dumazet 9580bf2edb net: relax expensive skb_unclone() in iptunnel_handle_offloads()
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:22:19 -04:00
Soheil Hassas Yeganeh 0a2cf20c3f tcp: remove SKBTX_ACK_TSTAMP since it is redundant
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Sowmini Varadhan 6fa01ccd88 skbuff: Add pskb_extract() helper function
A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.

The existing code uses the sequence
	clone = skb_clone(..);
	pskb_pull(clone, off, ..);
	pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pksb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().

To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 16:54:14 -04:00
Alexander Duyck 802ab55adc GSO: Support partial segmentation offload
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck cbc53e08a7 GSO: Add GSO type for fixed IPv4 ID
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:40 -04:00
samanthakumar 627d2d6b55 udp: enable MSG_PEEK at non-zero offset
Enable peeking at UDP datagrams at the offset specified with socket
option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
to the end of the given datagram.

Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f6e
("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
on peek, decrease it on regular reads.

When peeking, always checksum the packet immediately, to avoid
recomputation on subsequent peeks and final read.

The socket lock is not held for the duration of udp_recvmsg, so
peek and read operations can run concurrently. Only the last store
to sk_peek_off is preserved.

Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-05 16:29:37 -04:00
David S. Miller 810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Benjamin Poirier 1837b2e2bc mld, igmp: Fix reserved tailroom calculation
The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 15:41:07 -05:00
WANG Cong 64d4e3431e net: remove skb_sender_cpu_clear()
After commit 52bd2d62ce ("net: better skb->sender_cpu and skb->napi_id cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:36:47 -05:00
David S. Miller b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00