As part of y2038 solution, all internal uses of
struct timeval are replaced by struct __kernel_old_timeval
and struct compat_timeval by struct old_timeval32.
Make socket timestamps use these new types.
This is mainly to be able to verify that the kernel build
is y2038 safe when such non y2038 safe types are not
supported anymore.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: isdn@linux-pingi.de
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-01-29
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Teach verifier dead code removal, this also allows for optimizing /
removing conditional branches around dead code and to shrink the
resulting image. Code store constrained architectures like nfp would
have hard time doing this at JIT level, from Jakub.
2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
code generation for 32-bit sub-registers. Evaluation shows that this
can result in code reduction of ~5-20% compared to 64 bit-only code
generation. Also add implementation for most JITs, from Jiong.
3) Add support for __int128 types in BTF which is also needed for
vmlinux's BTF conversion to work, from Yonghong.
4) Add a new command to bpftool in order to dump a list of BPF-related
parameters from the system or for a specific network device e.g. in
terms of available prog/map types or helper functions, from Quentin.
5) Add AF_XDP sock_diag interface for querying sockets from user
space which provides information about the RX/TX/fill/completion
rings, umem, memory usage etc, from Björn.
6) Add skb context access for skb_shared_info->gso_segs field, from Eric.
7) Add support for testing flow dissector BPF programs by extending
existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.
8) Split BPF kselftest's test_verifier into various subgroups of tests
in order better deal with merge conflicts in this area, from Jakub.
9) Add support for queue/stack manipulations in bpftool, from Stanislav.
10) Document BTF, from Yonghong.
11) Dump supported ELF section names in libbpf on program load
failure, from Taeung.
12) Silence a false positive compiler warning in verifier's BTF
handling, from Peter.
13) Fix help string in bpftool's feature probing, from Prashant.
14) Remove duplicate includes in BPF kselftests, from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This way, we can reuse it for flow dissector in BPF_PROG_TEST_RUN.
No functional changes.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In certain cases, pskb_trim_rcsum() may change skb pointers.
Reinitialize header pointers afterwards to avoid potential
use-after-frees. Add a note in the documentation of
pskb_trim_rcsum(). Found by KASAN.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
2p4aHs6ktw==
=TRdW
-----END PGP SIGNATURE-----
Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main pull request for block/storage for 4.21.
Larger than usual, it was a busy round with lots of goodies queued up.
Most notable is the removal of the old IO stack, which has been a long
time coming. No new features for a while, everything coming in this
week has all been fixes for things that were previously merged.
This contains:
- Use atomic counters instead of semaphores for mtip32xx (Arnd)
- Cleanup of the mtip32xx request setup (Christoph)
- Fix for circular locking dependency in loop (Jan, Tetsuo)
- bcache (Coly, Guoju, Shenghui)
* Optimizations for writeback caching
* Various fixes and improvements
- nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
* host and target support for NVMe over TCP
* Error log page support
* Support for separate read/write/poll queues
* Much improved polling
* discard OOM fallback
* Tracepoint improvements
- lightnvm (Hans, Hua, Igor, Matias, Javier)
* Igor added packed metadata to pblk. Now drives without metadata
per LBA can be used as well.
* Fix from Geert on uninitialized value on chunk metadata reads.
* Fixes from Hans and Javier to pblk recovery and write path.
* Fix from Hua Su to fix a race condition in the pblk recovery
code.
* Scan optimization added to pblk recovery from Zhoujie.
* Small geometry cleanup from me.
- Conversion of the last few drivers that used the legacy path to
blk-mq (me)
- Removal of legacy IO path in SCSI (me, Christoph)
- Removal of legacy IO stack and schedulers (me)
- Support for much better polling, now without interrupts at all.
blk-mq adds support for multiple queue maps, which enables us to
have a map per type. This in turn enables nvme to have separate
completion queues for polling, which can then be interrupt-less.
Also means we're ready for async polled IO, which is hopefully
coming in the next release.
- Killing of (now) unused block exports (Christoph)
- Unification of the blk-rq-qos and blk-wbt wait handling (Josef)
- Support for zoned testing with null_blk (Masato)
- sx8 conversion to per-host tag sets (Christoph)
- IO priority improvements (Damien)
- mq-deadline zoned fix (Damien)
- Ref count blkcg series (Dennis)
- Lots of blk-mq improvements and speedups (me)
- sbitmap scalability improvements (me)
- Make core inflight IO accounting per-cpu (Mikulas)
- Export timeout setting in sysfs (Weiping)
- Cleanup the direct issue path (Jianchao)
- Export blk-wbt internals in block debugfs for easier debugging
(Ming)
- Lots of other fixes and improvements"
* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
kyber: use sbitmap add_wait_queue/list_del wait helpers
sbitmap: add helpers for add/del wait queue handling
block: save irq state in blkg_lookup_create()
dm: don't reuse bio for flushes
nvme-pci: trace SQ status on completions
nvme-rdma: implement polling queue map
nvme-fabrics: allow user to pass in nr_poll_queues
nvme-fabrics: allow nvmf_connect_io_queue to poll
nvme-core: optionally poll sync commands
block: make request_to_qc_t public
nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
nvme-tcp: fix endianess annotations
nvmet-tcp: fix endianess annotations
nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
nvme-pci: only set nr_maps to 2 if poll queues are supported
nvmet: use a macro for default error location
nvmet: fix comparison of a u16 with -1
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
...
Such helper is currently unused, and skb extension users are
better off using skb_ext_add()/skb_ext_del(). So let's drop
it.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove skb->sp and allocate secpath storage via extension
infrastructure. This also reduces sk_buff by 8 bytes on x86_64.
Total size of allyesconfig kernel is reduced slightly, as there is
less inlined code (one conditional atomic op instead of two on
skb_clone).
No differences in throughput in following ipsec performance tests:
- transport mode with aes on 10GB link
- tunnel mode between two network namespaces with aes and null cipher
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.
The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.
The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preseved on
clones/copies.
To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (its unlikely that all extensions are active/needed at same
time for same skb).
The allocated memory (except the small extension header) is not cleared,
so no additonal overhead aside from memory usage.
Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If recount is 1, there is no concurrent user and we can free right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper to copy datagram into an iovec iterator
but also update a predefined hash. This is useful for
consumers of skb_copy_datagram_iter to also support inflight
data digest without having to finish to copy and only then
traverse the iovec and calculate the digest hash.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Commit abf4bb6b63 ("skbuff: Add the offload_mr_fwd_mark field") added
the 'offload_mr_fwd_mark' field to indicate that a packet has already
undergone L3 multicast routing by a capable device. The field is used to
prevent the kernel from forwarding a packet through a netdev through
which the device has already forwarded the packet.
Currently, no unicast packet is routed by both the device and the
kernel, but this is about to change by subsequent patches and we need to
be able to mark such packets, so that they will no be forwarded twice.
Instead of adding yet another field to 'struct sk_buff', we can just
rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
either has a multicast or a unicast destination IP.
While at it, add a comment about both 'offload_fwd_mark' and
'offload_l3_fwd_mark'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.
The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.
The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.
Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.
This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.
Changes
v3 -> v4
- Move skb_zcopy_set below the only kfree_skb that might cause
a premature uarg destroy before skb_zerocopy_put_abort
- Move the entire skb_shinfo assignment block, to keep that
cacheline access in one place
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.
This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweights the cost of
generating a completion notification.
tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok
Changes
v1 -> v2
- Fixup reverse christmas tree violation
v2 -> v3
- Split refcount avoidance optimization into separate patch
- Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
- Fix refcount inc on zero
- Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
This is needed since commit 5cf4a8532c ("tcp: really ignore
MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I do not see how one can effectively use skb_insert() without holding
some kind of lock. Otherwise other cpus could have changed the list
right before we have a chance of acquiring list->lock.
Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this
one probably meant to use __skb_insert() since it appears nesqp->pau_list
is protected by nesqp->pau_lock. This looks like nesqp->pau_lock
could be removed, since nesqp->pau_list.lock could be used instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma <linux-rdma@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_snd sends packets with user pages linked into skb frags. It
notifies that pages can be reused when the skb is released by setting
skb->destructor to tpacket_destruct_skb.
This can cause data corruption if the skb is orphaned (e.g., on
transmit through veth) or cloned (e.g., on mirror to another psock).
Create a kernel-private copy of data in these cases, same as tun/tap
zerocopy transmission. Reuse that infrastructure: mark the skb as
SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
Unlike other zerocopy packets, do not set shinfo destructor_arg to
struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
when the original skb is released and a timestamp is recorded. Do not
change this timestamp behavior. The ubuf_info->callback is not needed
anyway, as no zerocopy notification is expected.
Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
resulting value is not a valid ubuf_info pointer, nor a valid
tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
The fix relies on features introduced in commit 52267790ef ("sock:
add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
Tested with from `./in_netns.sh ./txring_overwrite` from
http://github.com/wdebruij/kerneltools/tests
Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Reported-by: Anand H. Krishnan <anandhkrishnan@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace VLAN_TAG_PRESENT with single bit flag and free up
VLAN.CFI overload. Now VLAN.CFI is visible in networking stack
and can be passed around intact.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap VLAN_PRESENT bit using macro like PKT_TYPE_* and CLONED_*,
as used by BPF code.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ON() already contains an unlikely(), so it's not necessary to use
unlikely.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper is unused since commit 988cf74deb ("inet:
Stop generating UFO packets.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-25
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow for RX stack hardening by implementing the kernel's flow
dissector in BPF. Idea was originally presented at netconf 2017 [0].
Quote from merge commit:
[...] Because of the rigorous checks of the BPF verifier, this
provides significant security guarantees. In particular, the BPF
flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate.
It cannot read outside of packet bounds, because all memory accesses
are checked. Also, with BPF the administrator can decide which
protocols to support, reducing potential attack surface. Rarely
encountered protocols can be excluded from dissection and the
program can be updated without kernel recompile or reboot if a
bug is discovered. [...]
Also, a sample flow dissector has been implemented in BPF as part
of this work, from Petar and Willem.
[0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
2) Add support for bpftool to list currently active attachment
points of BPF networking programs providing a quick overview
similar to bpftool's perf subcommand, from Yonghong.
3) Fix a verifier pruning instability bug where a union member
from the register state was not cleared properly leading to
branches not being pruned despite them being valid candidates,
from Alexei.
4) Various smaller fast-path optimizations in XDP's map redirect
code, from Jesper.
5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
in bpftool, from Roman.
6) Remove a duplicate check in libbpf that probes for function
storage, from Taeung.
7) Fix an issue in test_progs by avoid checking for errno since
on success its value should not be checked, from Mauricio.
8) Fix unused variable warning in bpf_getsockopt() helper when
CONFIG_INET is not configured, from Anders.
9) Fix a compilation failure in the BPF sample code's use of
bpf_flow_keys, from Prashant.
10) Minor cleanups in BPF code, from Yue and Zhong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.
Do not clear skb->tstamp before entering IP stacks in TX,
so that qdisc or devices can implement pacing based on the
earliest departure time instead of socket sk->sk_pacing_rate
Packets are fed with tcp_wstamp_ns, and following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If boolean CONFIG_BPF_SYSCALL is enabled, kernel/bpf/syscall.c will
call flow_dissector functions from net/core/flow_dissector.c.
This causes this build failure if CONFIG_NET is disabled:
kernel/bpf/syscall.o: In function `__x64_sys_bpf':
syscall.c:(.text+0x3278): undefined reference to
`skb_flow_dissector_bpf_prog_attach'
syscall.c:(.text+0x3310): undefined reference to
`skb_flow_dissector_bpf_prog_detach'
kernel/bpf/syscall.o:(.rodata+0x3f0): undefined reference to
`flow_dissector_prog_ops'
kernel/bpf/verifier.o:(.rodata+0x250): undefined reference to
`flow_dissector_verifier_ops'
Analogous to other optional BPF program types in syscall.c, add stubs
if the relevant functions are not compiled and move the BPF_PROG_TYPE
definition in the #ifdef CONFIG_NET block.
Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It documents what is happening, and eliminates the spurious list
pointer poisoning.
In the long term, in order to get proper list head debugging, we
might want to use the list poison value as the indicator that
an SKB is a singleton and not on a list.
Signed-off-by: David S. Miller <davem@davemloft.net>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-13
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add driver XDP support for veth. This can be used in conjunction with
redirect of another XDP program e.g. sitting on NIC so the xdp_frame
can be forwarded to the peer veth directly without modification,
from Toshiaki.
2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
in order to provide more control and visibility on where a SO_REUSEPORT
sk should be located, and the latter enables to directly select a sk
from the bpf map. This also enables map-in-map for application migration
use cases, from Martin.
3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
of cgroup v2 that is the ancestor of the cgroup associated with the
skb at the ancestor_level, from Andrey.
4) Implement BPF fs map pretty-print support based on BTF data for regular
hash table and LRU map, from Yonghong.
5) Decouple the ability to attach BTF for a map from the key and value
pretty-printer in BPF fs, and enable further support of BTF for maps for
percpu and LPM trie, from Daniel.
6) Implement a better BPF sample of using XDP's CPU redirect feature for
load balancing SKB processing to remote CPU. The sample implements the
same XDP load balancing as Suricata does which is symmetric hash based
on IP and L4 protocol, from Jesper.
7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
critical path as it is ensured that the allocator is present, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed for veth XDP which does skb_copy_expand()-like operation.
v2:
- Drop skb_copy_header part because it has already been exported now.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes the following sparse warning:
./include/linux/skbuff.h:2365:58: warning: Using plain integer as NULL pointer
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.
Tested:
- a follow-up patch contains a rather comprehensive ip defrag
self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
(The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested: see the next patch is the series.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 784abe24c9 ("net: Add decrypted field to skb")
introduced a 'decrypted' field that is explicitly copied on skb
copy and clone.
Move it between headers_start[0] and headers_end[0], so that we
don't need to copy it explicitly as it's copied by the memcpy()
in __copy_skb_header().
While at it, drop the assignment in __skb_clone(), it was
already redundant.
This doesn't change the size of sk_buff or cacheline boundaries.
The 15-bits hole before tc_index becomes a 14-bits hole, and
will be again a 15-bits hole when this change is merged with
commit 8b7008620b ("net: Don't copy pfmemalloc flag in
__copy_skb_header()").
v2: as reported by kbuild test robot (oops, I forgot to build
with CONFIG_TLS_DEVICE it seems), we can't use
CHECK_SKB_FIELD() on a bit-field member. Just drop the
check for the moment being, perhaps we could think of some
magic to also check bit-field members one day.
Fixes: 784abe24c9 ("net: Add decrypted field to skb")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The decrypted bit is propogated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pfmemalloc flag indicates that the skb was allocated from
the PFMEMALLOC reserves, and the flag is currently copied on skb
copy and clone.
However, an skb copied from an skb flagged with pfmemalloc
wasn't necessarily allocated from PFMEMALLOC reserves, and on
the other hand an skb allocated that way might be copied from an
skb that wasn't.
So we should not copy the flag on skb copy, and rather decide
whether to allow an skb to be associated with sockets unrelated
to page reclaim depending only on how it was allocated.
Move the pfmemalloc flag before headers_start[0] using an
existing 1-bit hole, so that __copy_skb_header() doesn't copy
it.
When cloning, we'll now take care of this flag explicitly,
contravening to the warning comment of __skb_clone().
While at it, restore the newline usage introduced by commit
b193722731 ("net: reorganize sk_buff for faster
__copy_skb_header()") to visually separate bytes used in
bitfields after headers_start[0], that was gone after commit
a9e419dc7b ("netfilter: merge ctinfo into nfct pointer storage
area"), and describe the pfmemalloc flag in the kernel-doc
structure comment.
This doesn't change the size of sk_buff or cacheline boundaries,
but consolidates the 15 bits hole before tc_index into a 2 bytes
hole before csum, that could now be filled more easily.
Reported-by: Patrick Talbert <ptalbert@redhat.com>
Fixes: c93bdd0e03 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple overlapping changes in stmmac driver.
Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.
Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.
But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.
[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manage pending per-NAPI GRO packets via list_head.
Return an SKB pointer from the GRO receive handlers. When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.
Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
<linux/skbuff.h> does not use nor need <linux/slab.h>, so drop this
header file from skbuff.h.
<linux/skbuff.h> is currently #included in around 1200 C source and
header files, making it the 31st most-used header file.
Build tested [allmodconfig] on 20 arch-es.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the core networking needs to detect the transport offset in a given
packet and parse it explicitly, a full-blown flow_keys struct is used for
storage.
This patch introduces a smaller keys store, rework the basic flow dissect
helper to use it, and apply this new helper where possible - namely in
skb_probe_transport_header(). The used flow dissector data structures
are renamed to match more closely the new role.
The above gives ~50% performance improvement in micro benchmarking around
skb_probe_transport_header() and ~30% around eth_get_headlen(), mostly due
to the smaller memset. Small, but measurable improvement is measured also
in macro benchmarking.
v1 -> v2: use the new helper in eth_get_headlen() and skb_get_poff(),
as per DaveM suggestion
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.
UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.
The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).
IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.
[ Export __udp_gso_segment for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>