At present the ipv6_renew_options_kern() function ends up calling into
access_ok() which is problematic if done from inside an interrupt as
access_ok() calls WARN_ON_IN_IRQ() on some (all?) architectures
(x86-64 is affected). Example warning/backtrace is shown below:
WARNING: CPU: 1 PID: 3144 at lib/usercopy.c:11 _copy_from_user+0x85/0x90
...
Call Trace:
<IRQ>
ipv6_renew_option+0xb2/0xf0
ipv6_renew_options+0x26a/0x340
ipv6_renew_options_kern+0x2c/0x40
calipso_req_setattr+0x72/0xe0
netlbl_req_setattr+0x126/0x1b0
selinux_netlbl_inet_conn_request+0x80/0x100
selinux_inet_conn_request+0x6d/0xb0
security_inet_conn_request+0x32/0x50
tcp_conn_request+0x35f/0xe00
? __lock_acquire+0x250/0x16c0
? selinux_socket_sock_rcv_skb+0x1ae/0x210
? tcp_rcv_state_process+0x289/0x106b
tcp_rcv_state_process+0x289/0x106b
? tcp_v6_do_rcv+0x1a7/0x3c0
tcp_v6_do_rcv+0x1a7/0x3c0
tcp_v6_rcv+0xc82/0xcf0
ip6_input_finish+0x10d/0x690
ip6_input+0x45/0x1e0
? ip6_rcv_finish+0x1d0/0x1d0
ipv6_rcv+0x32b/0x880
? ip6_make_skb+0x1e0/0x1e0
__netif_receive_skb_core+0x6f2/0xdf0
? process_backlog+0x85/0x250
? process_backlog+0x85/0x250
? process_backlog+0xec/0x250
process_backlog+0xec/0x250
net_rx_action+0x153/0x480
__do_softirq+0xd9/0x4f7
do_softirq_own_stack+0x2a/0x40
</IRQ>
...
While not present in the backtrace, ipv6_renew_option() ends up calling
access_ok() via the following chain:
access_ok()
_copy_from_user()
copy_from_user()
ipv6_renew_option()
The fix presented in this patch is to perform the userspace copy
earlier in the call chain such that it is only called when the option
data is actually coming from userspace; that place is
do_ipv6_setsockopt(). Not only does this solve the problem seen in
the backtrace above, it also allows us to simplify the code quite a
bit by removing ipv6_renew_options_kern() completely. We also take
this opportunity to cleanup ipv6_renew_options()/ipv6_renew_option()
a small amount as well.
This patch is heavily based on a rough patch by Al Viro. I've taken
his original patch, converted a kmemdup() call in do_ipv6_setsockopt()
to a memdup_user() call, made better use of the e_inval jump target in
the same function, and cleaned up the use ipv6_renew_option() by
ipv6_renew_options().
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the code paths calculating flow hash for IPv6 use flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between path. It's also inconsistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.
Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018da ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the code paths calculating flow hash for IPv6 use flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between path. It's also incosistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.
Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018da ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, more relevant updates in this batch are:
1) Add Maglev support to IPVS. Moreover, store lastest server weight in
IPVS since this is needed by maglev, patches from from Inju Song.
2) Preparation works to add iptables flowtable support, patches
from Felix Fietkau.
3) Hand over flows back to conntrack slow path in case of TCP RST/FIN
packet is seen via new teardown state, also from Felix.
4) Add support for extended netlink error reporting for nf_tables.
5) Support for larger timeouts that 23 days in nf_tables, patch from
Florian Westphal.
6) Always set an upper limit to dynamic sets, also from Florian.
7) Allow number generator to make map lookups, from Laura Garcia.
8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat.
9) Extend ip6tables SRH match to support previous, next and last SID,
from Ahmed Abdelsalam.
10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez.
11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.
12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
from Taehee Yoo.
13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
Make rt and exthdr built-in too, again from Florian.
14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.
To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.
A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.
Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.
The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.
Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles
tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles
tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles
udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles
udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles
[*] after reverting commit 0a6b2a1dc2
("tcp: switch to GSO being always on")
Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:
perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.
This patch is a noop otherwise.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just like ip_dst_mtu_maybe_forward(), to avoid a dependency with ipv6.ko.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is
not invoked directly outside of af_inet6.c. Make it static and move
inet6_bind after to avoid forward declaration.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add 'connected' parameter to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and a socket
is connected.
The function is used as before, the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-03-31
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add raw BPF tracepoint API in order to have a BPF program type that
can access kernel internal arguments of the tracepoints in their
raw form similar to kprobes based BPF programs. This infrastructure
also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
returns an anon-inode backed fd for the tracepoint object that allows
for automatic detach of the BPF program resp. unregistering of the
tracepoint probe on fd release, from Alexei.
2) Add new BPF cgroup hooks at bind() and connect() entry in order to
allow BPF programs to reject, inspect or modify user space passed
struct sockaddr, and as well a hook at post bind time once the port
has been allocated. They are used in FB's container management engine
for implementing policy, replacing fragile LD_PRELOAD wrapper
intercepting bind() and connect() calls that only works in limited
scenarios like glibc based apps but not for other runtimes in
containerized applications, from Andrey.
3) BPF_F_INGRESS flag support has been added to sockmap programs for
their redirect helper call bringing it in line with cls_bpf based
programs. Support is added for both variants of sockmap programs,
meaning for tx ULP hooks as well as recv skb hooks, from John.
4) Various improvements on BPF side for the nfp driver, besides others
this work adds BPF map update and delete helper call support from
the datapath, JITing of 32 and 64 bit XADD instructions as well as
offload support of bpf_get_prandom_u32() call. Initial implementation
of nfp packet cache has been tackled that optimizes memory access
(see merge commit for further details), from Jakub and Jiong.
5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
API has been done in order to prepare to use print_bpf_insn() soon
out of perf tool directly. This makes the print_bpf_insn() API more
generic and pushes the env into private data. bpftool is adjusted
as well with the print_bpf_insn() argument removal, from Jiri.
6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
Format). The latter will reuse the current BPF verifier log as
well, thus bpf_verifier_log() is further generalized, from Martin.
7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
and write support has been added in similar fashion to existing
IPv6 IPV6_TCLASS socket option we already have, from Nikita.
8) Fixes in recent sockmap scatterlist API usage, which did not use
sg_init_table() for initialization thus triggering a BUG_ON() in
scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
uses a small helper sg_init_marker() to properly handle the affected
cases, from Prashant.
9) Let the BPF core follow IDR code convention and therefore use the
idr_preload() and idr_preload_end() helpers, which would also help
idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
pressure, from Shaohua.
10) Last but not least, a spelling fix in an error message for the
BPF cookie UID helper under BPF sample code, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405 ("inet: frag: don't account number
of fragment queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to simplify the API, add a pointer to struct inet_frags.
This will allow us to make things less complex.
These functions no longer have a struct inet_frags parameter :
inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
csum field in struct frag_queue is not used, remove it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These function can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.
New functions have two additional arguments.
`force_bind_address_no_port` forces binding to IP only w/o checking
`inet_sock.bind_address_no_port` field. It'll allow to bind local end of
a connection to desired IP in `bpf_bind()` w/o changing
`bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
can return an error and we'd need to restore original value of
`bind_address_no_port` in that case if we changed this before calling to
the helper.
`with_lock` specifies whether to lock socket when working with `struct
sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
behavior is preserved. But it will be set to `false` for `bpf_bind()`
use-case. The reason is all call-sites, where `bpf_bind()` will be
called, already hold that socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In order to abstract away access to the
ipv6.sysctl.multipath_hash_policy variable, which is not available on
systems compiled without IPv6 support, introduce a wrapper function
ip6_multipath_hash_policy() that falls back to 0 on non-IPv6 systems.
Use this wrapper from mlxsw/spectrum_router instead of a direct
reference.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ran simple script to find/remove trailing whitespace and blank lines
at EOF because that kind of stuff git whines about and editors leave
behind.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes since v1:
Added changes in these files:
drivers/infiniband/hw/usnic/usnic_transport.c
drivers/staging/lustre/lnet/lnet/lib-socket.c
drivers/target/iscsi/iscsi_target_login.c
drivers/vhost/net.c
fs/dlm/lowcomms.c
fs/ocfs2/cluster/tcp.c
security/tomoyo/network.c
Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.
"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.
None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.
This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.
Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.
Userspace API is not changed.
text data bss dec hex filename
30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
30108109 2633612 873672 33615393 200ee21 vmlinux.o
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 513674b5a2 ("net: reevalulate autoflowlabel setting after
sysctl setting") removed the initialisation of
ipv6_pinfo::autoflowlabel and added a second flag to indicate
whether this field or the net namespace default should be used.
The getsockopt() handling for this case was not updated, so it
currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
not explicitly enabled. Fix it to return the effective value, whether
that has been set at the socket or net namespace level.
Fixes: 513674b5a2 ("net: reevalulate autoflowlabel setting after sysctl ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the IPv6 flow table type, that implements the datapath
flow table to forward IPv6 traffic.
This patch exports ip6_dst_mtu_forward() that is required to check for
mtu to pass up packets that need PMTUD handling to the classic
forwarding path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch moves the udp[46]_portaddr_hash()
to net/ip[v6].h. The function name is renamed to
ipv[46]_portaddr_hash().
It will be used by a later patch which adds a second listener
hashtable hashed by the address and port.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.
Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.
It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").
(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.
Tested
Booted a v4.13 guest kernel with QEMU. On a host kernel before this
patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
enabled, same as on a v4.13 host kernel.
A UFO packet sent from the guest appears on the tap device:
host:
nc -l -p -u 8000 &
tcpdump -n -i tap0
guest:
dd if=/dev/zero of=payload.txt bs=1 count=2000
nc -u 192.16.1.1 8000 < payload.txt
Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
packets arriving fragmented:
./with_tap_pair.sh ./tap_send_ufo tap0 tap1
(from https://github.com/wdebruij/kerneltools/tree/master/tests)
Changes
v1 -> v2
- simplified set_offload change (review comment)
- documented test procedure
Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_frag_id was only used by UFO, which has been removed.
ipv6_proxy_select_ident() only existed to set ip6_frag_id and has no
in-tree callers.
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 8200 (IPv6) defines Hop-by-Hop options and Destination options
extension headers. Both of these carry a list of TLVs which is
only limited by the maximum length of the extension header (2048
bytes). By the spec a host must process all the TLVs in these
options, however these could be used as a fairly obvious
denial of service attack. I think this could in fact be
a significant DOS vector on the Internet, one mitigating
factor might be that many FWs drop all packets with EH (and
obviously this is only IPv6) so an Internet wide attack might not
be so effective (yet!).
By my calculation, the worse case packet with TLVs in a standard
1500 byte MTU packet that would be processed by the stack contains
1282 invidual TLVs (including pad TLVS) or 724 two byte TLVs. I
wrote a quick test program that floods a whole bunch of these
packets to a host and sure enough there is substantial time spent
in ip6_parse_tlv. These packets contain nothing but unknown TLVS
(that are ignored), TLV padding, and bogus UDP header with zero
payload length.
25.38% [kernel] [k] __fib6_clean_all
21.63% [kernel] [k] ip6_parse_tlv
4.21% [kernel] [k] __local_bh_enable_ip
2.18% [kernel] [k] ip6_pol_route.isra.39
1.98% [kernel] [k] fib6_walk_continue
1.88% [kernel] [k] _raw_write_lock_bh
1.65% [kernel] [k] dst_release
This patch adds configurable limits to Destination and Hop-by-Hop
options. There are three limits that may be set:
- Limit the number of options in a Hop-by-Hop or Destination options
extension header.
- Limit the byte length of a Hop-by-Hop or Destination options
extension header.
- Disallow unrecognized options in a Hop-by-Hop or Destination
options extension header.
The limits are set in corresponding sysctls:
ipv6.sysctl.max_dst_opts_cnt
ipv6.sysctl.max_hbh_opts_cnt
ipv6.sysctl.max_dst_opts_len
ipv6.sysctl.max_hbh_opts_len
If a max_*_opts_cnt is less than zero then unknown TLVs are disallowed.
The number of known TLVs that are allowed is the absolute value of
this number.
If a limit is exceeded when processing an extension header the packet is
dropped.
Default values are set to 8 for options counts, and set to INT_MAX
for maximum length. Note the choice to limit options to 8 is an
arbitrary guess (roughly based on the fact that the stack supports
three HBH options and just one destination option).
These limits have being proposed in draft-ietf-6man-rfc6434-bis.
Tested (by Martin Lau)
I tested out 1 thread (i.e. one raw_udp process).
I changed the net.ipv6.max_dst_(opts|hbh)_number between 8 to 2048.
With sysctls setting to 2048, the softirq% is packed to 100%.
With 8, the softirq% is almost unnoticable from mpstat.
v2;
- Code and documention cleanup.
- Change references of RFC2460 to be RFC8200.
- Add reference to RFC6434-bis where the limits will be in standard.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit cc71b7b071 ("net/ipv6: remove unused err variable on
icmpv6_push_pending_frames") exposed icmpv6_push_pending_frames
return value not being used.
Remove now unnecessary int err declarations and uses.
Miscellanea:
o Remove unnecessary goto and out: labels
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander reported various KASAN messages triggered in recent kernels
The problem is that ping sockets should not use udp_poll() in the first
place, and recent changes in UDP stack finally exposed this old bug.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Tested-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_make_flowlabel() determines the flow label for IPv6 packets. It's
supposed to be passed a flow label, which it returns as is if non-0 and
in some other cases, otherwise it calculates a new value.
The problem is callers often pass a flowi6.flowlabel, which may also
contain traffic class bits. If the traffic class is non-0
ip6_make_flowlabel() mistakes the non-0 it gets as a flow label and
returns the whole thing. Thus it can return a 'flow label' longer than
20b and the low 20b of that is typically 0 resulting in packets with 0
label. Moreover, different packets of a flow may be labeled differently.
For a TCP flow with ECN non-payload and payload packets get different
labels as exemplified by this pair of consecutive packets:
(pure ACK)
Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
0110 .... = Version: 6
.... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
.... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
.... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0)
.... .... .... 0001 1100 1110 0100 1001 = Flow Label: 0x1ce49
Payload Length: 32
Next Header: TCP (6)
(payload)
Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
0110 .... = Version: 6
.... 0000 0010 .... .... .... .... .... = Traffic Class: 0x02 (DSCP: CS0, ECN: ECT(0))
.... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
.... .... ..10 .... .... .... .... .... = Explicit Congestion Notification: ECN-Capable Transport codepoint '10' (2)
.... .... .... 0000 0000 0000 0000 0000 = Flow Label: 0x00000
Payload Length: 688
Next Header: TCP (6)
This patch allows ip6_make_flowlabel() to be passed more than just a
flow label and has it extract the part it really wants. This was simpler
than modifying the callers. With this patch packets like the above become
Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
0110 .... = Version: 6
.... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
.... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
.... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0)
.... .... .... 1010 1111 1010 0101 1110 = Flow Label: 0xafa5e
Payload Length: 32
Next Header: TCP (6)
Internet Protocol Version 6, Src: 2002:af5:11a3::, Dst: 2002:af5:11a2::
0110 .... = Version: 6
.... 0000 0010 .... .... .... .... .... = Traffic Class: 0x02 (DSCP: CS0, ECN: ECT(0))
.... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
.... .... ..10 .... .... .... .... .... = Explicit Congestion Notification: ECN-Capable Transport codepoint '10' (2)
.... .... .... 1010 1111 1010 0101 1110 = Flow Label: 0xafa5e
Payload Length: 688
Next Header: TCP (6)
Signed-off-by: Dimitris Michailidis <dmichail@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike ipv4, this control socket is shared by all cpus so we cannot use
it as scratchpad area to annotate the mark that we pass to ip6_xmit().
Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
family caches the flowi6 structure in the sctp_transport structure, so
we cannot use to carry the mark unless we later on reset it back, which
I discarded since it looks ugly to me.
Fixes: bf99b4ded5 ("tcp: fix mark propagation with fwmark_reflect enabled")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Couple conflicts resolved here:
1) In the MACB driver, a bug fix to properly initialize the
RX tail pointer properly overlapped with some changes
to support variable sized rings.
2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
overlapping with a reorganization of the driver to support
ACPI, OF, as well as PCI variants of the chip.
3) In 'net' we had several probe error path bug fixes to the
stmmac driver, meanwhile a lot of this code was cleaned up
and reorganized in 'net-next'.
4) The cls_flower classifier obtained a helper function in
'net-next' called __fl_delete() and this overlapped with
Daniel Borkamann's bug fix to use RCU for object destruction
in 'net'. It also overlapped with Jiri's change to guard
the rhashtable_remove_fast() call with a check against
tc_skip_sw().
5) In mlx4, a revert bug fix in 'net' overlapped with some
unrelated changes in 'net-next'.
6) In geneve, a stale header pointer after pskb_expand_head()
bug fix in 'net' overlapped with a large reorganization of
the same code in 'net-next'. Since the 'net-next' code no
longer had the bug in question, there was nothing to do
other than to simply take the 'net-next' hunks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.
This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
CALIPSO is a hop-by-hop IPv6 option. A lot of this patch is based on
the equivalent CISPO code. The main difference is due to manipulating
the options in the hop-by-hop header.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The functionality is equivalent to ipv6_renew_options() except
that the newopt pointer is in kernel, not user, memory
The kernel memory implementation will be used by the CALIPSO network
labelling engine, which needs to be able to set IPv6 hop-by-hop
options.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.
Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.
This more closely matches convention used to update
percpu variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.
It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.
Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). It is because a UDP socket can become
connected after sending out some datagrams in un-connected state. or
It can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected to.
It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.
For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium
A simple python code to create a connected UDP socket:
import socket
import errno
HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()
Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300
Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium
Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:
>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.
Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri mentioned that flowi6_tos of struct flowi6 is never used/read
anywhere. In fact, rest of the kernel uses the flowi6's flowlabel,
where the traffic class _and_ the flowlabel (aka flowinfo) is encoded.
For example, for policy routing, fib6_rule_match() uses ip6_tclass()
that is applied on the flowlabel member for matching on tclass. Similar
fix is needed for geneve, where flowi6_tos is set as well. Installing
a v6 blackhole rule that f.e. matches on tos is now working with vxlan.
Fixes: 1400615d64 ("vxlan: allow setting ipv6 traffic class")
Reported-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
follows up commit 45f6fad84c ("ipv6: add complete rcu protection around
np->opt") which added mixed rcu/refcount protection to np->opt.
Given the current implementation of rcu_pointer_handoff(), this has no
effect at runtime.
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a static inline function ipv6_addr_prefix_copy which
copies a ipv6 address prefix(argument pfx) into the ipv6 address prefix.
The prefix len is given by plen as bits. This function mainly based on
ipv6_addr_prefix which copies one address prefix from address into a new
ipv6 address destination and zero all other address bits.
The difference is that ipv6_addr_prefix_copy don't get a prefix from an
ipv6 address, it sets a prefix to an ipv6 address with keeping other
address bits. The use case is for context based address compression
inside 6LoWPAN IPHC header which keeping ipv6 prefixes inside a context
table to lookup address-bits without sending them.
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Łukasz Duda <lukasz.duda@nordicsemi.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch addresses multiple problems :
UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program desmonstrating
use-after-free.
Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())
This patch adds full RCU protection to np->opt
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.
To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.
Note: similar patch has been already submitted by Yoshifuji Hideaki in
http://patchwork.ozlabs.org/patch/220979/
but got lost and forgotten for some reason.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop hidding the sk parameter with an inline helper function and make
all of the callers pass it, so that it is clear what the function is
doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only __ip6_local_out_sk has callers so rename __ip6_local_out_sk
__ip6_local_out and remove the previous __ip6_local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with the other similar methods in the kernel pass a
struct sock into the dst_ops .local_out method.
Simplifying the socket passing case is needed a prequel to passing a
struct net reference into .local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to document that socket lock might not be held at this point.
skb_set_owner_w() and ipv6_local_error() are using proper atomic ops
or spinlocks, so we promote the socket to non const when calling them.
netfilter hooks should never assume socket lock is held,
we also promote the socket to non const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_dst_lookup_flow() and ip6_dst_lookup_tail() do not touch
socket, lets add a const qualifier.
This will permit the same change in inet6_csk_route_req()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.
As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.
To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize auto_flowlabels to one. This enables automatic flow labels,
individual socket may disable them using the IPV6_AUTOFLOWLABEL socket
option.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:
0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets
np->autoflowlabel is initialized according to the sysctl value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't call skb_get_hash here since the packet is not complete to do
flow_dissector. Create hash based on flowi6 instead.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.
All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates sk_set_txhash and eliminates protocol specific
inet_set_txhash and ip6_set_txhash. sk_set_txhash simply sets a
random number instead of performing flow dissection. sk_set_txash
is also allowed to be called multiple times for the same socket,
we'll need this when redoing the hash for negative routing advice.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds full IPv6 addresses into flow_keys and uses them as
input to the flow hash function. The implementation supports either
IPv4 or IPv6 addresses in a union, and selector is used to determine
how may words to input to jhash2.
We also add flow_get_u32_dst and flow_get_u32_src functions which are
used to get a u32 representation of the source and destination
addresses. For IPv6, ipv6_addr_hash is called. These functions retain
getting the legacy values of src and dst in flow_keys.
With this patch, Ethertype and IP protocol are now included in the
flow hash input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes flow hashing to use jhash2 over the flow_keys
structure instead just doing jhash_3words over src, dst, and ports.
This method will allow us take more input into the hashing function
so that we can include full IPv6 addresses, VLAN, flow labels etc.
without needing to resort to xor'ing which makes for a poor hash.
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_select_ident() returns a 32bit value in network order.
Fixes: 286c2349f6 ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address. The dst and src can be recovered from
the calling site.
We may consider to rename (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param to
only set the frag_hdr->identification.
It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" later to check fragment id
has been generated or not.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit, it does not affect receive. This range split
can be disbaled by systcl.
Background:
IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to to define new
encaps in UDP just for getting ECMP.
Unfortunately, the initial specfications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard how to this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one. Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.
This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.
Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.
Based upon a patch by Jiri Pirko.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
in favor of their inner __ ones, which doesn't grab rtnl.
As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.
So this patch:
- move rtnl handling to callers instead while already fixing some
reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on the igmp v4 changes from Eric Dumazet.
959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()")
These changes are needed to perform igmp v6 join/leave while
RTNL is held.
Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
__ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid
proliferation of work queues.
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make __ipv6_select_ident() static as it isn't used outside
the file.
Fixes: 0508c07f5e (ipv6: Select fragment id during UFO segmentation if not set.)
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
include/net/ipv6.h:713:22: warning: incorrect type in assignment (different base types)
include/net/ipv6.h:713:22: expected restricted __be32 [usertype] hash
include/net/ipv6.h:713:22: got unsigned int
include/net/ipv6.h:719:25: warning: restricted __be32 degrades to integer
include/net/ipv6.h:719:22: warning: invalid assignment: ^=
include/net/ipv6.h:719:22: left side has type restricted __be32
include/net/ipv6.h:719:22: right side has type unsigned int
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the IPv6 fragment id has not been set and we perform
fragmentation due to UFO, select a new fragment id.
We now consider a fragment id of 0 as unset and if id selection
process returns 0 (after all the pertrubations), we set it to
0x80000000, thus giving us ample space not to create collisions
with the next packet we may have to fragment.
When doing UFO integrity checking, we also select the
fragment id if it has not be set yet. This is stored into
the skb_shinfo() thus allowing UFO to function correclty.
This patch also removes duplicate fragment id generation code
and moves ipv6_select_ident() into the header as it may be
used during GSO.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is very similar to
commit 1c32c5ad6f
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue Mar 1 02:36:47 2011 +0000
inet: Add ip_make_skb and ip_finish_skb
It adds IPv6 version of the helpers ip6_make_skb and ip6_finish_skb.
The job of ip6_make_skb is to collect messages into an ipv6 packet
and poplulate ipv6 eader. The job of ip6_finish_skb is to transmit
the generated skb. Together they replicated the job of
ip6_push_pending_frames() while also provide the capability to be
called independently. This will be needed to add lockless UDP sendmsg
support.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UFO is now disabled on all drivers that work with virtio net headers,
but userland may try to send UFO/IPv6 packets anyway. Instead of
sending with ID=0, we should select identifiers on their behalf (as we
used to).
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm
that it needs. Lets not assume this, as TCP stack might use a different
place.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
robustness variable. It specifies how many retransmit of unsolicited mld
retransmit should happen. Admins might want to tune this on lossy links.
Also reset mld state on interface down/up, so we pick up new sysctl
settings during interface up event.
IPv6 certification requests this knob to be available.
I didn't make this knob netns specific, as it is mostly a setting in a
physical environment and should be per host.
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'nqueues' counter is protected by the lru list lock,
once thats removed this needs to be converted to atomic
counter. Given this isn't used for anything except for
reporting it to userspace via /proc, just remove it.
We still report the memory currently used by fragment
reassembly queues.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
sparse is throwing warnings when building sunrpc modules due to some
endianness shenanigans in ipv6.h. Specifically:
CHECK net/sunrpc/addr.c
include/net/ipv6.h:573:17: warning: restricted __be64 degrades to integer
include/net/ipv6.h:577:34: warning: restricted __be32 degrades to integer
include/net/ipv6.h:573:17: warning: restricted __be64 degrades to integer
include/net/ipv6.h:577:34: warning: restricted __be32 degrades to integer
Sprinkle some endianness fixups to silence them. These should all get
fixed up at compile time, so I don't think this will add any extra work
to be done at runtime.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cb1ce2ef38 ("ipv6: Implement automatic flow label generation
on transmit") introduced ip6_make_flowlabel, while commit b73c3d0e4f
("net: Save TX flow hash in sock and set in skbuf on xmit") introduced
ip6_set_txhash.
ip6_set_tx_hash() uses sk_v6_daddr which references
__sk_common.skc_v6_daddr from struct sock_common, which is gated with
IS_ENABLED(CONFIG_IPV6).
ip6_make_flowlabel() uses the ipv6 member from struct net which is
also gated with IS_ENABLED(CONFIG_IPV6).
When CONFIG_IPV6 is disabled, we will hit a build failure that looks
like this when the compiler attempts inlining these functions:
CC [M] drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.o
In file included from include/net/inet_sock.h:27:0,
from include/net/ip.h:30,
from drivers/net/ethernet/broadcom/cnic.c:37:
include/net/ipv6.h: In function 'ip6_set_txhash':
include/net/sock.h:327:33: error: 'struct sock_common' has no member named 'skc_v6_daddr'
#define sk_v6_daddr __sk_common.skc_v6_daddr
^
include/net/ipv6.h:696:49: note: in expansion of macro 'sk_v6_daddr'
keys.dst = (__force __be32)ipv6_addr_hash(&sk->sk_v6_daddr);
^
In file included from include/net/inetpeer.h:15:0,
from include/net/route.h:28,
from include/net/ip.h:31,
from drivers/net/ethernet/broadcom/cnic.c:37:
include/net/ipv6.h: In function 'ip6_make_flowlabel':
include/net/ipv6.h:706:37: error: 'struct net' has no member named 'ipv6'
if (!flowlabel && (autolabel || net->ipv6.sysctl.auto_flowlabels)) {
^
Fixes: cb1ce2ef38 ("ipv6: Implement automatic flow label generation on transmit")
Fixes: b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf on xmit")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Automatically generate flow labels for IPv6 packets on transmit.
The flow label is computed based on skb_get_hash. The flow label will
only automatically be set when it is zero otherwise (i.e. flow label
manager hasn't set one). This supports the transmit side functionality
of RFC 6438.
Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
functionality per socket.
By default, auto flowlabels are disabled to avoid possible conflicts
with flow label manager, however if this feature proves useful we
may want to enable it by default.
It should also be noted that FreeBSD has already implemented automatic
flow labels (including the sysctl and socket option). In FreeBSD,
automatic flow labels default to enabled.
Performance impact:
Running super_netperf with 200 flows for TCP_RR and UDP_RR for
IPv6. Note that in UDP case, __skb_get_hash will be called for
every packet with explains slight regression. In the TCP case
the hash is saved in the socket so there is no regression.
Automatic flow labels disabled:
TCP_RR:
86.53% CPU utilization
127/195/322 90/95/99% latencies
1.40498e+06 tps
UDP_RR:
90.70% CPU utilization
118/168/243 90/95/99% latencies
1.50309e+06 tps
Automatic flow labels enabled:
TCP_RR:
85.90% CPU utilization
128/199/337 90/95/99% latencies
1.40051e+06
UDP_RR
92.61% CPU utilization
115/164/236 90/95/99% latencies
1.4687e+06
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For a connected socket we can precompute the flow hash for setting
in skb->hash on output. This is a performance advantage over
calculating the skb->hash for every packet on the connection. The
computation is done using the common hash algorithm to be consistent
with computations done for packets of the connection in other states
where thers is no socket (e.g. time-wait, syn-recv, syn-cookies).
This patch adds sk_txhash to the sock structure. inet_set_txhash and
ip6_set_txhash functions are added which are called from points in
TCP and UDP where socket moves to established state.
skb_set_hash_from_sk is a function which sets skb->hash from the
sock txhash value. This is called in UDP and TCP transmit path when
transmitting within the context of a socket.
Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
interface (in this case skb_get_hash called on every TX packet to
create a UDP source port).
Before fix:
95.02% CPU utilization
154/256/505 90/95/99% latencies
1.13042e+06 tps
Time in functions:
0.28% skb_flow_dissect
0.21% __skb_get_hash
After fix:
94.95% CPU utilization
156/254/485 90/95/99% latencies
1.15447e+06
Neither __skb_get_hash nor skb_flow_dissect appear in perf
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.
This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.
Tested using user-mode linux:
- ICMP/ICMPv6 echo replies and errors.
- TCP RST packets (IPv4 and IPv6).
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces 6 identical code snippets with a call to a new
static inline function.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>