Steffen Klassert says:
====================
pull request (net): ipsec 2018-03-13
1) Refuse to insert 32 bit userspace socket policies on 64
bit systems like we do it for standard policies. We don't
have a compat layer, so inserting socket policies from
32 bit userspace will lead to a broken configuration.
2) Make the policy hold queue work without the flowcache.
Dummy bundles are not chached anymore, so we need to
generate a new one on each lookup as long as the SAs
are not yet in place.
3) Fix the validation of the esn replay attribute. The
The sanity check in verify_replay() is bypassed if
the XFRM_STATE_ESN flag is not set. Fix this by doing
the sanity check uncoditionally.
From Florian Westphal.
4) After most of the dst_entry garbage collection code
is removed, we may leak xfrm_dst entries as they are
neither cached nor tracked somewhere. Fix this by
reusing the 'uncached_list' to track xfrm_dst entries
too. From Xin Long.
5) Fix a rcu_read_lock/rcu_read_unlock imbalance in
xfrm_get_tos() From Xin Long.
6) Fix an infinite loop in xfrm_get_dst_nexthop. On
transport mode we fetch the child dst_entry after
we continue, so this pointer is never updated.
Fix this by fetching it before we continue.
7) Fix ESN sequence number gap after IPsec GSO packets.
We accidentally increment the sequence number counter
on the xfrm_state by one packet too much in the ESN
case. Fix this by setting the sequence number to the
correct value.
8) Reset the ethernet protocol after decapsulation only if a
mac header was set. Otherwise it breaks configurations
with TUN devices. From Yossi Kuperman.
9) Fix __this_cpu_read() usage in preemptible code. Use
this_cpu_read() instead in ipcomp_alloc_tfms().
From Greg Hackmann.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On unsuccesful ip6_datagram_connect(), if the failure is caused by
ip6_datagram_dst_update(), the sk peer information are cleared, but
the sk->sk_state is preserved.
If the socket was already in an established status, the overall sk
status is inconsistent and fouls later checks in datagram code.
Fix this saving the old peer information and restoring them in
case of failure. This also aligns ipv6 datagram connect() behavior
with ipv4.
v1 -> v2:
- added missing Fixes tag
Fixes: 85cb73ff9b ("net: ipv6: reset daddr and dport in sk if connect() fails")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds skb_cow_header() to ensure enough headroom
at ip6erspan_tunnel_xmit before pushing the erspan header
to the skb.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When users fill in incorrect erspan version number through
the struct erspan_metadata uapi, current code skips pushing
the erspan header but continue pushing the gre header, which
is incorrect. The patch fixes it by returning error.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds the erspan v2 proto in ip6gre_tunnel_lookup
so the erspan v2 tunnel can be found correctly.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the following slab-out-of-bounds kasan report in
ndisc_fill_redirect_hdr_option when the incoming ipv6 packet is not
linear and the accessed data are not in the linear data region of orig_skb.
[ 1503.122508] ==================================================================
[ 1503.122832] BUG: KASAN: slab-out-of-bounds in ndisc_send_redirect+0x94e/0x990
[ 1503.123036] Read of size 1184 at addr ffff8800298ab6b0 by task netperf/1932
[ 1503.123220] CPU: 0 PID: 1932 Comm: netperf Not tainted 4.16.0-rc2+ #124
[ 1503.123347] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 1503.123527] Call Trace:
[ 1503.123579] <IRQ>
[ 1503.123638] print_address_description+0x6e/0x280
[ 1503.123849] kasan_report+0x233/0x350
[ 1503.123946] memcpy+0x1f/0x50
[ 1503.124037] ndisc_send_redirect+0x94e/0x990
[ 1503.125150] ip6_forward+0x1242/0x13b0
[...]
[ 1503.153890] Allocated by task 1932:
[ 1503.153982] kasan_kmalloc+0x9f/0xd0
[ 1503.154074] __kmalloc_track_caller+0xb5/0x160
[ 1503.154198] __kmalloc_reserve.isra.41+0x24/0x70
[ 1503.154324] __alloc_skb+0x130/0x3e0
[ 1503.154415] sctp_packet_transmit+0x21a/0x1810
[ 1503.154533] sctp_outq_flush+0xc14/0x1db0
[ 1503.154624] sctp_do_sm+0x34e/0x2740
[ 1503.154715] sctp_primitive_SEND+0x57/0x70
[ 1503.154807] sctp_sendmsg+0xaa6/0x1b10
[ 1503.154897] sock_sendmsg+0x68/0x80
[ 1503.154987] ___sys_sendmsg+0x431/0x4b0
[ 1503.155078] __sys_sendmsg+0xa4/0x130
[ 1503.155168] do_syscall_64+0x171/0x3f0
[ 1503.155259] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 1503.155436] Freed by task 1932:
[ 1503.155527] __kasan_slab_free+0x134/0x180
[ 1503.155618] kfree+0xbc/0x180
[ 1503.155709] skb_release_data+0x27f/0x2c0
[ 1503.155800] consume_skb+0x94/0xe0
[ 1503.155889] sctp_chunk_put+0x1aa/0x1f0
[ 1503.155979] sctp_inq_pop+0x2f8/0x6e0
[ 1503.156070] sctp_assoc_bh_rcv+0x6a/0x230
[ 1503.156164] sctp_inq_push+0x117/0x150
[ 1503.156255] sctp_backlog_rcv+0xdf/0x4a0
[ 1503.156346] __release_sock+0x142/0x250
[ 1503.156436] release_sock+0x80/0x180
[ 1503.156526] sctp_sendmsg+0xbb0/0x1b10
[ 1503.156617] sock_sendmsg+0x68/0x80
[ 1503.156708] ___sys_sendmsg+0x431/0x4b0
[ 1503.156799] __sys_sendmsg+0xa4/0x130
[ 1503.156889] do_syscall_64+0x171/0x3f0
[ 1503.156980] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 1503.157158] The buggy address belongs to the object at ffff8800298ab600
which belongs to the cache kmalloc-1024 of size 1024
[ 1503.157444] The buggy address is located 176 bytes inside of
1024-byte region [ffff8800298ab600, ffff8800298aba00)
[ 1503.157702] The buggy address belongs to the page:
[ 1503.157820] page:ffffea0000a62a00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
[ 1503.158053] flags: 0x4000000000008100(slab|head)
[ 1503.158171] raw: 4000000000008100 0000000000000000 0000000000000000 00000001800e000e
[ 1503.158350] raw: dead000000000100 dead000000000200 ffff880036002600 0000000000000000
[ 1503.158523] page dumped because: kasan: bad access detected
[ 1503.158698] Memory state around the buggy address:
[ 1503.158816] ffff8800298ab900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1503.158988] ffff8800298ab980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1503.159165] >ffff8800298aba00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1503.159338] ^
[ 1503.159436] ffff8800298aba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1503.159610] ffff8800298abb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1503.159785] ==================================================================
[ 1503.159964] Disabling lock debugging due to kernel taint
The test scenario to trigger the issue consists of 4 devices:
- H0: data sender, connected to LAN0
- H1: data receiver, connected to LAN1
- GW0 and GW1: routers between LAN0 and LAN1. Both of them have an
ethernet connection on LAN0 and LAN1
On H{0,1} set GW0 as default gateway while on GW0 set GW1 as next hop for
data from LAN0 to LAN1.
Moreover create an ip6ip6 tunnel between H0 and H1 and send 3 concurrent
data streams (TCP/UDP/SCTP) from H0 to H1 through ip6ip6 tunnel (send
buffer size is set to 16K). While data streams are active flush the route
cache on HA multiple times.
I have not been able to identify a given commit that introduced the issue
since, using the reproducer described above, the kasan report has been
triggered from 4.14 and I have not gone back further.
Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, administrative MTU changes on a given netdevice are
not reflected on route exceptions for MTU-less routes, with a
set PMTU value, for that device:
# ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a proto kernel src 2001:db8::a metric 256 pref medium
# ping6 -c 1 -q -s10000 2001:db8::b > /dev/null
# ip netns exec a ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
cache expires 571sec mtu 4926 pref medium
# ip link set dev vti_a mtu 3000
# ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
cache expires 571sec mtu 4926 pref medium
# ip link set dev vti_a mtu 9000
# ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
cache expires 571sec mtu 4926 pref medium
The first issue is that since commit fb56be83e4 ("net-ipv6: on
device mtu change do not add mtu to mtu-less routes") we don't
call rt6_exceptions_update_pmtu() from rt6_mtu_change_route(),
which handles administrative MTU changes, if the regular route
is MTU-less.
However, PMTU exceptions should be always updated, as long as
RTAX_MTU is not locked. Keep the check for MTU-less main route,
as introduced by that commit, but, for exceptions,
call rt6_exceptions_update_pmtu() regardless of that check.
Once that is fixed, one problem remains: MTU changes are not
reflected if the new MTU is higher than the previous one,
because rt6_exceptions_update_pmtu() doesn't allow that. We
should instead allow PMTU increase if the old PMTU matches the
local MTU, as that implies that the old MTU was the lowest in the
path, and PMTU discovery might lead to different results.
The existing check in rt6_mtu_change_route() correctly took that
case into account (for regular routes only), so factor it out
and re-use it also in rt6_exceptions_update_pmtu().
While at it, fix comments style and grammar, and try to be a bit
more descriptive.
Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: fb56be83e4 ("net-ipv6: on device mtu change do not add mtu to mtu-less routes")
Fixes: f5bbe7ee79 ("ipv6: prepare rt6_mtu_change() for exception table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Artem Savkov reported that commit 5efec5c655 leads to a packet loss under
IPSec configuration. It appears that his setup consists of a TUN device,
which does not have a MAC header.
Make sure MAC header exists.
Note: TUN device sets a MAC header pointer, although it does not have one.
Fixes: 5efec5c655 ("xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version")
Reported-by: Artem Savkov <artem.savkov@gmail.com>
Tested-by: Artem Savkov <artem.savkov@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Replace skb_gso_network_seglen() with
skb_gso_validate_network_len(), as it considers the GSO_BY_FRAGS
case.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?
skb_gso_validate_mtu gives you the answer to that question. However,
we recently added to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Put back reference on CLUSTERIP configuration structure from the
error path, patch from Florian Westphal.
2) Put reference on CLUSTERIP configuration instead of freeing it,
another cpu may still be walking over it, also from Florian.
3) Refetch pointer to IPv6 header from nf_nat_ipv6_manip_pkt() given
packet manipulation may reallocation the skbuff header, from Florian.
4) Missing match size sanity checks in ebt_among, from Florian.
5) Convert BUG_ON to WARN_ON in ebtables, from Florian.
6) Sanity check userspace offsets from ebtables kernel, from Florian.
7) Missing checksum replace call in flowtable IPv4 DNAT, from Felix
Fietkau.
8) Bump the right stats on checksum error from bridge netfilter,
from Taehee Yoo.
9) Unset interface flag in IPv6 fib lookups otherwise we get
misleading routing lookup results, from Florian.
10) Missing sk_to_full_sk() in ip6_route_me_harder() from Eric Dumazet.
11) Don't allow devices to be part of multiple flowtables at the same
time, this may break setups.
12) Missing netlink attribute validation in flowtable deletion.
13) Wrong array index in nf_unregister_net_hook() call from error path
in flowtable addition path.
14) Fix FTP IPVS helper when NAT mangling is in place, patch from
Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 128bb975dc ("ip6_gre: init dev->mtu and dev->hard_header_len
correctly") fixed IFLA_MTU ignored on NEWLINK for ip6_gre. The same
mtu fix is also needed for sit.
Note that dev->hard_header_len setting for sit works fine, no need to
fix it. sit is actually ipv4 tunnel, it can't call ip6_tnl_change_mtu
to set mtu.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 128bb975dc ("ip6_gre: init dev->mtu and dev->hard_header_len
correctly") fixed IFLA_MTU ignored on NEWLINK for ip6_gre. The same
mtu fix is also needed for ip6_tunnel.
Note that dev->hard_header_len setting for ip6_tunnel works fine,
no need to fix it.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some reason, Florian forgot to apply to ip6_route_me_harder
the fix that went in commit 29e09229d9 ("netfilter: use
skb_to_full_sk in ip_route_me_harder")
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
"fib" starts to behave strangely when an ipv6 default route is
added - the FIB lookup returns a route using 'oif' in this case.
This behaviour was inherited from ip6tables rpfilter so change
this as well.
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1221
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
l4proto->manip_pkt() can cause reallocation of skb head so pointer
to the ipv6 header must be reloaded.
Reported-and-tested-by: <syzbot+10005f4292fc9cc89de7@syzkaller.appspotmail.com>
Fixes: 58a317f106 ("netfilter: ipv6: add IPv6 NAT support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
gcc-8 has a new warning that detects overlapping input and output arguments
in memcpy(). It triggers for sit_init_net() calling ipip6_tunnel_clone_6rd(),
which is actually correct:
net/ipv6/sit.c: In function 'sit_init_net':
net/ipv6/sit.c:192:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]
The problem here is that the logic detecting the memcpy() arguments finds them
to be the same, but the conditional that tests for the input and output of
ipip6_tunnel_clone_6rd() to be identical is not a compile-time constant.
We know that netdev_priv(t->dev) is the same as t for a tunnel device,
and comparing "dev" directly here lets the compiler figure out as well
that 'dev == sitn->fb_tunnel_dev' when called from sit_init_net(), so
it no longer warns.
This code is old, so Cc stable to make sure that we don't get the warning
for older kernels built with new gcc.
Cc: Martin Sebor <msebor@gmail.com>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83456
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains large batch with Netfilter fixes for
your net tree, mostly due to syzbot report fixups and pr_err()
ratelimiting, more specifically, they are:
1) Get rid of superfluous unnecessary check in x_tables before vmalloc(),
we don't hit BUG there anymore, patch from Michal Hock, suggested by
Andrew Morton.
2) Race condition in proc file creation in ipt_CLUSTERIP, from Cong Wang.
3) Drop socket lock that results in circular locking dependency, patch
from Paolo Abeni.
4) Drop packet if case of malformed blob that makes backpointer jump
in x_tables, from Florian Westphal.
5) Fix refcount leak due to race in ipt_CLUSTERIP in
clusterip_config_find_get(), from Cong Wang.
6) Several patches to ratelimit pr_err() for x_tables since this can be
a problem where CAP_NET_ADMIN semantics can protect us in untrusted
namespace, from Florian Westphal.
7) Missing .gitignore update for new autogenerated asn1 state machine
for the SNMP NAT helper, from Zhu Lingshan.
8) Missing timer initialization in xt_LED, from Paolo Abeni.
9) Do not allow negative port range in NAT, also from Paolo.
10) Lock imbalance in the xt_hashlimit rate match mode, patch from
Eric Dumazet.
11) Initialize workqueue before timer in the idletimer match,
from Eric Dumazet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since UDP-Lite is always using checksum, the following path is
triggered when calculating pseudo header for it:
udp4_csum_init() or udp6_csum_init()
skb_checksum_init_zero_check()
__skb_checksum_validate_complete()
The problem can appear if skb->len is less than CHECKSUM_BREAK. In
this particular case __skb_checksum_validate_complete() also invokes
__skb_checksum_complete(skb). If UDP-Lite is using partial checksum
that covers only part of a packet, the function will return bad
checksum and the packet will be dropped.
It can be fixed if we skip skb_checksum_init_zero_check() and only
set the required pseudo header checksum for UDP-Lite with partial
checksum before udp4_csum_init()/udp6_csum_init() functions return.
Fixes: ed70fcfcee ("net: Call skb_checksum_init in IPv4")
Fixes: e4f45b7f40 ("net: Call skb_checksum_init in IPv6")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In early time, when freeing a xdst, it would be inserted into
dst_garbage.list first. Then if it's refcnt was still held
somewhere, later it would be put into dst_busy_list in
dst_gc_task().
When one dev was being unregistered, the dev of these dsts in
dst_busy_list would be set with loopback_dev and put this dev.
So that this dev's removal wouldn't get blocked, and avoid the
kmsg warning:
kernel:unregister_netdevice: waiting for veth0 to become \
free. Usage count = 2
However after Commit 52df157f17 ("xfrm: take refcnt of dst
when creating struct xfrm_dst bundle"), the xdst will not be
freed with dst gc, and this warning happens.
To fix it, we need to find these xdsts that are still held by
others when removing the dev, and free xdst's dev and set it
with loopback_dev.
But unfortunately after flow_cache for xfrm was deleted, no
list tracks them anymore. So we need to save these xdsts
somewhere to release the xdst's dev later.
To make this easier, this patch is to reuse uncached_list to
track xdsts, so that the dev refcnt can be released in the
event NETDEV_UNREGISTER process of fib_netdev_notifier.
Thanks to Florian, we could move forward this fix quickly.
Fixes: 52df157f17 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
Reported-by: Jianlin Shi <jishi@redhat.com>
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Tested-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The rationale for removing the check is only correct for rulesets
generated by ip(6)tables.
In iptables, a jump can only occur to a user-defined chain, i.e.
because we size the stack based on number of user-defined chains we
cannot exceed stack size.
However, the underlying binary format has no such restriction,
and the validation step only ensures that the jump target is a
valid rule start point.
IOW, its possible to build a rule blob that has no user-defined
chains but does contain a jump.
If this happens, no jump stack gets allocated and crash occurs
because no jumpstack was allocated.
Fixes: 7814b6ec6d ("netfilter: xtables: don't save/restore jumpstack offset")
Reported-by: syzbot+e783f671527912cd9403@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The Syzbot reported a possible deadlock in the netfilter area caused by
rtnl lock, xt lock and socket lock being acquired with a different order
on different code paths, leading to the following backtrace:
Reviewed-by: Xin Long <lucien.xin@gmail.com>
======================================================
WARNING: possible circular locking dependency detected
4.15.0+ #301 Not tainted
------------------------------------------------------
syzkaller233489/4179 is trying to acquire lock:
(rtnl_mutex){+.+.}, at: [<0000000048e996fd>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
but task is already holding lock:
(&xt[i].mutex){+.+.}, at: [<00000000328553a2>]
xt_find_table_lock+0x3e/0x3e0 net/netfilter/x_tables.c:1041
which lock already depends on the new lock.
===
Since commit 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock
only in the required scope"), we already acquire the socket lock in
the innermost scope, where needed. In such commit I forgot to remove
the outer-most socket lock from the getsockopt() path, this commit
addresses the issues dropping it now.
v1 -> v2: fix bad subj, added relavant 'fixes' tag
Fixes: 22265a5c3c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
Fixes: 202f59afd4 ("netfilter: ipt_CLUSTERIP: do not hold dev")
Fixes: 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock only in the required scope")
Reported-by: syzbot+ddde1c7b7ff7442d7f2d@syzkaller.appspotmail.com
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
tracepoint tcp_send_reset requires a full socket to work. However, it
may be called when in TCP_TIME_WAIT:
case TCP_TW_RST:
tcp_v6_send_reset(sk, skb);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
To avoid this problem, this patch checks the socket with sk_fullsock()
before calling trace_tcp_send_reset().
Fixes: c24b14c46b ("tcp: add tracepoint trace_tcp_send_reset")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of differences in how ipv4 and ipv6 handle fib lookups,
verification of nexthops with onlink flag need to default to the main
table rather than the local table used by IPv4. As it stands an
address within a connected route on device 1 can be used with
onlink on device 2. Updating the table properly rejects the route
due to the egress device mismatch.
Update the extack message as well to show it could be a device
mismatch for the nexthop spec.
Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Verification of nexthops with onlink flag need to handle unreachable
routes. The lookup is only intended to validate the gateway address
is not a local address and if the gateway resolves the egress device
must match the given device. Hence, hitting any default reject route
is ok.
Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for you net tree, they
are:
1) Restore __GFP_NORETRY in xt_table allocations to mitigate effects of
large memory allocation requests, from Michal Hocko.
2) Release IPv6 fragment queue in case of error in fragmentation header,
this is a follow up to amend patch 83f1999cae, from Subash Abhinov
Kasiviswanathan.
3) Flowtable infrastructure depends on NETFILTER_INGRESS as it registers
a hook for each flowtable, reported by John Crispin.
4) Missing initialization of info->priv in xt_cgroup version 1, from
Cong Wang.
5) Give a chance to garbage collector to run after scheduling flowtable
cleanup.
6) Releasing flowtable content on nft_flow_offload module removal is
not required at all, there is not dependencies between this module
and flowtables, remove it.
7) Fix missing xt_rateest_mutex grabbing for hash insertions, also from
Cong Wang.
8) Move nf_flow_table_cleanup() routine to flowtable core, this patch is
a dependency for the next patch in this list.
9) Flowtable resources are not properly released on removal from the
control plane. Fix this resource leak by scheduling removal of all
entries and explicit call to the garbage collector.
10) nf_ct_nat_offset() declaration is dead code, this function prototype
is not used anywhere, remove it. From Taehee Yoo.
11) Fix another flowtable resource leak on entry insertion failures,
this patch also fixes a possible use-after-free. Patch from Felix
Fietkau.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This macro is only used by net/ipv6/mcast.c, but there is no reason
why it must be BUILD_BUG_ON_NULL().
Replace it with BUILD_BUG_ON_ZERO(), and remove BUILD_BUG_ON_NULL()
definition from <linux/build_bug.h>.
Link: http://lkml.kernel.org/r/1515121833-3174-3-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Ian Abbott <abbotti@mev.co.uk>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every flow_offload entry is added into the table twice. Because of this,
rhashtable_free_and_destroy can't be used, since it would call kfree for
each flow_offload object twice.
This patch cleans up the flowtable via nf_flow_table_iterate() to
schedule removal of entries by setting on the dying bit, then there is
an explicitly invocation of the garbage collector to release resources.
Based on patch from Felix Fietkau.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When an erspan tunnel device receives an erpsan packet with different
tunnel metadata (ex: version, index, hwid, direction), existing code
overwrites the tunnel device's erspan configuration with the received
packet's metadata. The patch fixes it.
Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: f551c91de2 ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d350a82302 ("net: erspan: create erspan metadata uapi header")
moves the erspan 'version' in front of the 'struct erspan_md2' for
later extensibility reason. This breaks the existing erspan metadata
extraction code because the erspan_md2 then has a 4-byte offset
to between the erspan_metadata and erspan_base_hdr. This patch
fixes it.
Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 1d7e2ed22f ("net: erspan: refactor existing erspan code")
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass all
hardened usercopy checks since these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over the
next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
JgOmUnQNJWCTwUUw5AS1
=tzmJ
-----END PGP SIGNATURE-----
Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
config NF_FLOW_TABLE depends on NETFILTER_INGRESS. If users forget to
enable this toggle, flowtable registration fails with EOPNOTSUPP.
Moreover, turn 'select NF_FLOW_TABLE' in every flowtable family flavour
into dependency instead, otherwise this new dependency on
NETFILTER_INGRESS causes a warning. This also allows us to remove the
explicit dependency between family flowtables <-> NF_TABLES and
NF_CONNTRACK, given they depend on the NF_FLOW_TABLE core that already
expresses the general dependencies for this new infrastructure.
Moreover, NF_FLOW_TABLE_INET depends on NF_FLOW_TABLE_IPV4 and
NF_FLOWTABLE_IPV6, which already depends on NF_FLOW_TABLE. So we can get
rid of direct dependency with NF_FLOW_TABLE.
In general, let's avoid 'select', it just makes things more complicated.
Reported-by: John Crispin <john@phrozen.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Failures were seen in ICMPv6 fragmentation timeout tests if they were
run after the RFC2460 failure tests. Kernel was not sending out the
ICMPv6 fragment reassembly time exceeded packet after the fragmentation
reassembly timeout of 1 minute had elapsed.
This happened because the frag queue was not released if an error in
IPv6 fragmentation header was detected by RFC2460.
Fixes: 83f1999cae ("netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix OOM that syskaller triggers with ipt_replace.size = -1 and
IPT_SO_SET_REPLACE socket option, from Dmitry Vyukov.
2) Check for too long extension name in xt_request_find_{match|target}
that result in out-of-bound reads, from Eric Dumazet.
3) Fix memory exhaustion bug in ipset hash:*net* types when adding ranges
that look like x.x.x.x-255.255.255.255, from Jozsef Kadlecsik.
4) Fix pointer leaks to userspace in x_tables, from Dmitry Vyukov.
5) Insufficient sanity checks in clusterip_tg_check(), also from Dmitry.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Significantly shrink the core networking routing structures. Result
of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf
2) Add netdevsim driver for testing various offloads, from Jakub
Kicinski.
3) Support cross-chip FDB operations in DSA, from Vivien Didelot.
4) Add a 2nd listener hash table for TCP, similar to what was done for
UDP. From Martin KaFai Lau.
5) Add eBPF based queue selection to tun, from Jason Wang.
6) Lockless qdisc support, from John Fastabend.
7) SCTP stream interleave support, from Xin Long.
8) Smoother TCP receive autotuning, from Eric Dumazet.
9) Lots of erspan tunneling enhancements, from William Tu.
10) Add true function call support to BPF, from Alexei Starovoitov.
11) Add explicit support for GRO HW offloading, from Michael Chan.
12) Support extack generation in more netlink subsystems. From Alexander
Aring, Quentin Monnet, and Jakub Kicinski.
13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
Russell King.
14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.
15) Many improvements and simplifications to the NFP driver bpf JIT,
from Jakub Kicinski.
16) Support for ipv6 non-equal cost multipath routing, from Ido
Schimmel.
17) Add resource abstration to devlink, from Arkadi Sharshevsky.
18) Packet scheduler classifier shared filter block support, from Jiri
Pirko.
19) Avoid locking in act_csum, from Davide Caratti.
20) devinet_ioctl() simplifications from Al viro.
21) More TCP bpf improvements from Lawrence Brakmo.
22) Add support for onlink ipv6 route flag, similar to ipv4, from David
Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
tls: Add support for encryption using async offload accelerator
ip6mr: fix stale iterator
net/sched: kconfig: Remove blank help texts
openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
tcp_nv: fix potential integer overflow in tcpnv_acked
r8169: fix RTL8168EP take too long to complete driver initialization.
qmi_wwan: Add support for Quectel EP06
rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
ipmr: Fix ptrdiff_t print formatting
ibmvnic: Wait for device response when changing MAC
qlcnic: fix deadlock bug
tcp: release sk_frag.page in tcp_disconnect
ipv4: Get the address of interface correctly.
net_sched: gen_estimator: fix lockdep splat
net: macb: Handle HRESP error
net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
ipv6: addrconf: break critical section in addrconf_verify_rtnl()
ipv6: change route cache aging logic
i40e/i40evf: Update DESC_NEEDED value to reflect larger value
bnxt_en: cleanup DIM work on device shutdown
...
Syzbot reported several deadlocks in the netfilter area caused by
rtnl lock and socket lock being acquired with a different order on
different code paths, leading to backtraces like the following one:
======================================================
WARNING: possible circular locking dependency detected
4.15.0-rc9+ #212 Not tainted
------------------------------------------------------
syzkaller041579/3682 is trying to acquire lock:
(sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
include/net/sock.h:1463 [inline]
(sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
but task is already holding lock:
(rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (rtnl_mutex){+.+.}:
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
find_check_entry.isra.7+0x935/0xcf0
net/ipv6/netfilter/ip6_tables.c:580
translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
SYSC_setsockopt net/socket.c:1849 [inline]
SyS_setsockopt+0x189/0x360 net/socket.c:1828
entry_SYSCALL_64_fastpath+0x29/0xa0
-> #0 (sk_lock-AF_INET6){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
lock_sock include/net/sock.h:1463 [inline]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
SYSC_setsockopt net/socket.c:1849 [inline]
SyS_setsockopt+0x189/0x360 net/socket.c:1828
entry_SYSCALL_64_fastpath+0x29/0xa0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(rtnl_mutex);
lock(sk_lock-AF_INET6);
lock(rtnl_mutex);
lock(sk_lock-AF_INET6);
*** DEADLOCK ***
1 lock held by syzkaller041579/3682:
#0: (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
The problem, as Florian noted, is that nf_setsockopt() is always
called with the socket held, even if the lock itself is required only
for very tight scopes and only for some operation.
This patch addresses the issues moving the lock_sock() call only
where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
does not need anymore to acquire both locks.
Fixes: 22265a5c3c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle were:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and in
kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends() and
read_barrier_depends().
- Torture-test updates.
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
torture: Save a line in stutter_wait(): while -> for
torture: Eliminate torture_runnable and perf_runnable
torture: Make stutter less vulnerable to compilers and races
locking/locktorture: Fix num reader/writer corner cases
locking/locktorture: Fix rwsem reader_delay
torture: Place all torture-test modules in one MAINTAINERS group
rcutorture/kvm-build.sh: Skip build directory check
rcutorture: Simplify functions.sh include path
rcutorture: Simplify logging
rcutorture/kvm-recheck-*: Improve result directory readability check
rcutorture/kvm.sh: Support execution from any directory
rcutorture/kvm.sh: Use consistent help text for --qemu-args
rcutorture/kvm.sh: Remove unused variable, `alldone`
rcutorture: Remove unused script, config2frag.sh
rcutorture/configinit: Fix build directory error message
rcutorture: Preempt RCU-preempt readers more vigorously
torture: Reduce #ifdefs for preempt_schedule()
rcu: Remove have_rcu_nocb_mask from tree_plugin.h
rcu: Add comment giving debug strategy for double call_rcu()
tracing, rcu: Hide trace event rcu_nocb_wake when not used
...
In current route cache aging logic, if a route has both RTF_EXPIRE and
RTF_GATEWAY set, the route will only be removed if the neighbor cache
has no NTF_ROUTER flag. Otherwise, even if the route has expired, it
won't get deleted.
Fix this logic to always check if the route has expired first and then
do the gateway neighbor cache check if previous check decide to not
remove the exception entry.
Fixes: 1859bac04f ("ipv6: remove from fib tree aged out RTF_CACHE dst")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unsolicited IPv6 neighbor advertisements should be sent after DAD
completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
addresses and have those sent by addrconf_dad_completed after DAD.
Fixes: 4a6e3c5def ("net: ipv6: send unsolicited NA on admin up")
Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
implicitly implies it is an ipv6only socket. However, in inet6_bind(),
this addr_type checking and setting sk->sk_ipv6only to 1 are only done
after sk->sk_prot->get_port(sk, snum) has been completed successfully.
This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
the 'get_port()'.
In particular, when binding SO_REUSEPORT UDP sockets,
udp_reuseport_add_sock(sk,...) is called. udp_reuseport_add_sock()
checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
sk2->sk_reuseport_cb. In this case, ipv6_only_sock(sk2) could be
1 while ipv6_only_sock(sk) is still 0 here. The end result is,
reuseport_alloc(sk) is called instead of adding sk to the existing
sk2->sk_reuseport_cb.
It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
IPv6 address (!ANY and !MAPPED). Only one of the socket will
receive packet.
The fix is to set the implicit sk_ipv6only before calling get_port().
The original sk_ipv6only has to be saved such that it can be restored
in case get_port() failed. The situation is similar to the
inet_reset_saddr(sk) after get_port() has failed.
Thanks to Calvin Owens <calvinowens@fb.com> who created an easy
reproduction which leads to a fix.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to IPv4 allow routes to be added with the RTNH_F_ONLINK flag.
The onlink option requires a gateway and a nexthop device. Any unicast
gateway is allowed (including IPv4 mapped addresses and unresolved
ones) as long as the gateway is not a local address and if it resolves
it must match the given device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
onlink verification needs to do a lookup in potentially different
table than the table in fib6_config and without the RT6_LOOKUP_F_IFACE
flag. Change ip6_nh_lookup_table to take table id and flags as input
arguments. Both verifications want to ignore link state, so add that
flag can stay in the lookup helper.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move existing code to validate nexthop into a helper. Follow on patch
adds support for nexthops marked with onlink, and this helper keeps
the complexity of ip6_route_info_create in check.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Originally the erspan fields are defined as a group into a __be16 field,
and use mask and offset to access each field. This is more costly due to
calling ntohs/htons. The patch changes it to use bitfields.
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>