We need to put the policies when re-using the pcpu xdst entry, else
this leaks the reference.
Fixes: ec30d78c14 ("xfrm: add xdst pcpu cache")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
On policies with a transport mode template, we pass the addresses
from the flowi to xfrm_state_find(), assuming that the IP addresses
(and address family) don't change during transformation.
Unfortunately our policy template validation is not strict enough.
It is possible to configure policies with transport mode template
where the address family of the template does not match the selectors
address family. This lead to stack-out-of-bound reads because
we compare arddesses of the wrong family. Fix this by refusing
such a configuration, address family can not change on transport
mode.
We use the assumption that, on transport mode, the first templates
address family must match the address family of the policy selector.
Subsequent transport mode templates must mach the address family of
the previous template.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
copy geniv when cloning the xfrm state.
x->geniv was not copied to the new state and migration would fail.
xfrm_do_migrate
..
xfrm_state_clone()
..
..
esp_init_aead()
crypto_alloc_aead()
crypto_alloc_tfm()
crypto_find_alg() return EAGAIN and failed
Signed-off-by: Antony Antony <antony@phenome.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When we do tunnel or beet mode, we pass saddr and daddr from the
template to xfrm_state_find(), this is ok. On transport mode,
we pass the addresses from the flowi, assuming that the IP
addresses (and address family) don't change during transformation.
This assumption is wrong in the IPv4 mapped IPv6 case, packet
is IPv4 and template is IPv6.
Fix this by catching address family missmatches of the policy
and the flow already before we do the lookup.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Code path when (encap_type < 0) does not verify the state is valid
before progressing.
This will result in a crash if, for instance, x->km.state ==
XFRM_STATE_ACQ.
Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This seems to be an obvious typo, NLA_U32 is type of the attribute, not its
(minimal) length.
Fixes: 077fbac405 ("net: xfrm: support setting an output mark.")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
syzbot reported a kernel warning in xfrm_state_fini(), which
indicates that we have entries left in the list
net->xfrm.state_all whose proto is zero. And
xfrm_id_proto_match() doesn't consider them as a match with
IPSEC_PROTO_ANY in this case.
Proto with value 0 is probably not a valid value, at least
verify_newsa_info() doesn't consider it valid either.
This patch fixes it by checking the proto value in
validate_tmpl() and rejecting invalid ones, like what iproute2
does in xfrm_xfrmproto_getbyname().
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
q->link.block is not initialized, that leads to EINVAL when one tries to
add filter there. So initialize it properly.
This can be reproduced by:
$ tc qdisc add dev eth0 root handle 1: cbq avpkt 1000 rate 1000Mbit bandwidth 1000Mbit
$ tc filter add dev eth0 parent 1: protocol ip prio 100 u32 match ip protocol 0 0x00 flowid 1:1
Reported-by: Jaroslav Aster <jaster@redhat.com>
Reported-by: Ivan Vecera <ivecera@redhat.com>
Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent commit (3b4477d2dc) converted the sk_state to use
TCP constants. In that change, vmci_transport_handle_detach
was changed such that sk->sk_state was set to TCP_CLOSE before
we test whether it is TCP_SYN_SENT. This change moves the
sk_state change back to the original locations in that function.
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d04adf1b35 ("sctp: reset owner sk for data chunks on out queues
when migrating a sock") made a mistake that using 'list' as the param of
list_for_each_entry to traverse the retransmit, sacked and abandoned
queues, while chunks are using 'transmitted_list' to link into these
queues.
It could cause NULL dereference panic if there are chunks in any of these
queues when peeling off one asoc.
So use the chunk member 'transmitted_list' instead in this patch.
Fixes: d04adf1b35 ("sctp: reset owner sk for data chunks on out queues when migrating a sock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While converting sch_sfq to use timer_setup(), the commit cdeabbb881
("net: sched: Convert timers to use timer_setup()") forgot to
initialize the 'sch' field. As a result, the timer callback tries to
dereference a NULL pointer, and the kernel does oops.
Fix it initializing such field at qdisc creation time.
Fixes: cdeabbb881 ("net: sched: Convert timers to use timer_setup()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point. This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().
__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out. The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.
Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.
Fixes: aae2c35ec8 ("cls_bpf: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller found a race condition fanout_demux_rollover() while removing
a packet socket from a fanout group.
po->rollover is read and operated on during packet_rcv_fanout(), via
fanout_demux_rollover(), but the pointer is currently cleared before the
synchronization in packet_release(). It is safer to delay the cleanup
until after synchronize_net() has been called, ensuring all calls to
packet_rcv_fanout() for this socket have finished.
To further simplify synchronization around the rollover structure, set
po->rollover in fanout_add() only if there are no errors. This removes
the need for rcu in the struct and in the call to
packet_getsockopt(..., PACKET_ROLLOVER_STATS, ...).
Crashing stack trace:
fanout_demux_rollover+0xb6/0x4d0 net/packet/af_packet.c:1392
packet_rcv_fanout+0x649/0x7c8 net/packet/af_packet.c:1487
dev_queue_xmit_nit+0x835/0xc10 net/core/dev.c:1953
xmit_one net/core/dev.c:2975 [inline]
dev_hard_start_xmit+0x16b/0xac0 net/core/dev.c:2995
__dev_queue_xmit+0x17a4/0x2050 net/core/dev.c:3476
dev_queue_xmit+0x17/0x20 net/core/dev.c:3509
neigh_connected_output+0x489/0x720 net/core/neighbour.c:1379
neigh_output include/net/neighbour.h:482 [inline]
ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120
ip6_finish_output+0x2f9/0x920 net/ipv6/ip6_output.c:146
NF_HOOK_COND include/linux/netfilter.h:239 [inline]
ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163
dst_output include/net/dst.h:459 [inline]
NF_HOOK.constprop.35+0xff/0x630 include/linux/netfilter.h:250
mld_sendpack+0x6a8/0xcc0 net/ipv6/mcast.c:1660
mld_send_initial_cr.part.24+0x103/0x150 net/ipv6/mcast.c:2072
mld_send_initial_cr net/ipv6/mcast.c:2056 [inline]
ipv6_mc_dad_complete+0x99/0x130 net/ipv6/mcast.c:2079
addrconf_dad_completed+0x595/0x970 net/ipv6/addrconf.c:4039
addrconf_dad_work+0xac9/0x1160 net/ipv6/addrconf.c:3971
process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2113
worker_thread+0x223/0x1990 kernel/workqueue.c:2247
kthread+0x35e/0x430 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:432
Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")
Fixes: 509c7a1ecc ("packet: avoid panic in packet_getsockopt()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Mike Maloney <maloney@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now each stream sched ops is defined in different .c file and
added into the global ops in another .c file, it uses extern
to make this work.
However extern is not good coding style to get them in and
even make C=2 reports errors for this.
This patch adds sctp_sched_ops_xxx_init for each stream sched
ops in their .c file, then get them into the global ops by
calling them when initializing sctp module.
Fixes: 637784ade2 ("sctp: introduce priority based stream scheduler")
Fixes: ac1ed8b82c ("sctp: introduce round robin stream scheduler")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to force SCTP_ERROR_INV_STRM with right type to
fit in sctp_chunk_fail to avoid the sparse error.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KASAN revealed another access after delete in group.c. This time
it found that we read the header of a received message after the
buffer has been released.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* CRYPTO_SHA256 is needed for regdb validation
* mac80211: mesh path metric was wrong in some frames
* mac80211: use QoS null-data packets on QoS connections
* mac80211: tear down RX aggregation sessions first to
drop fewer packets in HW restart scenarios
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlob6QMACgkQB8qZga/f
l8SAvw/9E4fttLsdBhUT1gUJPoeNWsSXyhdoo0QWHs4Ni//bayNUL/Wx8QL6MGf/
nprgxDygqFXH5iUO2gwpaMl2xeO1N7Px7bpKV/KvrDX/UAaEDyrkg6B87Omx95vT
i584pU0LwLVL40CT+saKpF+kChe3S/rMPrAQR0xBNPntyIlo9w0vdOkdTC+vw0br
/Nl3YJxpAmOmTDa8tiKLWHg4hSIS3Q7stylUjnEKNev4nBjBaPqfBW6sy/3o4NiR
MfSCNvGeipr0QFZT31cHDry1fRZ/RvD23KSYNMggzctpYf6IRuHSEhIS3mW7GLpd
GKZyLa7G7mZJAKXJAXOxb2PBBrXl/nTqSHJfme0VroGsYdqLHThXd47/0qaAB4Oo
Io84WSmQX41kRyy4KNU7z0OFR9RgLbctdW/hwGeODxKEagGNNjuz+lbzql7jtYLn
8UZK3LZHYC+qribCHR4Uf6A5rJP7bAcl36Ub/8l7AXNVesLagiaE0FGGrewGnE6X
6qQLoYznd4/bjLQLVou56R9/t+o1nUuvmcGMkRxAlhvX3hw5nSUxgkWS/TgCdrBg
cKddmM78AfEakjd6e21bH2QI2tG6fXS913lwIOq9JU870yVlNpUEr98t5nwYNP3P
l5kzxr237Yj4W9LMUmeZIWo/laiAtgYEeQZAynAZWNxXNBdOvaA=
=ODX1
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2017-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Four fixes:
* CRYPTO_SHA256 is needed for regdb validation
* mac80211: mesh path metric was wrong in some frames
* mac80211: use QoS null-data packets on QoS connections
* mac80211: tear down RX aggregation sessions first to
drop fewer packets in HW restart scenarios
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing asoc reset, if the sender of the response has already sent some
chunk and increased asoc->next_tsn before the duplicate request comes, the
response will use the old result with an incorrect sender next_tsn.
Better than asoc->next_tsn, asoc->ctsn_ack_point can't be changed after
the sender of the response has performed the asoc reset and before the
peer has confirmed it, and it's value is still asoc->next_tsn original
value minus 1.
This patch sets sender next_tsn for the old result with ctsn_ack_point
plus 1 when processing the duplicate request, to make sure the sender
next_tsn value peer gets will be always right.
Fixes: 692787cef6 ("sctp: implement receiver-side procedures for the SSN/TSN Reset Request Parameter")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when doing asoc reset, it cleans up sacked and abandoned queues
by calling sctp_outq_free where it also cleans up unsent, retransmit
and transmitted queues.
It's safe for the sender of response, as these 3 queues are empty at
that time. But when the receiver of response is doing the reset, the
users may already enqueue some chunks into unsent during the time
waiting the response, and these chunks should not be flushed.
To void the chunks in it would be removed, it moves the queue into a
temp list, then gets it back after sctp_outq_free is done.
The patch also fixes some incorrect comments in
sctp_process_strreset_tsnreq.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it says in rfc6525#section5.1.4, before sending the request,
C2: The sender has either no outstanding TSNs or considers all
outstanding TSNs abandoned.
Prior to this patch, it tried to consider all outstanding TSNs abandoned
by dropping all chunks in all outqs with sctp_outq_free (even including
sacked, retransmit and transmitted queues) when doing this reset, which
is too aggressive.
To make it work gently, this patch will only allow the asoc reset when
the sender has no outstanding TSNs by checking if unsent, transmitted
and retransmit are all empty with sctp_outq_is_empty before sending
and processing the request.
Fixes: 692787cef6 ("sctp: implement receiver-side procedures for the SSN/TSN Reset Request Parameter")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the out stream reset in sctp stream reconf could be done even if
the stream outq is not empty. It means that users can not be sure
since which msg the new ssn will be used.
To make this more synchronous, it shouldn't allow to do out stream
reset until these chunks in unsent outq all are sent out.
This patch checks the corresponding stream outqs when sending and
processing the request . If any of them has unsent chunks in outq,
it will return -EAGAIN instead or send SCTP_STRRESET_IN_PROGRESS
back to the sender.
Fixes: 7f9d68ac94 ("sctp: implement sender-side procedures for SSN Reset Request Parameter")
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now in stream reconf part there are still some places using magic
number 2 for each stream number length. To make it more readable,
this patch is to replace them with sizeof(__u16).
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing HW restart we tear down aggregations.
Since at this point we are not TX'ing any aggregation, while
the peer is still sending RX aggregation over the air, it will
make sense to tear down the RX aggregations first.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The previous path metric update from RANN frame has not considered
the own link metric toward the transmitting mesh STA. Fix this.
Reported-by: Michael65535
Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When connected to a QoS/WMM AP, mac80211 should use a QoS NDP
for probing it, instead of a regular non-QoS one, fix this.
Change all the drivers to *not* allow QoS NDP for now, even
though it looks like most of them should be OK with that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we want to add a datapath flow, which has more than 500 vxlan outputs'
action, we will get the following error reports:
openvswitch: netlink: Flow action size 32832 bytes exceeds max
openvswitch: netlink: Flow action size 32832 bytes exceeds max
openvswitch: netlink: Actions may not be safe on all matching packets
... ...
It seems that we can simply enlarge the MAX_ACTIONS_BUFSIZE to fix it, but
this is not the root cause. For example, for a vxlan output action, we need
about 60 bytes for the nlattr, but after it is converted to the flow
action, it only occupies 24 bytes. This means that we can still support
more than 1000 vxlan output actions for a single datapath flow under the
the current 32k max limitation.
So even if the nla_len(attr) is larger than MAX_ACTIONS_BUFSIZE, we
shouldn't report EINVAL and keep it move on, as the judgement can be
done by the reserve_sfa_size.
Signed-off-by: zhangliping <zhangliping02@baidu.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
gso_type is being used in binary AND operations together with SKB_GSO_UDP.
The issue is that variable gso_type is of type unsigned short and
SKB_GSO_UDP expands to more than 16 bits:
SKB_GSO_UDP = 1 << 16
this makes any binary AND operation between gso_type and SKB_GSO_UDP to
be always zero, hence making some code unreachable and likely causing
undesired behavior.
Fix this by changing the data type of variable gso_type to unsigned int.
Addresses-Coverity-ID: 1462223
Fixes: 0c19f846d5 ("net: accept UFO datagrams from tuntap and packet")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting the refcount to 0 when allocating a tree to match the number of
switch devices it holds may cause an 'increment on 0; use-after-free',
if CONFIG_REFCOUNT_FULL is enabled.
To fix this, do not decrement the refcount of a newly allocated tree,
increment it when an already allocated tree is found, and decrement it
after the probing of a switch, as done with the previous behavior.
At the same time, make dsa_tree_get and dsa_tree_put accept a NULL
argument to simplify callers, and return the tree after incrementation,
as most kref users like of_node_get and of_node_put do.
Fixes: 8e5bf9759a ("net: dsa: simplify tree reference counting")
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using the host personality, VMCI will grab a mutex for any
queue pair access. In the detach callback for the vmci vsock
transport, we call vsock_stream_has_data while holding a spinlock,
and vsock_stream_has_data will access a queue pair.
To avoid this, we can simply omit calling vsock_stream_has_data
for host side queue pairs, since the QPs are empty per default
when the guest has detached.
This bug affects users of VMware Workstation using kernel version
4.4 and later.
Testing: Ran vsock tests between guest and host, and verified that
with this change, the host isn't calling vsock_stream_has_data
during detach. Ran mixedTest between guest and host using both
guest and host as server.
v2: Rebased on top of recent change to sk_state values
Reviewed-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).
Prevent the user after free by holding all chains (except 0, that one is
already held). foreach_safe is not enough in this case.
To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action goto chain 2
Fixes: 822e86d997 ("net_sched: remove tcf_block_put_deferred()")
Signed-off-by: Roman Kapl <code@rkapl.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When regulatory database certificates are built-in, they're
currently using the SHA256 digest algorithm, so add that to
the build in that case.
Also add a note that for custom certificates, one may need
to add the right algorithms.
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix the rxrpc connection expiry timers so that connections for closed
AF_RXRPC sockets get deleted in a more timely fashion, freeing up the
transport UDP port much more quickly.
(1) Replace the delayed work items with work items plus timers so that
timer_reduce() can be used to shorten them and so that the timer
doesn't requeue the work item if the net namespace is dead.
(2) Don't use queue_delayed_work() as that won't alter the timeout if the
timer is already running.
(3) Don't rearm the timers if the network namespace is dead.
Signed-off-by: David Howells <dhowells@redhat.com>
RxRPC service endpoints expire like they're supposed to by the following
means:
(1) Mark dead rxrpc_net structs (with ->live) rather than twiddling the
global service conn timeout, otherwise the first rxrpc_net struct to
die will cause connections on all others to expire immediately from
then on.
(2) Mark local service endpoints for which the socket has been closed
(->service_closed) so that the expiration timeout can be much
shortened for service and client connections going through that
endpoint.
(3) rxrpc_put_service_conn() needs to schedule the reaper when the usage
count reaches 1, not 0, as idle conns have a 1 count.
(4) The accumulator for the earliest time we might want to schedule for
should be initialised to jiffies + MAX_JIFFY_OFFSET, not ULONG_MAX as
the comparison functions use signed arithmetic.
(5) Simplify the expiration handling, adding the expiration value to the
idle timestamp each time rather than keeping track of the time in the
past before which the idle timestamp must go to be expired. This is
much easier to read.
(6) Ignore the timeouts if the net namespace is dead.
(7) Restart the service reaper work item rather the client reaper.
Signed-off-by: David Howells <dhowells@redhat.com>
We need to transmit a packet every so often to act as a keepalive for the
peer (which has a timeout from the last time it received a packet) and also
to prevent any intervening firewalls from closing the route.
Do this by resetting a timer every time we transmit a packet. If the timer
ever expires, we transmit a PING ACK packet and thereby also elicit a PING
RESPONSE ACK from the other side - which prevents our last-rx timeout from
expiring.
The timer is set to 1/6 of the last-rx timeout so that we can detect the
other side going away if it misses 6 replies in a row.
This is particularly necessary for servers where the processing of the
service function may take a significant amount of time.
Signed-off-by: David Howells <dhowells@redhat.com>
Add an extra timeout that is set/updated when we send a DATA packet that
has the request-ack flag set. This allows us to detect if we don't get an
ACK in response to the latest flagged packet.
The ACK packet is adjudged to have been lost if it doesn't turn up within
2*RTT of the transmission.
If the timeout occurs, we schedule the sending of a PING ACK to find out
the state of the other side. If a new DATA packet is ready to go sooner,
we cancel the sending of the ping and set the request-ack flag on that
instead.
If we get back a PING-RESPONSE ACK that indicates a lower tx_top than what
we had at the time of the ping transmission, we adjudge all the DATA
packets sent between the response tx_top and the ping-time tx_top to have
been lost and retransmit immediately.
Rather than sending a PING ACK, we could just pick a DATA packet and
speculatively retransmit that with request-ack set. It should result in
either a REQUESTED ACK or a DUPLICATE ACK which we can then use in lieu the
a PING-RESPONSE ACK mentioned above.
Signed-off-by: David Howells <dhowells@redhat.com>
Express protocol timeouts for data retransmission and deferred ack
generation in terms on RTT rather than specified timeouts once we have
sufficient RTT samples.
For the moment, this requires just one RTT sample to be able to use this
for ack deferral and two for data retransmission.
The data retransmission timeout is set at RTT*1.5 and the ACK deferral
timeout is set at RTT.
Note that the calculated timeout is limited to a minimum of 4ns to make
sure it doesn't happen too quickly.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't transmit a DELAY ACK immediately on proposal when the Rx window is
rotated, but rather defer it to the work function. This means that we have
a chance to queue/consume more received packets before we actually send the
DELAY ACK, or even cancel it entirely, thereby reducing the number of
packets transmitted.
We do, however, want to continue sending other types of packet immediately,
particularly REQUESTED ACKs, as they may be used for RTT calculation by the
other side.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the rxrpc call expiration timeouts and make them settable from
userspace. By analogy with other rx implementations, there should be three
timeouts:
(1) "Normal timeout"
This is set for all calls and is triggered if we haven't received any
packets from the peer in a while. It is measured from the last time
we received any packet on that call. This is not reset by any
connection packets (such as CHALLENGE/RESPONSE packets).
If a service operation takes a long time, the server should generate
PING ACKs at a duration that's substantially less than the normal
timeout so is to keep both sides alive. This is set at 1/6 of normal
timeout.
(2) "Idle timeout"
This is set only for a service call and is triggered if we stop
receiving the DATA packets that comprise the request data. It is
measured from the last time we received a DATA packet.
(3) "Hard timeout"
This can be set for a call and specified the maximum lifetime of that
call. It should not be specified by default. Some operations (such
as volume transfer) take a long time.
Allow userspace to set/change the timeouts on a call with sendmsg, using a
control message:
RXRPC_SET_CALL_TIMEOUTS
The data to the message is a number of 32-bit words, not all of which need
be given:
u32 hard_timeout; /* sec from first packet */
u32 idle_timeout; /* msec from packet Rx */
u32 normal_timeout; /* msec from data Rx */
This can be set in combination with any other sendmsg() that affects a
call.
Signed-off-by: David Howells <dhowells@redhat.com>
When rxrpc_sendmsg() parses the control message buffer, it places the
parameters extracted into a structure, but lumps together call parameters
(such as user call ID) with operation parameters (such as whether to send
data, send an abort or accept a call).
Split the call parameters out into their own structure, a copy of which is
then embedded in the operation parameters struct.
The call parameters struct is then passed down into the places that need it
instead of passing the individual parameters. This allows for extra call
parameters to be added.
Signed-off-by: David Howells <dhowells@redhat.com>
Delay terminal ACK transmission on a client call by deferring it to the
connection processor. This allows it to be skipped if we can send the next
call instead, the first DATA packet of which will implicitly ack this call.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't set upgrade by default when creating a call from sendmsg(). This is
a holdover from when I was testing the code.
Signed-off-by: David Howells <dhowells@redhat.com>
The caller of rxrpc_accept_call() must release the lock on call->user_mutex
returned by that function.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull networking fixes from David Miller:
1) Fix PCI IDs of 9000 series iwlwifi devices, from Luca Coelho.
2) bpf offload bug fixes from Jakub Kicinski.
3) Fix bpf verifier to NOP out code which is dead at run time because
due to branch pruning the verifier will not explore such
instructions. From Alexei Starovoitov.
4) Fix crash when deleting secondary chains in packet scheduler
classifier. From Roman Kapl.
5) Fix buffer management bugs in smc, from Ursula Braun.
6) Fix regression in anycast route handling, from David Ahern.
7) Fix link settings regression in r8169, from Tobias Jakobi.
8) Add back enough UFO support so that live migration still works, from
Willem de Bruijn.
9) Linearize enough packet data for the full extent to which the ipvlan
code will inspect the packet headers, from Gao Feng.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
ipvlan: Fix insufficient skb linear check for ipv6 icmp
ipvlan: Fix insufficient skb linear check for arp
geneve: only configure or fill UDP_ZERO_CSUM6_RX/TX info when CONFIG_IPV6
net: dsa: bcm_sf2: Clear IDDQ_GLOBAL_PWR bit for PHY
net: accept UFO datagrams from tuntap and packet
net: realtek: r8169: implement set_link_ksettings()
net: ipv6: Fixup device for anycast routes during copy
net/smc: Fix preinitialization of buf_desc in __smc_buf_create()
net/smc: use sk_rcvbuf as start for rmb creation
ipv6: Do not consider linkdown nexthops during multipath
net: sched: fix crash when deleting secondary chains
net: phy: cortina: add missing MODULE_DESCRIPTION/AUTHOR/LICENSE
bpf: fix branch pruning logic
bpf: change bpf_perf_event_output arg5 type to ARG_CONST_SIZE_OR_ZERO
bpf: change bpf_probe_read_str arg2 type to ARG_CONST_SIZE_OR_ZERO
bpf: remove explicit handling of 0 for arg2 in bpf_probe_read
bpf: introduce ARG_PTR_TO_MEM_OR_NULL
i40evf: Use smp_rmb rather than read_barrier_depends
fm10k: Use smp_rmb rather than read_barrier_depends
igb: Use smp_rmb rather than read_barrier_depends
...
Daniel Borkmann says:
====================
pull-request: bpf 2017-11-23
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Several BPF offloading fixes, from Jakub. Among others:
- Limit offload to cls_bpf and XDP program types only.
- Move device validation into the driver and don't make
any assumptions about the device in the classifier due
to shared blocks semantics.
- Don't pass offloaded XDP program into the driver when
it should be run in native XDP instead. Offloaded ones
are not JITed for the host in such cases.
- Don't destroy device offload state when moved to
another namespace.
- Revert dumping offload info into user space for now,
since ifindex alone is not sufficient. This will be
redone properly for bpf-next tree.
2) Fix test_verifier to avoid using bpf_probe_write_user()
helper in test cases, since it's dumping a warning into
kernel log which may confuse users when only running tests.
Switch to use bpf_trace_printk() instead, from Yonghong.
3) Several fixes for correcting ARG_CONST_SIZE_OR_ZERO semantics
before it becomes uabi, from Gianluca. More specifically:
- Add a type ARG_PTR_TO_MEM_OR_NULL that is used only
by bpf_csum_diff(), where the argument is either a
valid pointer or NULL. The subsequent ARG_CONST_SIZE_OR_ZERO
then enforces a valid pointer in case of non-0 size
or a valid pointer or NULL in case of size 0. Given
that, the semantics for ARG_PTR_TO_MEM in combination
with ARG_CONST_SIZE_OR_ZERO are now such that in case
of size 0, the pointer must always be valid and cannot
be NULL. This fix in semantics allows for bpf_probe_read()
to drop the recently added size == 0 check in the helper
that would become part of uabi otherwise once released.
At the same time we can then fix bpf_probe_read_str() and
bpf_perf_event_output() to use ARG_CONST_SIZE_OR_ZERO
instead of ARG_CONST_SIZE in order to fix recently
reported issues by Arnaldo et al, where LLVM optimizes
two boundary checks into a single one for unknown
variables where the verifier looses track of the variable
bounds and thus rejects valid programs otherwise.
4) A fix for the verifier for the case when it detects
comparison of two constants where the branch is guaranteed
to not be taken at runtime. Verifier will rightfully prune
the exploration of such paths, but we still pass the program
to JITs, where they would complain about using reserved
fields, etc. Track such dead instructions and sanitize
them with mov r0,r0. Rejection is not possible since LLVM
may generate them for valid C code and doesn't do as much
data flow analysis as verifier. For bpf-next we might
implement removal of such dead code and adjust branches
instead. Fix from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.
Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.
It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").
(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.
Tested
Booted a v4.13 guest kernel with QEMU. On a host kernel before this
patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
enabled, same as on a v4.13 host kernel.
A UFO packet sent from the guest appears on the tap device:
host:
nc -l -p -u 8000 &
tcpdump -n -i tap0
guest:
dd if=/dev/zero of=payload.txt bs=1 count=2000
nc -u 192.16.1.1 8000 < payload.txt
Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
packets arriving fragmented:
./with_tap_pair.sh ./tap_send_ufo tap0 tap1
(from https://github.com/wdebruij/kerneltools/tree/master/tests)
Changes
v1 -> v2
- simplified set_offload change (review comment)
- documented test procedure
Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian reported a breakage with anycast routes due to commit
4832c30d54 ("net: ipv6: put host and anycast routes on device with
address"). Prior to this commit anycast routes were added against the
loopback device causing repetitive route entries with no insight into
why they existed. e.g.:
$ ip -6 ro ls table local type anycast
anycast 2001:db8:1:: dev lo proto kernel metric 0 pref medium
anycast 2001:db8:2:: dev lo proto kernel metric 0 pref medium
anycast fe80:: dev lo proto kernel metric 0 pref medium
anycast fe80:: dev lo proto kernel metric 0 pref medium
The point of commit 4832c30d54 is to add the routes using the device
with the address which is causing the route to be added. e.g.,:
$ ip -6 ro ls table local type anycast
anycast 2001:db8:1:: dev eth1 proto kernel metric 0 pref medium
anycast 2001:db8:2:: dev eth2 proto kernel metric 0 pref medium
anycast fe80:: dev eth2 proto kernel metric 0 pref medium
anycast fe80:: dev eth1 proto kernel metric 0 pref medium
For traffic to work as it did before, the dst device needs to be switched
to the loopback when the copy is created similar to local routes.
Fixes: 4832c30d54 ("net: ipv6: put host and anycast routes on device with address")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With gcc-4.1.2:
net/smc/smc_core.c: In function ‘__smc_buf_create’:
net/smc/smc_core.c:567: warning: ‘bufsize’ may be used uninitialized in this function
Indeed, if the for-loop is never executed, bufsize is used
uninitialized. In addition, buf_desc is stored for later use, while it
is still a NULL pointer.
Before, error handling was done by checking if buf_desc is non-NULL.
The cleanup changed this to an error check, but forgot to update the
preinitialization of buf_desc to an error pointer.
Update the preinitializatin of buf_desc to fix this.
Fixes: b33982c3a6 ("net/smc: cleanup function __smc_buf_create()")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3e034725c0 ("net/smc: common functions for RMBs and send buffers")
merged handling of SMC receive and send buffers. It introduced sk_buf_size
as merged start value for size determination. But since sk_buf_size is not
used at all, sk_sndbuf is erroneously used as start for rmb creation.
This patch makes sure, sk_buf_size is really used as intended, and
sk_rcvbuf is used as start value for rmb creation.
Fixes: 3e034725c0 ("net/smc: common functions for RMBs and send buffers")
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>