I wrongly assumed tp->tcp_mstamp was up to date at the time
tcp_rack_reo_timeout() was called.
It is not true, since we only update tcp->tcp_mstamp when receiving
a packet (as initially done in commit 69e996c58a ("tcp: add
tp->tcp_mstamp field")
tcp_rack_reo_timeout() being called by a timer and not an incoming
packet, we need to refresh tp->tcp_mstamp
Fixes: 7c1c730859 ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Always zero out ca_priv data in tcp_assign_congestion_control() so that
ca_priv data is cleared out during socket creation.
Also always zero out ca_priv data in tcp_reinit_congestion_control() so
that when cc algorithm is changed, ca_priv data is cleared out as well.
We should still zero out ca_priv data even in TCP_CLOSE state because
user could call connect() on AF_UNSPEC to disconnect the socket and
leave it in TCP_CLOSE state and later call setsockopt() to switch cc
algorithm on this socket.
Fixes: 2b0a8c9ee ("tcp: add CDG congestion control")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices or distributions use HZ=100 or HZ=250
TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.
With tp->tcp_mstamp introduction, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure)
Note that some TCP stacks use usec TCP timestamps where this
patch makes even more sense : Many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.
Tested:
HZ=100 kernel
./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &
Peer without patch :
lpaa24:~# ss -tmi dst lpaa23
...
skmem:(r0,rb8388608,...)
rcv_rtt:10 rcv_space:3210000 minrtt:0.017
Peer with the patch :
lpaa23:~# ss -tmi dst lpaa24
...
skmem:(r0,rb428800,...)
rcv_rtt:0.069 rcv_space:30000 minrtt:0.017
We can see saner RCVBUF, and more precise rcv_rtt information.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is no longer needed, everything uses tp->tcp_mstamp instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will remove ack_time from struct tcp_sacktag_state
Same info is now found in tp->tcp_mstamp
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed, since tp->tcp_mstamp holds the information.
This is needed to remove sack_state.ack_time in a following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed, since tp->tcp_mstamp holds the information.
This is needed to remove sack_state.ack_time in a following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not used anymore now tp->tcp_mstamp holds the information.
This is needed to remove sack_state.ack_time in a following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not used anymore now tp->tcp_mstamp holds the information.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is no longer used, since tcp_rack_detect_loss() takes
the timestamp from tp->tcp_mstamp
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can use tp->tcp_mstamp as it contains a recent timestamp.
This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to use precise timestamps in TCP stack, but we do not
want to call possibly expensive kernel time services too often.
tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph Paasch from Apple found another firewall issue for TFO:
After successful 3WHS using TFO, server and client starts to exchange
data. Afterwards, a 10s idle time occurs on this connection. After that,
firewall starts to drop every packet on this connection.
The fix for this issue is to extend existing firewall blackhole detection
logic in tcp_write_timeout() by removing the mss check.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter records the number of times the firewall blackhole issue is
detected and active TFO is disabled.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise, UDP checksum offloads could corrupt ESP packets by attempting
to calculate UDP checksum when this inner UDP packet is already protected
by IPsec.
One way to reproduce this bug is to have a VM with virtio_net driver (UFO
set to ON in the guest VM); and then encapsulate all guest's Ethernet
frames in Geneve; and then further encrypt Geneve with IPsec. In this
case following symptoms are observed:
1. If using ixgbe NIC, then it will complain with following error message:
ixgbe 0000:01:00.1: partial checksum but l4 proto=32!
2. Receiving IPsec stack will drop all the corrupted ESP packets and
increase XfrmInStateProtoError counter in /proc/net/xfrm_stat.
3. iperf UDP test from the VM with packet sizes above MTU will not work at
all.
4. iperf TCP test from the VM will get ridiculously low performance because.
Signed-off-by: Ansis Atteka <aatteka@ovn.org>
Co-authored-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David reported that doing the following:
ip li add red type vrf table 10
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red
when either policy routing IP rules are present or the local table
lookup ip rule is before the l3mdev lookup results in a hang with
these messages:
unregister_netdevice: waiting for red to become free. Usage count = 1
The problem is caused by caching the dst used for sending the packet
out of the specified interface on a local route with a different
nexthop interface. Thus the dst could stay around until the route in
the table the lookup was done is deleted which may be never.
Address the problem by not forcing output device to be the l3mdev in
the flow's output interface if the lookup didn't use the l3mdev. This
then results in the dst using the right device according to the route.
Changes in v2:
- make the dev_out passed in by __ip_route_output_key_hash correct
instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
suggested by David.
Fixes: 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be a loopback")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-04-20
This adds the basic infrastructure for IPsec hardware
offloading, it creates a configuration API and adjusts
the packet path.
1) Add the needed netdev features to configure IPsec offloads.
2) Add the IPsec hardware offloading API.
3) Prepare the ESP packet path for hardware offloading.
4) Add gso handlers for esp4 and esp6, this implements
the software fallback for GSO packets.
5) Add xfrm replay handler functions for offloading.
6) Change ESP to use a synchronous crypto algorithm on
offloading, we don't have the option for asynchronous
returns when we handle IPsec at layer2.
7) Add a xfrm validate function to validate_xmit_skb. This
implements the software fallback for non GSO packets.
8) Set the inner_network and inner_transport members of
the SKB, as well as encapsulation, to reflect the actual
positions of these headers, and removes them only once
encryption is done on the payload.
From Ilan Tayari.
9) Prepare the ESP GRO codepath for hardware offloading.
10) Fix incorrect null pointer check in esp6.
From Colin Ian King.
11) Fix for the GSO software fallback path to detect the
fallback correctly.
From Ilan Tayari.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This feature allows the administrator to set an fwmark for
packets traversing a tunnel. This allows the use of independent
routing tables for tunneled packets without the use of iptables.
There is no concept of per-packet routing decisions through IPv4
tunnels, so this implementation does not need to work with
per-packet route lookups as the v6 implementation may
(with IP6_TNL_F_USE_ORIG_FWMARK).
Further, since the v4 tunnel ioctls share datastructures
(which can not be trivially modified) with the kernel's internal
tunnel configuration structures, the mark attribute must be stored
in the tunnel structure itself and passed as a parameter when
creating or changing tunnel attributes.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using TCP FastOpen for an active session, we send one wakeup event
from tcp_finish_connect(), right before the data eventually contained in
the received SYNACK is queued to sk->sk_receive_queue.
This means that depending on machine load or luck, poll() users
might receive POLLOUT events instead of POLLIN|POLLOUT
To fix this, we need to move the call to sk->sk_state_change()
after the (optional) call to tcp_rcv_fastopen_synack()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a RST packet is processed, we send two wakeup events to interested
polling users.
First one by a sk->sk_error_report(sk) from tcp_reset(),
followed by a sk->sk_state_change(sk) from tcp_done().
Depending on machine load and luck, poll() can either return POLLERR,
or POLLIN|POLLOUT|POLLERR|POLLHUP (this happens on 99 % of the cases)
This is probably fine, but we can avoid the confusion by reordering
things so that we have more TCP fields updated before the first wakeup.
This might even allow us to remove some barriers we added in the past.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
If esp*_offload module is loaded, outbound packets take the
GSO code path, being encapsulated at layer 3, but encrypted
in layer 2. validate_xmit_xfrm calls esp*_xmit for that.
esp*_xmit was wrongfully detecting these packets as going
through hardware crypto offload, while in fact they should
be encrypted in software, causing plaintext leakage to
the network, and also dropping at the receiver side.
Perform the encryption in esp*_xmit, if the SA doesn't have
a hardware offload_handle.
Also, align esp6 code to esp4 logic.
Fixes: fca11ebde3 ("esp4: Reorganize esp_output")
Fixes: 383d0350f2 ("esp6: Reorganize esp_output")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.
This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzkaller reported a use-after-free in ip_recv_error at line
info->ipi_ifindex = skb->dev->ifindex;
This function is called on dequeue from the error queue, at which
point the device pointer may no longer be valid.
Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
pointer is valid or NULL. Store it in temporary storage skb->cb.
It is safe to reference skb->dev here, as called from device drivers
or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
in that case it is NULL and ifindex is set to 0 (invalid).
Do not return a pktinfo cmsg if ifindex is 0. This maintains the
current behavior of not returning a cmsg if skb->dev was NULL.
On dequeue, the ipv4 path will cast from sock_exterr_skb to
in_pktinfo. Both have ifindex as their first element, so no explicit
conversion is needed. This is by design, introduced in commit
0b922b7a82 ("net: original ingress device index in PKTINFO"). For
ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
Fixes: 829ae9d611 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 87e9f03159
("ipv4: fix a potential deadlock in mcast getsockopt() path"),
there is a deadlock scenario for IP_ROUTER_ALERT too:
CPU0 CPU1
---- ----
lock(rtnl_mutex);
lock(sk_lock-AF_INET);
lock(rtnl_mutex);
lock(sk_lock-AF_INET);
Fix this by always locking RTNL first on all setsockopt() paths.
Note, after this patch ip_ra_lock is no longer needed either.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Missing TCP header sanity check in TCPMSS target, from Eric Dumazet.
2) Incorrect event message type for related conntracks created via
ctnetlink, from Liping Zhang.
3) Fix incorrect rcu locking when handling helpers from ctnetlink,
from Gao feng.
4) Fix missing rcu locking when updating helper, from Liping Zhang.
5) Fix missing read_lock_bh when iterating over list of device addresses
from TPROXY and redirect, also from Liping.
6) Fix crash when trying to dump expectations from conntrack with no
helper via ctnetlink, from Liping.
7) Missing RCU protection to expecation list update given ctnetlink
iterates over the list under rcu read lock side, from Liping too.
8) Don't dump autogenerated seed in nft_hash to userspace, this is
very confusing to the user, again from Liping.
9) Fix wrong conntrack netns module refcount in ipt_CLUSTERIP,
from Gao feng.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On IPsec hardware offloading, we already get a secpath with
valid state attached when the packet enters the GRO handlers.
So check for hardware offload and skip the state lookup in this
case.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Both esp4 and esp6 used to assume that the SKB payload is encrypted
and therefore the inner_network and inner_transport offsets are
not relevant.
When doing crypto offload in the NIC, this is no longer the case
and the NIC driver needs these offsets so it can do TX TCP checksum
offloading.
This patch sets the inner_network and inner_transport members of
the SKB, as well as encapsulation, to reflect the actual positions
of these headers, and removes them only once encryption is done
on the payload.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We need a fallback algorithm for crypto offloading to a NIC.
This is because packets can be rerouted to other NICs that
don't support crypto offloading. The fallback is going to be
implemented at layer2 where we know the final output device
but can't handle asynchronous returns fron the crypto layer.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch extends the xfrm_type by an encap function pointer
and implements esp4_gso_encap and esp6_gso_encap. These functions
doing the basic esp encapsulation for a GSO packet. In case the
GSO packet needs to be segmented in software, we add gso_segment
functions. This codepath is going to be used on esp hardware
offloads.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We need a fallback for ESP at layer 2, so split esp_output
into generic functions that can be used at layer 3 and layer 2
and use them in esp_output. We also add esp_xmit which is
used for the layer 2 fallback.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch adds all the bits that are needed to do
IPsec hardware offload for IPsec states and ESP packets.
We add xfrmdev_ops to the net_device. xfrmdev_ops has
function pointers that are needed to manage the xfrm
states in the hardware and to do a per packet
offloading decision.
Joint work with:
Ilan Tayari <ilant@mellanox.com>
Guy Shapiro <guysh@mellanox.com>
Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch adds a gso_segment and xmit callback for the
xfrm_mode and implement these functions for tunnel and
transport mode.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Current codes invoke wrongly nf_ct_netns_get in the destroy routine,
it should use nf_ct_netns_put, not nf_ct_netns_get.
It could cause some modules could not be unloaded.
Fixes: ecb2421b5d ("netfilter: add and use nf_ct_netns_get/put")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Don't get the metric RTAX_ADVMSS of dst.
There are two reasons.
1) Its caller dst_metric_advmss has already invoke dst_metric_advmss
before invoke default_advmss.
2) The ipv4_default_advmss is used to get the default mss, it should
not try to get the metric like ip6_default_advmss.
2. Use sizeof(tcphdr)+sizeof(iphdr) instead of literal 40.
3. Define one new macro IPV4_MAX_PMTU instead of 65535 according to
RFC 2675, section 5.1.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because TCP_MIB_OUTRSTS is an important count, so always increase it
whatever send it successfully or not.
Now move the increment of TCP_MIB_OUTRSTS to the top of
tcp_send_active_reset to make sure it is increased always even though
fail to alloc skb.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent extension of F-RTO 89fe18e44 ("tcp: extend F-RTO
to catch more spurious timeouts") interacts badly with certain
broken middle-boxes. These broken boxes modify and falsely raise
the receive window on the ACKs. During a timeout induced recovery,
F-RTO would send new data packets to probe if the timeout is false
or not. Since the receive window is falsely raised, the receiver
would silently drop these F-RTO packets. The recovery would take N
(exponentially backoff) timeouts to repair N packet losses. A TCP
performance killer.
Due to this unfortunate situation, this patch removes this extension
to revert F-RTO back to the RFC specification.
Fixes: 89fe18e44f ("tcp: extend F-RTO to catch more spurious timeouts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
ip_route_input when iif is given. If a multipath route is present for
the designated destination, fib_multipath_hash ends up being called with
that skb. However, as that skb contains no information beyond the
protocol type, the calculated hash does not match the one we would see
for a real packet.
There is currently no way to fix this for layer 4 hashing, as
RTM_GETROUTE doesn't have the necessary information to create layer 4
headers. To fix this for layer 3 hashing, set appropriate saddr/daddrs
in the skb and also change the protocol to UDP to avoid special
treatment for ICMP.
Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
ip_route_input when iif is given. If a multipath route is present for
the designated destination, ip_multipath_icmp_hash ends up being called,
which uses the source/destination addresses within the skb to calculate
a hash. However, those are not set in the synthetic skb, causing it to
return an arbitrary and incorrect result.
Instead, use UDP, which gets no such special treatment.
Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the reordering SNMP counters only increase if a connection
sees a higher degree then it has previously seen. It ignores if the
reordering degree is not greater than the default system threshold.
This significantly under-counts the number of reordering events
and falsely convey that reordering is rare on the network.
This patch properly and faithfully records the number of reordering
events detected by the TCP stack, just like the comment says "this
exciting event is worth to be remembered". Note that even so TCP
still under-estimate the actual reordering events because TCP
requires TS options or certain packet sequences to detect reordering
(i.e. ACKing never-retransmitted sequence in recovery or disordered
state).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lost retransmit SNMP stat is under-counting retransmission
that uses segment offloading. This patch fixes that so all
retransmission related SNMP counters are consistent.
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define one new macro TCP_MAX_WSCALE instead of literal number '14',
and use U16_MAX instead of 65535 as the max value of TCP window.
There is another minor change, use rounddown(space, mss) instead of
(space / mss) * mss;
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>