a helper to prepare the main RACK patch
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.
Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code
The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.
The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.
Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/asix_common.c
net/ipv4/inet_connection_sock.c
net/switchdev/switchdev.c
In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.
The other two conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree. Most relevantly, updates for the nfnetlink_log to integrate with
conntrack, fixes for cttimeout and improvements for nf_queue core, they are:
1) Remove useless ifdef around static inline function in IPVS, from
Eric W. Biederman.
2) Simplify the conntrack support for nfnetlink_queue: Merge
nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
to nfnetlink_queue.c
3) Use y2038 safe timestamp from nfnetlink_queue.
4) Get rid of dead function definition in nf_conntrack, from Flavio
Leitner.
5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
controls enabling both nfqueue and nflog integration with conntrack.
The userspace application can request this via NFULNL_CFG_F_CONNTRACK
configuration flag.
6) Remove unused netns variables in IPVS, from Eric W. Biederman and
Simon Horman.
7) Don't put back the refcount on the cttimeout object from xt_CT on success.
8) Fix crash on cttimeout policy object removal. We have to flush out
the cttimeout extension area of the conntrack not to refer to an unexisting
object that was just removed.
9) Make sure rcu_callback completion before removing nfnetlink_cttimeout
module removal.
10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.
11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
requested. Again from Ken-ichirou MATSUZAWA.
12) Don't use pointer to previous hook when reinjecting traffic via
nf_queue with NF_REPEAT verdict since it may be already gone. This
also avoids a deadloop if the userspace application keeps returning
NF_REPEAT.
13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.
14) Consolidate logger instance existence check in nfulnl_recv_config().
15) Fix broken atomicity when applying configuration updates to logger
instances in nfnetlink_log.
16) Get rid of the .owner attribute in our hook object. We don't need
this anymore since we're dropping pending packets that have escaped
from the kernel when unremoving the hook. Patch from Florian Westphal.
17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.
18) Use static inline function instead of macros to define NF_HOOK() and
NF_HOOK_COND() when no netfilter support in on, from Arnd Bergmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.
We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.
Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A dhcp server may provide parameters to a client from a pool of IP
addresses and using a shared rootfs, or provide a specific set of
parameters for a specific client, usually using the MAC address to
identify each client individually. The dhcp protocol also specifies
a client-id field which can be used to determine the correct
parameters to supply when no MAC address is available. There is
currently no way to tell the kernel to supply a specific client-id,
only the userspace dhcp clients support this feature, but this can
not be used when the network is needed before userspace is available
such as when the root filesystem is on NFS.
This patch is to be able to do something like "ip=dhcp,client_id_type,
client_id_value", as a kernel parameter to enable the kernel to
identify itself to the server.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This merge resolves conflicts with 75aec9df3a ("bridge: Remove
br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
netns support in the network stack that reached upstream via David's
net-next tree.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/bridge/br_netfilter_hooks.c
This patch cleanses whitespace around arithmetical operators.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use tabs instead of spaces to indent code.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use tabs instead of spaces to indent second line of parameters in
function definitions.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Correct whitespace layout of ternary operators in the netfilter-ipv4
code.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Whitespace cleansing: Labels should not be indented.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
since commit 8405a8fff3 ("netfilter: nf_qeueue: Drop queue entries on
nf_unregister_hook") all pending queued entries are discarded.
So we can simply remove all of the owner handling -- when module is
removed it also needs to unregister all its hooks.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Under stress, a close() on a listener can trigger the
WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()
We need to test if listener is still active before queueing
a child in inet_csk_reqsk_queue_add()
Create a common inet_child_forget() helper, and use it
from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
In many cases we also need to release reference on request socket,
so add a helper to do this, reducing code size and complexity.
Fixes: 4bdc3d6614 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c69736696c.
At the time of above commit, tcp_req_err() and dccp_req_err()
were dead code, as SYN_RECV request sockets were not yet in ehash table.
Real bug was fixed later in a different commit.
We need to revert to not leak a refcount on request socket.
inet_csk_reqsk_queue_drop_and_put() will be added
in following commit to make clean inet_csk_reqsk_queue_drop()
does not release the reference owned by caller.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At listen() time, there is a small window where listener is visible with
a zero backlog, triggering a spurious "Possible SYN flooding on port"
message.
Nothing prevents us from setting the correct backlog.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we no longer hold listener lock in fast path, it is possible that a
child is created right after listener freed its bound port, if a close()
is done while incoming packets are processed.
__inet_inherit_port() must detect this and return an error,
so that caller can free the child earlier.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert the commit e2ca690b65 ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:
The IPv4 source address of an ICMP redirect should be the address
that the end-host used when making its next-hop routing decision.
while said commit used the generating packet destination
address, which do not match the above and in most cases leads to
no redirect packets to be generated.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
become stale, meaning 3WHS can not complete.
But current behavior is wrong :
incoming packets finding such stale sockets are dropped.
We need instead to cleanup the request socket and perform another
lookup :
- Incoming ACK will give a RST answer,
- SYN rtx might find another listener if available.
- We expedite cleanup of request sockets and old listener socket.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.
So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_call_ra_chain is called early in the forwarding chain from
ip_forward and ip_mr_input, which makes skb->dev the correct
expression to get the input network device and dev_net(skb->dev) a
correct expression for the network namespace the packet is being
processed in.
Compute the network namespace and store it in a variable to make the
code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr force the
usage of the destination address of the packet that caused the
redirect.
The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
following scenario:
Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.
If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-op.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One 32bit hole is following skc_refcnt, use it.
skc_incoming_cpu can also be an union for request_sock rcv_wnd.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SO_INCOMING_CPU as added in commit 2c8c56e15d was a getsockopt() command
to fetch incoming cpu handling a particular TCP flow after accept()
This commits adds setsockopt() support and extends SO_REUSEPORT selection
logic : If a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if CPU handling the packet matches the specified
one.
This allows to build very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keep cpu caches hot.
Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The alive parameter of tcp_orphan_retries, indicates
whether the connection is assumed alive or not.
In the function and all places calling it is used as a boolean value.
Therefore this changes the type of alive to bool in the function
definition and all calling locations.
Since tcp_orphan_tries is a tcp_timer.c local function no change in
any other file or header is necessary.
Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before recent TCP listener patches, we were updating listener
sk->sk_rxhash before the cloning of master socket.
children sk_rxhash was therefore correct after the normal 3WHS.
But with lockless listener, we no longer dirty/change listener sk_rxhash
as it would be racy.
We need to correctly update the child sk_rxhash, otherwise first data
packet wont hit correct cpu if RFS is used.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit deaa0a6a93 ("net: Lookup actual route when oif is VRF device")
exposed a bug in __ip_route_output_key_hash for VRF devices: on FIB lookup
failure if the oif is specified the current logic drops to make_route on
the assumption that the route tables are wrong. For VRF/L3 master devices
this leads to wrong dst entries and route lookups. For example:
$ ip route ls table vrf-red
unreachable default
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
$ ip route get oif vrf-red 1.1.1.1
1.1.1.1 dev vrf-red src 10.0.0.2
cache
With this patch:
$ ip route get oif vrf-red 1.1.1.1
RTNETLINK answers: No route to host
which is the correct response based on the default route
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Compute net and store it in a variable in the functions
ip_build_and_send_pkt and ip_queue_xmit so that it does not need to be
recomputed next time it is needed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Store net in a variable in ip_tunnel_xmit so it does not need
to be recomputed when it is used again.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is confusing and silly hiding a parameter so modify all of
the callers to pass in the appropriate socket or skb->sk if
no socket is known.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with the other similar methods in the kernel pass a
struct sock into the dst_ops .local_out method.
Simplifying the socket passing case is needed a prequel to passing a
struct net reference into .local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace dst_output_okfn with dst_output
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a packet has been encapsulated by a tunnel we should use the
tunnel sockets local multicast loopback flag to control if the
encapsulated packet should be locally loopback back.
Pass sk into ip_local_out_sk so that in the rare case we are dealing
with a tunneled packet whose tunnel destination address is a multicast
address the kernel properly decides to loopback this packet.
In practice I don't think this matters as ip_queue_xmit is used by
tcp, l2tp and sctp none of which I am aware of uses ip level
multicasting as they are all point to point communications protocols.
Let's fix this before someone uses ip_queue_xmit for a tunnel protocol
that does use multicast.
Fixes: aad88724c9 ("ipv4: add a sock pointer to dst->output() path.")
Fixes: b0270e9101 ("ipv4: add a sock pointer to ip_queue_xmit()")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the rare case where sk != skb->sk ip_local_out_sk arranges
to call dst->output differently if the skb is queued or not.
This is a bug.
Fix this bug by passing the sk parameter of ip_local_out_sk through
from ip_local_out_sk to __ip_local_out_sk (skipping __ip_local_out).
Fixes: 7026b1ddb6 ("netfilter: Pass socket pointer down through okfn().")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some applications use a listen() backlog of 1.
Prior kernels were silently enforcing a qlen_log of 4, so that we were
sending up to /proc/sys/net/ipv4/tcp_synack_retries SYNACK messages.
Fixes: ef547f2ac1 ("tcp: remove max_qlen_log")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user specifies a VRF device in a get route query the custom route
pointing to the VRF device is returned:
$ ip route ls table vrf-red
unreachable default
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
$ ip route get oif vrf-red 10.2.1.40
10.2.1.40 dev vrf-red
cache
Add the flags to skip the custom route and go directly to the FIB. With
this patch the actual route is returned:
$ ip route get oif vrf-red 10.2.1.40
10.2.1.40 dev eth1 src 10.2.1.2
cache
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ping originated on box through a VRF device is showing up in tcpdump
without a source address:
$ tcpdump -n -i vrf-blue
08:58:33.311303 IP 0.0.0.0 > 10.2.2.254: ICMP echo request, id 2834, seq 1, length 64
08:58:33.311562 IP 10.2.2.254 > 10.2.2.2: ICMP echo reply, id 2834, seq 1, length 64
Add the call to l3mdev_get_saddr to raw_sendmsg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add operation to l3mdev to lookup source address for a given flow.
Add support for the operation to VRF driver and convert existing
IPv4 hooks to use the new lookup.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF device needs the same path selection following lookup to set source
address. Rather than duplicating code, move existing code into a
function that is exported to modules.
Code move only; no functional change.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes
net/built-in.o: In function `fib_rebalance':
fib_semantics.c:(.text+0x9df14): undefined reference to `__divdi3'
and
net/built-in.o: In function `fib_rebalance':
net/ipv4/fib_semantics.c:572: undefined reference to `__aeabi_ldivmod'
Fixes: 0e884c78ee ("ipv4: L3 hash-based multipath")
Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>