Commit Graph

606 Commits

Author SHA1 Message Date
Eric Dumazet 789f558cfb tcp/dccp: get rid of central timewait timer
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:40:05 -04:00
Eric Dumazet b52e69217b tcp: md5: fix a typo in tcp_v4_md5_lookup()
Lookup key for tcp_md5_do_lookup() has to be taken
from addr_sk, not sk (which can be the listener)

Fixes: fd3a154a00 ("tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 18:11:16 -04:00
Ian Morris 00db41243e ipv4: coding style: comparison for inequality with NULL
The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03 12:11:15 -04:00
Ian Morris 51456b2914 ipv4: coding style: comparison for equality with NULL
The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03 12:11:15 -04:00
David S. Miller 9f0d34bc34 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	drivers/net/usb/sr9800.c
	drivers/net/usb/usbnet.c
	include/linux/usb/usbnet.h
	net/ipv4/tcp_ipv4.c
	net/ipv6/tcp_ipv6.c

The TCP conflicts were overlapping changes.  In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.

With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:16:53 -04:00
Eric Dumazet 41d25fe092 tcp: tcp_syn_flood_action() can be static
After commit 1fb6f159fd ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.

We can make it static, by moving it above tcp_conn_request()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:17:18 -07:00
Eric Dumazet 0144a81ccc tcp: fix ipv4 mapped request socks
ss should display ipv4 mapped request sockets like this :

tcp    SYN-RECV   0      0  ::ffff:192.168.0.1:8080   ::ffff:192.0.2.1:35261

and not like this :

tcp    SYN-RECV   0      0  192.168.0.1:8080   192.0.2.1:35261

We should init ireq->ireq_family based on listener sk_family,
not the actual protocol carried by SYN packet.

This means we can set ireq_family in inet_reqsk_alloc()

Fixes: 3f66b083a5 ("inet: introduce ireq_family")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 00:57:48 -04:00
Eric Dumazet fd3a154a00 tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()
With request socks convergence, we no longer need
different lookup methods. A request socket can
use generic lookup function.

Add const qualifier to 2nd tcp_v[46]_md5_lookup() parameter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:30 -04:00
Eric Dumazet 39f8e58e53 tcp: md5: remove request sock argument of calc_md5_hash()
Since request and established sockets now have same base,
there is no need to pass two pointers to tcp_v4_md5_hash_skb()
or tcp_v6_md5_hash_skb()

Also add a const qualifier to their struct tcp_md5sig_key argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:30 -04:00
Eric Dumazet ff74e23f7e tcp: md5: input path is run under rcu protected sections
It is guaranteed that both tcp_v4_rcv() and tcp_v6_rcv()
run from rcu read locked sections :

ip_local_deliver_finish() and ip6_input_finish() both
use rcu_read_lock()

Also align tcp_v6_inbound_md5_hash() on tcp_v4_inbound_md5_hash()
by returning a boolean.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:29 -04:00
Michal Kubeček d0c294c53a tcp: prevent fetching dst twice in early demux code
On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux()

        struct dst_entry *dst = sk->sk_rx_dst;

        if (dst)
                dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);

to code reading sk->sk_rx_dst twice, once for the test and once for
the argument of ip6_dst_check() (dst_check() is inline). This allows
ip6_dst_check() to be called with null first argument, causing a crash.

Protect sk->sk_rx_dst access by READ_ONCE() both in IPv4 and IPv6
TCP early demux code.

Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Fixes: c7109986db ("ipv6: Early TCP socket demux")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:38:24 -04:00
Fan Du c69736696c inet: fix double request socket freeing
Eric Hugne reported following error :

I'm hitting this warning on latest net-next when i try to SSH into a machine
with eth0 added to a bridge (but i think the problem is older than that)

Steps to reproduce:
node2 ~ # brctl addif br0 eth0
[  223.758785] device eth0 entered promiscuous mode
node2 ~ # ip link set br0 up
[  244.503614] br0: port 1(eth0) entered forwarding state
[  244.505108] br0: port 1(eth0) entered forwarding state
node2 ~ # [  251.160159] ------------[ cut here ]------------
[  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
[  251.162077] Modules linked in:
[  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
[  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
[  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
[  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
[  251.167203] Call Trace:
[  251.167533]  [<ffffffff8162ace4>] dump_stack+0x45/0x57
[  251.168206]  [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
[  251.169239]  [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
[  251.170271]  [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
[  251.171408]  [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
[  251.172589]  [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
[  251.173366]  [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
[  251.174134]  [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
[  251.174820]  [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
[  251.175473]  [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
[  251.176282]  [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
[  251.177004]  [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
[  251.177693]  [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
[  251.178336]  [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
[  251.179170]  [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
[  251.179922]  [<ffffffff814f97d4>] process_backlog+0x94/0x120
[  251.180639]  [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
[  251.181356]  [<ffffffff81051267>] __do_softirq+0xa7/0x290
[  251.182046]  [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
[  251.182726]  [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
[  251.183485]  [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
[  251.184228]  [<ffffffff8106935e>] kthread+0xee/0x110
[  251.184871]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[  251.185690]  [<ffffffff81631108>] ret_from_fork+0x58/0x90
[  251.186385]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
[  259.542268] br0: port 1(eth0) entered forwarding state

Remove the double calls to reqsk_put()

[edumazet] :

I got confused because reqsk_timer_handler() _has_ to call
reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
the timer handler holds a reference on req.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 21:40:48 -04:00
Eric Dumazet 26e3736090 ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets
tcp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New tcp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet b282705336 net: convert syn_wait_lock to a spinlock
This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.

We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet fa76ce7328 inet: get rid of central tcp/dccp listener timer
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Eric Dumazet 52452c5425 inet: drop prev pointer handling in request sock
When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Eric Dumazet 08d2cc3b26 inet: request sock should init IPv6/IPv4 addresses
In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet d1e559d0b1 inet: add IPv6 support to sk_ehashfn()
Intent is to converge IPv4 & IPv6 inet_hash functions to
factorize code.

IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr
in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set()
helpers.

__inet6_hash can now use sk_ehashfn() instead of a private
inet6_sk_ehashfn() and will simply use __inet_hash() in a
following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Eric Dumazet f7e4eb03f9 inet: ip early demux should avoid request sockets
When a request socket is created, we do not cache ip route
dst entry, like for timewait sockets.

Let's use sk_fullsock() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 15:55:29 -04:00
Eric Dumazet 3f66b083a5 inet: introduce ireq_family
Before inserting request socks into general hash table,
fill their socket family.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 22:58:13 -04:00
Eric Dumazet d4f06873b6 inet: get_openreq4() & get_openreq6() do not need listener
ireq->ir_num contains local port, use it.

Also, get_openreq4() dumping listen_sk->refcnt makes litle sense.

inet_diag_fill_req() can also use ireq->ir_num

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 22:58:13 -04:00
Fan Du 05cbc0db03 ipv4: Create probe timer for tcp PMTU as per RFC4821
As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:57:42 -05:00
Fan Du 6b58e0a5f3 ipv4: Use binary search to choose tcp PMTU probe_size
Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24  -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:57:41 -05:00
Fan Du b0f9ca53cb ipv4: Namespecify TCP PMTU mechanism
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.

Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 18:45:00 -08:00
David S. Miller 6e03f896b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/vxlan.c
	drivers/vhost/net.c
	include/linux/if_vlan.h
	net/core/dev.c

The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.

In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.

In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.

In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 14:33:28 -08:00
Eric Dumazet bdbbb8527b ipv4: tcp: get rid of ugly unicast_sock
In commit be9f4a44e7 ("ipv4: tcp: remove per net tcp_sock")
I tried to address contention on a socket lock, but the solution
I chose was horrible :

commit 3a7c384ffd ("ipv4: tcp: unicast_sock should not land outside
of TCP stack") addressed a selinux regression.

commit 0980e56e50 ("ipv4: tcp: set unicast_sock uc_ttl to -1")
took care of another regression.

commit b5ec8eeac4 ("ipv4: fix ip_send_skb()") fixed another regression.

commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
was another shot in the dark.

Really, just use a proper socket per cpu, and remove the skb_orphan()
call, to re-enable flow control.

This solves a serious problem with FQ packet scheduler when used in
hostile environments, as we do not want to allocate a flow structure
for every RST packet sent in response to a spoofed packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01 23:06:19 -08:00
Daniel Borkmann 81164413ad net: tcp: add per route congestion control
This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.

For our specific use case, this additionally facilitates deployment
of DCTCP, for example, applications can easily serve internal
traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
would also allow for utilizing e.g. long living, low priority
background flows for certain destinations/routes while still being
able for normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given its actually a link specific
property, we argue that a per route/destination setting is the most
natural and flexible.

The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter allows
to enforce the given algorithm so that applications in user space
would not be allowed to overwrite that algorithm for that destination.

The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.

Joint work with Florian Westphal.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05 22:55:24 -05:00
David S. Miller 22f10923dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/amd/xgbe/xgbe-desc.c
	drivers/net/ethernet/renesas/sh_eth.c

Overlapping changes in both conflict cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 15:48:20 -05:00
Eric Dumazet 0f85feae6b tcp: fix more NULL deref after prequeue changes
When I cooked commit c3658e8d0f ("tcp: fix possible NULL dereference in
tcp_vX_send_reset()") I missed other spots we could deref a NULL
skb_dst(skb)

Again, if a socket is provided, we do not need skb_dst() to get a
pointer to network namespace : sock_net(sk) is good enough.

Reported-by: Dann Frazier <dann.frazier@canonical.com>
Bisected-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: ca777eff51 ("tcp: remove dst refcount false sharing for prequeue mode")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 21:38:44 -05:00
David S. Miller 60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Eric Dumazet c3658e8d0f tcp: fix possible NULL dereference in tcp_vX_send_reset()
After commit ca777eff51 ("tcp: remove dst refcount false sharing for
prequeue mode") we have to relax check against skb dst in
tcp_v[46]_send_reset() if prequeue dropped the dst.

If a socket is provided, a full lookup was done to find this socket,
so the dst test can be skipped.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88191
Reported-by: Jaša Bartelj <jasa.bartelj@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Borkmann <dborkman@redhat.com>
Fixes: ca777eff51 ("tcp: remove dst refcount false sharing for prequeue mode")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-25 14:29:18 -05:00
Eric Dumazet 2c8c56e15d net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.

Then split a set of million of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.

Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.

We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.

After accept(), connect(), or even file descriptor passing around
processes, applications can use :

 int cpu;
 socklen_t len = sizeof(cpu);

 getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);

And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:00:06 -05:00
Eric Dumazet 3d97379a67 tcp: move sk_mark_napi_id() at the right place
sk_mark_napi_id() is used to record for a flow napi id of incoming
packets for busypoll sake.
We should do this only on established flows, not on listeners.

This was 'working' by virtue of the socket cloning, but doing
this on SYN packets in unecessary cache line dirtying.

Even if we move sk_napi_id in the same cache line than sk_lock,
we are working to make SYN processing lockless, so it is desirable
to set sk_napi_id only for established flows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:00:05 -05:00
Sathya Perla 9e7ceb0607 net: fix saving TX flow hash in sock for outgoing connections
The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
to set skb->hash which is used to spread flows across multiple TXQs.

But, the above routines are invoked before the source port of the connection
is created. Because of this all outgoing connections that just differ in the
source port get hashed into the same TXQ.

This patch fixes this problem for IPv4/6 by invoking the the above routines
after the source port is available for the socket.

Fixes: b73c3d0e4("net: Save TX flow hash in sock and set in skbuf on xmit")

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-22 16:14:29 -04:00
Cong Wang 461b74c391 ipv4: clean up cookie_v4_check()
We can retrieve opt from skb, no need to pass it as a parameter.
And opt should always be non-NULL, no need to check.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-17 12:02:57 -04:00
Cong Wang e25f866fbc ipv4: share tcp_v4_save_options() with cookie_v4_check()
cookie_v4_check() allocates ip_options_rcu in the same way
with tcp_v4_save_options(), we can just make it a helper function.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-17 12:02:57 -04:00
Linus Torvalds 35a9ad8af0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Most notable changes in here:

   1) By far the biggest accomplishment, thanks to a large range of
      contributors, is the addition of multi-send for transmit.  This is
      the result of discussions back in Chicago, and the hard work of
      several individuals.

      Now, when the ->ndo_start_xmit() method of a driver sees
      skb->xmit_more as true, it can choose to defer the doorbell
      telling the driver to start processing the new TX queue entires.

      skb->xmit_more means that the generic networking is guaranteed to
      call the driver immediately with another SKB to send.

      There is logic added to the qdisc layer to dequeue multiple
      packets at a time, and the handling mis-predicted offloads in
      software is now done with no locks held.

      Finally, pktgen is extended to have a "burst" parameter that can
      be used to test a multi-send implementation.

      Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
      virtio_net

      Adding support is almost trivial, so export more drivers to
      support this optimization soon.

      I want to thank, in no particular or implied order, Jesper
      Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
      Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
      David Tat, Hannes Frederic Sowa, and Rusty Russell.

   2) PTP and timestamping support in bnx2x, from Michal Kalderon.

   3) Allow adjusting the rx_copybreak threshold for a driver via
      ethtool, and add rx_copybreak support to enic driver.  From
      Govindarajulu Varadarajan.

   4) Significant enhancements to the generic PHY layer and the bcm7xxx
      driver in particular (EEE support, auto power down, etc.) from
      Florian Fainelli.

   5) Allow raw buffers to be used for flow dissection, allowing drivers
      to determine the optimal "linear pull" size for devices that DMA
      into pools of pages.  The objective is to get exactly the
      necessary amount of headers into the linear SKB area pre-pulled,
      but no more.  The new interface drivers use is eth_get_headlen().
      From WANG Cong, with driver conversions (several had their own
      by-hand duplicated implementations) by Alexander Duyck and Eric
      Dumazet.

   6) Support checksumming more smoothly and efficiently for
      encapsulations, and add "foo over UDP" facility.  From Tom
      Herbert.

   7) Add Broadcom SF2 switch driver to DSA layer, from Florian
      Fainelli.

   8) eBPF now can load programs via a system call and has an extensive
      testsuite.  Alexei Starovoitov and Daniel Borkmann.

   9) Major overhaul of the packet scheduler to use RCU in several major
      areas such as the classifiers and rate estimators.  From John
      Fastabend.

  10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
      Duyck.

  11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
      Dumazet.

  12) Add Datacenter TCP congestion control algorithm support, From
      Florian Westphal.

  13) Reorganize sk_buff so that __copy_skb_header() is significantly
      faster.  From Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
  netlabel: directly return netlbl_unlabel_genl_init()
  net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
  net: description of dma_cookie cause make xmldocs warning
  cxgb4: clean up a type issue
  cxgb4: potential shift wrapping bug
  i40e: skb->xmit_more support
  net: fs_enet: Add NAPI TX
  net: fs_enet: Remove non NAPI RX
  r8169:add support for RTL8168EP
  net_sched: copy exts->type in tcf_exts_change()
  wimax: convert printk to pr_foo()
  af_unix: remove 0 assignment on static
  ipv6: Do not warn for informational ICMP messages, regardless of type.
  Update Intel Ethernet Driver maintainers list
  bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
  tipc: fix bug in multicast congestion handling
  net: better IFF_XMIT_DST_RELEASE support
  net/mlx4_en: remove NETDEV_TX_BUSY
  3c59x: fix bad split of cpu_to_le32(pci_map_single())
  net: bcmgenet: fix Tx ring priority programming
  ...
2014-10-08 21:40:54 -04:00
Linus Torvalds d0cd84817c dmaengine-3.17
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
    maintainer update"
 
 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
    7787380336 "net_dma: mark broken"), without reports of performance
    regression.
 
 3/ Miscellaneous fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUKDLKAAoJEB7SkWpmfYgC7wwP/iNHqRjf1suMUTBIF3P6Hgbe
 VCUwh0IkuujMPDG46WRn6cYzarRxVPLoGaLHLPszgjI6pmGPVv19wqeDOlUxtcmr
 0iQWEWv/zqseaAIW+4gj/WYCyMgKil49EUBJKCZCfNmIaad+e0pr8f0uE5yOkHPM
 tqWoZERu9A4dlXGr1TjeOZVzdnPrCt92MrLDN6ZZ6tMuJaEc5PauaLxKTeGy5fYj
 UB+k1xJQzECbsYfpB+uCVYl5/qPO1rNyuBYS8THCsW+JYmrbbfH2kkF2lo2FaUpO
 8Yd50FtzXHKWwAt7BzfIwU2M7x0wRmryrC/xsQi6M+WmVeHYvvHUIpzaA66xRZ5x
 fCy3Fu8sEnmnmboAbh2v2c5uTycqRl2xPzbpLAuxglloXIxzi3ckp6ESF/Z4SldH
 oxIoEievN7lah3vKgvlHZYcWDzrYr8EKf/EzFe9RqDBQDKtzDzre1H9Uivr387Vm
 uFUcGHYG/GXuX47C7EUsMtaSW2UEoR2ytw/HR6CKFPTVXwAzEO6kA9vg0EqL0iIq
 2wVLgavlZuwegmaUBgnr+bgVZMvVN7OU7fAIRVe5xNO6itrPKvheSlQthmRiiq9C
 uzOu4PS6PexqzHUNPCcJpCsj+lawmCSrE0bxtPzTA/CQInVgWs219V9+W5Gn/0YA
 EARN9k6ueX9PZPQrPQLm
 =BBBv
 -----END PGP SIGNATURE-----

Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine

Pull dmaengine updates from Dan Williams:
 "Even though this has fixes marked for -stable, given the size and the
  needed conflict resolutions this is 3.18-rc1/merge-window material.

  These patches have been languishing in my tree for a long while.  The
  fact that I do not have the time to do proper/prompt maintenance of
  this tree is a primary factor in the decision to step down as
  dmaengine maintainer.  That and the fact that the bulk of drivers/dma/
  activity is going through Vinod these days.

  The net_dma removal has not been in -next.  It has developed simple
  conflicts against mainline and net-next (for-3.18).

  Continuing thanks to Vinod for staying on top of drivers/dma/.

  Summary:

   1/ Step down as dmaengine maintainer see commit 08223d80df
      "dmaengine maintainer update"

   2/ Removal of net_dma, as it has been marked 'broken' since 3.13
      (commit 7787380336 "net_dma: mark broken"), without reports of
      performance regression.

   3/ Miscellaneous fixes"

* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
  net: make tcp_cleanup_rbuf private
  net_dma: revert 'copied_early'
  net_dma: simple removal
  dmaengine maintainer update
  dmatest: prevent memory leakage on error path in thread
  ioat: Use time_before_jiffies()
  dmaengine: fix xor sources continuation
  dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
  dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
  dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
  ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
  drivers: dma: Include appropriate header file in dca.c
  drivers: dma: Mark functions as static in dma_v3.c
  dma: mv_xor: Add DMA API error checks
  ioat/dca: Use dev_is_pci() to check whether it is pci device
2014-10-07 20:39:25 -04:00
Eric Dumazet 971f10eca1 tcp: better TCP_SKB_CB layout to reduce cache line misses
TCP maintains lists of skb in write queue, and in receive queues
(in order and out of order queues)

Scanning these lists both in input and output path usually requires
access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq

These fields are currently in two different cache lines, meaning we
waste lot of memory bandwidth when these queues are big and flows
have either packet drops or packet reorders.

We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
this header is not used in fast path. This allows TCP to search much faster
in the skb lists.

Even with regular flows, we save one cache line miss in fast path.

Thanks to Christoph Paasch for noticing we need to cleanup
skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28 16:35:43 -04:00
Eric Dumazet 24a2d43d88 ipv4: rename ip_options_echo to __ip_options_echo()
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
Lets break this assumption, but provide a helper to not change all call points.

ip_send_unicast_reply() gets a new struct ip_options pointer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28 16:35:42 -04:00
Dan Williams 7bced39751 net_dma: simple removal
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.

This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.

Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():

    https://lkml.org/lkml/2014/9/3/177

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-28 07:05:16 -07:00
Eric Dumazet fcdd1cf4dd tcp: avoid possible arithmetic overflows
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.

Better use 64bit to perform icsk_rto << icsk_backoff operations

As Joe Perches suggested, add a helper for this.

Yuchung spotted the tcp_v4_err() case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22 16:27:10 -04:00
Eric Dumazet e11ecddf51 tcp: use TCP_SKB_CB(skb)->tcp_flags in input path
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags,
which is only used in output path.

tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue,
and its unfortunate because this bit is located in a cache line right before
the payload.

We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.

This patch does so, and avoids the cache line miss in tcp_recvmsg()

Following patches will
- allow a segment with FIN being coalesced in tcp_try_coalesce()
- simplify tcp_collapse() by not copying the headers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-15 14:41:07 -04:00
Eric Dumazet ca777eff51 tcp: remove dst refcount false sharing for prequeue mode
Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on a TCP socket, using a blocking model
rather than select()/poll()/epoll() non blocking one.

We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.

Commit 093162553c (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.

It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.

We can add a logic to detect when IP early demux is probably going
to use skb->dst. Because we do an optimistic check rather than duplicate
existing logic, we need to guard inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() from using a NULL dst.

Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.

Tested using Alexander script on a 40Gb NIC, 8 RX queues.
Hosts have 24 cores, 48 hyper threads.

echo 0 >/proc/sys/net/ipv4/tcp_autocorking

for i in `seq 0 47`
do
  for j in `seq 0 2`
  do
     netperf -H $DEST -t TCP_STREAM -l 1000 \
             -c -C -T $i,$i -P 0 -- \
             -m 64 -s 64K -D &
  done
done

Before patch : ~6Mpps and ~95% cpu usage on receiver
After patch : ~9Mpps and ~35% cpu usage on receiver.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 16:54:41 -07:00
Eric Dumazet 7faee5c0d5 tcp: remove TCP_SKB_CB(skb)->when
After commit 740b0f1841 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.

TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:49:33 -07:00
Eric Dumazet 04317dafd1 tcp: introduce TCP_SKB_CB(skb)->tcp_tw_isn
TCP_SKB_CB(skb)->when has different meaning in output and input paths.

In output path, it contains a timestamp.
In input path, it contains an ISN, chosen by tcp_timewait_state_process()

Lets add a different name to ease code comprehension.

Note that 'when' field will disappear in following patch,
as skb_mstamp already contains timestamp, the anonymous
union will promptly disappear as well.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:49:33 -07:00
stephen hemminger 688d1945bc tcp: whitespace fixes
Fix places where there is space before tab, long lines, and
awkward if(){, double spacing etc. Add blank line after declaration/initialization.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 18:12:45 -07:00
Neal Cardwell 4fab907195 tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
Make sure we use the correct address-family-specific function for
handling MTU reductions from within tcp_release_cb().

Previously AF_INET6 sockets were incorrectly always using the IPv6
code path when sometimes they were handling IPv4 traffic and thus had
an IPv4 dst.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Dmitry Popov 9ea88a1530 tcp: md5: check md5 signature without socket lock
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.

Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 16:00:20 -07:00
Dmitry Popov 64a124edcc tcp: md5: remove unneeded check in tcp_v4_parse_md5_keys
tcpm_key is an array inside struct tcp_md5sig, there is no need to check it
against NULL.

Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04 15:09:45 -07:00
Duan Jiong 7304fe4681 net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS
When dealing with ICMPv[46] Error Message, function icmp_socket_deliver()
and icmpv6_notify() do some valid checks on packet's length, but then some
protocols check packet's length redaudantly. So remove those duplicated
statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in
function icmp_socket_deliver() and icmpv6_notify() respectively.

In addition, add missed counter in udp6/udplite6 when socket is NULL.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 22:04:18 -07:00
Tom Herbert b73c3d0e4f net: Save TX flow hash in sock and set in skbuf on xmit
For a connected socket we can precompute the flow hash for setting
in skb->hash on output. This is a performance advantage over
calculating the skb->hash for every packet on the connection. The
computation is done using the common hash algorithm to be consistent
with computations done for packets of the connection in other states
where thers is no socket (e.g. time-wait, syn-recv, syn-cookies).

This patch adds sk_txhash to the sock structure. inet_set_txhash and
ip6_set_txhash functions are added which are called from points in
TCP and UDP where socket moves to established state.

skb_set_hash_from_sk is a function which sets skb->hash from the
sock txhash value. This is called in UDP and TCP transmit path when
transmitting within the context of a socket.

Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
interface (in this case skb_get_hash called on every TX packet to
create a UDP source port).

Before fix:

  95.02% CPU utilization
  154/256/505 90/95/99% latencies
  1.13042e+06 tps

  Time in functions:
    0.28% skb_flow_dissect
    0.21% __skb_get_hash

After fix:

  94.95% CPU utilization
  156/254/485 90/95/99% latencies
  1.15447e+06

  Neither __skb_get_hash nor skb_flow_dissect appear in perf

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07 21:14:21 -07:00
Neal Cardwell 86c6a2c75a tcp: switch snt_synack back to measuring transmit time of first SYNACK
Always store in snt_synack the time at which the server received the
first client SYN and attempted to send the first SYNACK.

Recent commit aa27fc501 ("tcp: tcp_v[46]_conn_request: fix snt_synack
initialization") resolved an inconsistency between IPv4 and IPv6 in
the initialization of snt_synack. This commit brings back the idea
from 843f4a55e (tcp: use tcp_v4_send_synack on first SYN-ACK), which
was going for the original behavior of snt_synack from the commit
where it was added in 9ad7c049f0 ("tcp: RFC2988bis + taking RTT
sample from 3WHS for the passive open side") in v3.1.

In addition to being simpler (and probably a tiny bit faster),
unconditionally storing the time of the first SYNACK attempt has been
useful because it allows calculating a performance metric quantifying
how long it took to establish a passive TCP connection.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Octavian Purdila <octavian.purdila@intel.com>
Cc: Jerry Chu <hkchu@google.com>
Acked-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07 19:26:37 -07:00
Octavian Purdila 1fb6f159fd tcp: add tcp_conn_request
Create tcp_conn_request and remove most of the code from
tcp_v4_conn_request and tcp_v6_conn_request.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:37 -07:00
Octavian Purdila 695da14eb0 tcp: add queue_add_hash to tcp_request_sock_ops
Add queue_add_hash member to tcp_request_sock_ops so that we can later
unify tcp_v4_conn_request and tcp_v6_conn_request.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:36 -07:00
Octavian Purdila 2aec4a297b tcp: add mss_clamp to tcp_request_sock_ops
Add mss_clamp member to tcp_request_sock_ops so that we can later
unify tcp_v4_conn_request and tcp_v6_conn_request.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:36 -07:00
Octavian Purdila 5db92c9949 tcp: unify tcp_v4_rtx_synack and tcp_v6_rtx_synack
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:36 -07:00
Octavian Purdila d6274bd8d6 tcp: add send_synack method to tcp_request_sock_ops
Create a new tcp_request_sock_ops method to unify the IPv4/IPv6
signature for tcp_v[46]_send_synack. This allows us to later unify
tcp_v4_rtx_synack with tcp_v6_rtx_synack and tcp_v4_conn_request with
tcp_v4_conn_request.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:36 -07:00
Octavian Purdila 936b8bdb53 tcp: add init_seq method to tcp_request_sock_ops
More work in preparation of unifying tcp_v4_conn_request and
tcp_v6_conn_request: indirect the init sequence calls via the
tcp_request_sock_ops.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:36 -07:00
Octavian Purdila d94e0417ad tcp: add route_req method to tcp_request_sock_ops
Create wrappers with same signature for the IPv4/IPv6 request routing
calls and use these wrappers (via route_req method from
tcp_request_sock_ops) in tcp_v4_conn_request and tcp_v6_conn_request
with the purpose of unifying the two functions in a later patch.

We can later drop the wrapper functions and modify inet_csk_route_req
and inet6_cks_route_req to use the same signature.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:36 -07:00
Octavian Purdila fb7b37a7f3 tcp: add init_cookie_seq method to tcp_request_sock_ops
Move the specific IPv4/IPv6 cookie sequence initialization to a new
method in tcp_request_sock_ops in preparation for unifying
tcp_v4_conn_request and tcp_v6_conn_request.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:35 -07:00
Octavian Purdila 16bea70aa7 tcp: add init_req method to tcp_request_sock_ops
Move the specific IPv4/IPv6 intializations to a new method in
tcp_request_sock_ops in preparation for unifying tcp_v4_conn_request
and tcp_v6_conn_request.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:35 -07:00
Octavian Purdila aa27fc5018 tcp: tcp_v[46]_conn_request: fix snt_synack initialization
Commit 016818d07 (tcp: TCP Fast Open Server - take SYNACK RTT after
completing 3WHS) changes the code to only take a snt_synack timestamp
when a SYNACK transmit or retransmit succeeds. This behaviour is later
broken by commit 843f4a55e (tcp: use tcp_v4_send_synack on first
SYN-ACK), as snt_synack is now updated even if tcp_v4_send_synack
fails.

Also, commit 3a19ce0ee (tcp: IPv6 support for fastopen server) misses
the required IPv6 updates for 016818d07.

This patch makes sure that snt_synack is updated only when the SYNACK
trasnmit/retransmit succeeds, for both IPv4 and IPv6.

Cc: Cardwell <ncardwell@google.com>
Cc: Daniel Lee <longinus00@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27 15:53:35 -07:00
Octavian Purdila e0f802fbca tcp: move ir_mark initialization to tcp_openreq_init
ir_mark initialization is done for both TCP v4 and v6, move it in the
common tcp_openreq_init function.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-17 15:30:54 -07:00
Lorenzo Colitti 84f39b08d7 net: support marking accepting TCP sockets
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.

This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.

This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established.  If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.

Black-box tested using user-mode linux:

- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
  mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
  incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 18:35:09 -04:00
Yuchung Cheng 0a672f7413 tcp: improve fastopen icmp handling
If a fast open socket is already accepted by the user, it should
be treated like a connected socket to record the ICMP error in
sk_softerr, so the user can fetch it. Do that in both tcp_v4_err
and tcp_v6_err.

Also refactor the sequence window check to improve readability
(e.g., there were two local variables named 'req').

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 17:53:03 -04:00
Yuchung Cheng 843f4a55e3 tcp: use tcp_v4_send_synack on first SYN-ACK
To avoid large code duplication in IPv6, we need to first simplify
the complicate SYN-ACK sending code in tcp_v4_conn_request().

To use tcp_v4(6)_send_synack() to send all SYN-ACKs, we need to
initialize the mini socket's receive window before trying to
create the child socket and/or building the SYN-ACK packet. So we move
that initialization from tcp_make_synack() to tcp_v4_conn_request()
as a new function tcp_openreq_init_req_rwin().

After this refactoring the SYN-ACK sending code is simpler and easier
to implement Fast Open for IPv6.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 17:53:02 -04:00
Yuchung Cheng 89278c9dc9 tcp: simplify fast open cookie processing
Consolidate various cookie checking and generation code to simplify
the fast open processing. The main goal is to reduce code duplication
in tcp_v4_conn_request() for IPv6 support.

Removes two experimental sysctl flags TFO_SERVER_ALWAYS and
TFO_SERVER_COOKIE_NOT_CHKD used primarily for developmental debugging
purposes.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 17:53:02 -04:00
Yuchung Cheng 5b7ed0892f tcp: move fastopen functions to tcp_fastopen.c
Move common TFO functions that will be used by both v4 and v6
to tcp_fastopen.c. Create a helper tcp_fastopen_queue_check().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 17:53:02 -04:00
Tom Herbert ed70fcfcee net: Call skb_checksum_init in IPv4
Call skb_checksum_init instead of private functions.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-05 15:26:30 -04:00
David S. Miller 676d23690f net: Fix use after free by removing length arg from sk_data_ready callbacks.
Several spots in the kernel perform a sequence like:

	skb_queue_tail(&sk->s_receive_queue, skb);
	sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up.  So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument.  And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11 16:15:36 -04:00
David S. Miller 64c27237a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/marvell/mvneta.c

The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29 18:48:54 -04:00
Eric Dumazet e2a1d3e47b tcp: fix get_timewait4_sock() delay computation on 64bit
It seems I missed one change in get_timewait4_sock() to compute
the remaining time before deletion of IPV4 timewait socket.

This could result in wrong output in /proc/net/tcp for tm->when field.

Fixes: 96f817fede ("tcp: shrink tcp6_timewait_sock by one cache line")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:43:14 -04:00
Daniel Baluta e35bad5d87 net: remove empty lines from tcp_syn_flood_action
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20 16:17:22 -04:00
Yuchung Cheng f19c29e3e3 tcp: snmp stats for Fast Open, SYN rtx, and data pkts
Add the following snmp stats:

TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.

TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.

Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Lawrence Brakmo <brakmo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03 15:58:03 -05:00
Eric Dumazet 740b0f1841 tcp: switch rtt estimations to usec resolution
Upcoming congestion controls for TCP require usec resolution for RTT
estimations. Millisecond resolution is simply not enough these days.

FQ/pacing in DC environments also require this change for finer control
and removal of bimodal behavior due to the current hack in
tcp_update_pacing_rate() for 'small rtt'

TCP_CONG_RTT_STAMP is no longer needed.

As Julian Anastasov pointed out, we need to keep user compatibility :
tcp_metrics used to export RTT and RTTVAR in msec resolution,
so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
to use the new attributes if provided by the kernel.

In this example ss command displays a srtt of 32 usecs (10Gbit link)

lpk51:~# ./ss -i dst lpk52
Netid  State      Recv-Q Send-Q   Local Address:Port       Peer
Address:Port
tcp    ESTAB      0      1         10.246.11.51:42959
10.246.11.52:64614
         cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
cwnd:10 send
3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559

Updated iproute2 ip command displays :

lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
10.246.11.51

Old binary displays :

lpk51:~# ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
10.246.11.51

With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Larry Brakmo <brakmo@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 17:08:40 -05:00
Peter Pan(潘卫平) 4d83e17730 tcp: delete redundant calls of tcp_mtup_init()
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen
sock, we can delete the redundant calls of tcp_mtup_init() in
tcp_{v4,v6}_syn_recv_sock().

Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:52:31 -08:00
Weilong Chen d41db5af26 ipv4: fix checkpatch error with foo * bar
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Steffen Klassert 0e0d44ab42 net: Remove FLOWI_FLAG_CAN_SLEEP
FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-12-06 07:24:39 +01:00
Eric Dumazet f1d8cba61c inet: fix possible seqlock deadlocks
In commit c9e9042994 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.

udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.

This was detected by lockdep seqlock support.

Reported-by: jongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8cbd ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:37:36 -05:00
Tetsuo Handa 652586df95 seq_file: remove "%n" usage from seq_file users
All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Hannes Frederic Sowa 482fc6094a ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
their sockets won't accept and install new path mtu information and they
will always use the interface mtu for outgoing packets. It is guaranteed
that the packet is not fragmented locally. But we won't set the DF-Flag
on the outgoing frames.

Florian Weimer had the idea to use this flag to ensure DNS servers are
never generating outgoing fragments. They may well be fragmented on the
path, but the server never stores or usees path mtu values, which could
well be forged in an attack.

(The root of the problem with path MTU discovery is that there is
no reliable way to authenticate ICMP Fragmentation Needed But DF Set
messages because they are sent from intermediate routers with their
source addresses, and the IMCP payload will not always contain sufficient
information to identify a flow.)

Recent research in the DNS community showed that it is possible to
implement an attack where DNS cache poisoning is feasible by spoofing
fragments. This work was done by Amir Herzberg and Haya Shulman:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>

This issue was previously discussed among the DNS community, e.g.
<http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
without leading to fixes.

This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
regarding local fragmentation with UFO/CORK" for the enforcement of the
non-fragmentable checks. If other users than ip_append_page/data should
use this semantic too, we have to add a new flag to IPCB(skb)->flags to
suppress local fragmentation and check for this in ip_finish_output.

Many thanks to Florian Weimer for the idea and feedback while implementing
this patch.

Cc: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-05 21:52:27 -05:00
Eric W. Biederman a4fe34bf90 tcp_memcontrol: Remove the per netns control.
The code that is implemented is per memory cgroup not per netns, and
having per netns bits is just confusing.  Remove the per netns bits to
make it easier to see what is really going on.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21 18:43:02 -04:00
Eric Dumazet 634fb979e8 inet: includes a sock_common in request_sock
TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port   -> ir_loc_port
loc_addr   -> ir_loc_addr
rmt_addr   -> ir_rmt_addr
rmt_port   -> ir_rmt_port
iif        -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
Eric Dumazet 05dbc7b594 tcp/dccp: remove twchain
TCP listener refactoring, part 3 :

Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.

Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.

As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.

If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.

[ INET_TW_MATCH() is no longer needed ]

I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()

This way, SYN_RECV pseudo sockets will be supported the same.

A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].

Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()

Before patch :

dmesg | grep "TCP established"

TCP established hash table entries: 524288 (order: 11, 8388608 bytes)

After patch :

TCP established hash table entries: 524288 (order: 10, 4194304 bytes)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 23:19:24 -04:00
Eric Dumazet 6ae705323b tcp: sndbuf autotuning improvements
tcp_fixup_sndbuf() is underestimating initial send buffer requirements.

It was not noticed because big GSO packets were escaping the limitation,
but with smaller TSO packets (or TSO/GSO/SG off), application hits
sk_sndbuf before having a chance to fill enough packets in socket write
queue.

- initial cwnd can be bigger than 10 for specific routes

- SKB_TRUESIZE() is a bit under real needs in some cases,
  because of power-of-two rounding in kmalloc()

- Fast Recovery (RFC 5681 3.2) : Cubic needs 70% factor

- Extra cushion (application might react slowly to POLLOUT)

tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
calling tcp_init_buffer_space()

Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
instead of duplicating this stuff.

Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
descriptive.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-02 16:45:17 -04:00
Vijay Subramanian c995ae2259 tcp: Change return value of tcp_rcv_established()
tcp_rcv_established() returns only one value namely 0. We change the return
value to void (as suggested by David Miller).

After commit 0c24604b (tcp: implement RFC 5961 4.2), we no longer send RSTs in
response to SYNs. We can remove the check and processing on the return value of
tcp_rcv_established().

We also fix jtcp_rcv_established() in tcp_probe.c to match that of
tcp_rcv_established().

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 00:27:28 -04:00
Christoph Paasch 397b417463 tcp: trivial: Remove nocache argument from tcp_v4_send_synack
The nocache-argument was used in tcp_v4_send_synack as an argument to
inet_csk_route_req. However, since ba3f7f04ef (ipv4: Kill
FLOWI_FLAG_RT_NOCACHE and associated code.) this is no more used.

This patch removes the unsued argument from tcp_v4_send_synack.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:05:04 -07:00
Francesco Fusco d14c5ab6be net: proc_fs: trivial: print UIDs as unsigned int
UIDs are printed in the proc_fs as signed int, whereas
they are unsigned int.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 14:37:46 -07:00
Yuchung Cheng 149479d019 tcp: add server ip to encrypt cookie in fast open
Encrypt the cookie with both server and client IPv4 addresses,
such that multi-homed server will grant different cookies
based on both the source and destination IPs. No client change
is needed since cookie is opaque to the client.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10 00:35:33 -07:00
Hannes Frederic Sowa 5ad37d5dee tcp: add tcp_syncookies mode to allow unconditionally generation of syncookies
| If you want to test which effects syncookies have to your
| network connections you can set this knob to 2 to enable
| unconditionally generation of syncookies.

Original idea and first implementation by Eric Dumazet.

Cc: Florian Westphal <fw@strlen.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 16:15:18 -07:00
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Yuchung Cheng 375fe02c91 tcp: consolidate SYNACK RTT sampling
The first patch consolidates SYNACK and other RTT measurement to use a
central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
RTT measurement happens after PAWS check, potentially reducing the
impact of RTO seeding on bad TCP timestamps values.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Eliezer Tamir 8b80cda536 net: rename ll methods to busy-poll
Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all useres of these functions.
Update comments and defines  in include/net/busy_poll.h

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir 076bb0c82a net: rename include/net/ll_poll.h to include/net/busy_poll.h
Rename the file and correct all the places where it is included.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
David S. Miller 0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
Aydin Arik c0353c7b5d ipv4: Fixed MD5 key lookups when adding/ removing MD5 to/ from TCP sockets.
MD5 key lookups on a given TCP socket were being performed
incorrectly. This fix alters parameter inputs to the MD5
lookup function tcp_md5_do_lookup, which is called by functions
tcp_md5_do_add and tcp_md5_do_del. Specifically, the change now
inputs the correct address and address family required to make
a proper lookup.

Signed-off-by: Aydin Arik <aydin.arik@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 21:21:53 -07:00
Eliezer Tamir d30e383bb8 tcp: add low latency socket poll support.
Adds low latency socket poll support for TCP.
In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
from the skb to the sk.
In tcp_recvmsg(), when there is no data in the socket we busy-poll.
This is a good example of how to add busy-poll support to more protocols.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:36 -07:00
Daniel Borkmann 28850dc7c7 net: tcp: move GRO/GSO functions to tcp_offload
Would be good to make things explicit and move those functions to
a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c.
While moving all related functions into tcp_offload.c, we can also
make some of them static, since they are only used there. Also, add
an explicit registration function.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 14:39:05 -07:00
Eric Dumazet 71cea17ed3 tcp: md5: remove spinlock usage in fast path
TCP md5 code uses per cpu variables but protects access to them with
a shared spinlock, which is a contention point.

[ tcp_md5sig_pool_lock is locked twice per incoming packet ]

Makes things much simpler, by allocating crypto structures once, first
time a socket needs md5 keys, and not deallocating them as they are
really small.

Next step would be to allow crypto allocations being done in a NUMA
aware way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20 14:00:42 -07:00
Linus Torvalds 20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
David S. Miller 58717686cf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
	drivers/net/ethernet/emulex/benet/be.h
	include/net/tcp.h
	net/mac802154/mac802154.h

Most conflicts were minor overlapping stuff.

The be2net driver brought in some fixes that added __vlan_put_tag
calls, which in net-next take an additional argument.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30 03:55:20 -04:00
Eric Dumazet 6a5dc9e598 net: Add MIB counters for checksum errors
Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.

$ nstat -a | grep  Csum
IcmpInCsumErrors                72                 0.0
TcpInCsumErrors                 382                0.0
UdpInCsumErrors                 463221             0.0
Icmp6InCsumErrors               75                 0.0
Udp6InCsumErrors                173442             0.0
IpExtInCsumErrors               10884              0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 15:14:03 -04:00
Al Viro d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
David S. Miller 61816596d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull in the 'net' tree to get Daniel Borkmann's flow dissector
infrastructure change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 12:46:26 -04:00
Eric Dumazet 0d4f060861 tcp: dont handle MTU reduction on LISTEN socket
When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
LISTEN socket, and this socket is currently owned by the user, we
set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.

This is bad because if we clone the parent before it had a chance to
clear the flag, the child inherits the tsq_flags value, and next
tcp_release_cb() on the child will decrement sk_refcnt.

Result is that we might free a live TCP socket, as reported by
Dormando.

IPv4: Attempt to release TCP socket in state 1

Fix this issue by testing sk_state against TCP_LISTEN early, so that we
set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)

This bug was introduced in commit 563d34d057
(tcp: dont drop MTU reduction indications)

Reported-by: dormando <dormando@rydia.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-18 13:31:28 -04:00
Christoph Paasch 1a2c6181c4 tcp: Remove TCPCT
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.

As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:

Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.

before this patch:
	average (among 7 runs) of 20845.5 Requests/Second
after:
	average (among 7 runs) of 21403.6 Requests/Second

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17 14:35:13 -04:00
Nandita Dukkipati 6ba8a3b19e tcp: Tail loss probe (TLP)
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.

The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 08:30:34 -04:00
Eric Dumazet b2fb4f54ec tcp: uninline tcp_prequeue()
tcp_prequeue() became too big to be inlined.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-07 16:22:39 -05:00
Sasha Levin b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Gao feng ece31ffd53 net: proc: change proc_net_remove to remove_proc_entry
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Andrey Vagin ee684b6f28 tcp: send packets with a socket timestamp
A socket timestamp is a sum of the global tcp_time_stamp and
a per-socket offset.

A socket offset is added in places where externally visible
tcp timestamp option is parsed/initialized.

Connections in the SYN_RECV state are not supported, global
tcp_time_stamp is used for them, because repair mode doesn't support
this state. In a future it can be implemented by the similar way
as for TIME_WAIT sockets.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-13 13:22:16 -05:00
David S. Miller 188d1f76d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/ethtool.c
	drivers/net/vmxnet3/vmxnet3_drv.c
	drivers/net/wireless/iwlwifi/dvm/tx.c
	net/ipv6/route.c

The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.

The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.

The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.

The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-05 14:12:20 -05:00
Vijay Subramanian 848bf15f36 tcp: Update MIB counters for drops
This patch updates LINUX_MIB_LISTENDROPS in tcp_v4_conn_request() and
tcp_v4_err(). tcp_v4_conn_request() in particular can drop SYNs for various
reasons which are not currently tracked.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-04 13:06:27 -05:00
Nivedita Singhvi 2aeef18d37 tcp: Increment LISTENOVERFLOW and LISTENDROPS in tcp_v4_conn_request()
We drop a connection request if the accept backlog is full and there are
sufficient packets in the syn queue to warrant starting drops. Increment the
appropriate counters so this isn't silent, for accurate stats and help in
debugging.

This patch assumes LINUX_MIB_LISTENDROPS is a superset of/includes the
counter LINUX_MIB_LISTENOVERFLOWS.

Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:43:04 -05:00
David S. Miller f1e7b73acc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:32:13 -05:00
Tom Herbert da5e36308d soreuseport: TCP/IPv4 implementation
Allow multiple listener sockets to bind to the same port.

Motivation for soresuseport would be something like a web server
binding to port 80 running with multiple threads, where each thread
might have it's own listener socket.  This could be done as an
alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets.  We have seen the  disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest.  With so_reusport the distribution is
uniform.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-23 13:44:01 -05:00
Eric Dumazet b74aa930ef tcp: fix incorrect LOCKDROPPEDICMPS counter
commit 563d34d057 (tcp: dont drop MTU reduction indications)
added an error leading to incorrect accounting of
LINUX_MIB_LOCKDROPPEDICMPS

If socket is owned by the user, we want to increment
this SNMP counter, unless the message is a
(ICMP_DEST_UNREACH,ICMP_FRAG_NEEDED) one.

Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-20 17:22:05 -05:00
Hannes Frederic Sowa 5d134f1c1f tcp: make sysctl_tcp_ecn namespace aware
As per suggestion from Eric Dumazet this patch makes tcp_ecn sysctl
namespace aware.  The reason behind this patch is to ease the testing
of ecn problems on the internet and allows applications to tune their
own use of ecn.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-06 21:09:56 -08:00
Christoph Paasch e337e24d66 inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock
If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:

unreferenced object 0xffff88022e8a92c0 (size 1592):
  comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
  hex dump (first 32 bytes):
    0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
    02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
    [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
    [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
    [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
    [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
    [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
    [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
    [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
    [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
    [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
    [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
    [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
    [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
    [<ffffffff814cee68>] ip_rcv+0x217/0x267
    [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
    [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82

This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.

This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
xfrm,...

Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.

Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

A similar approach is taken for dccp by calling dccp_done().

This is in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), thus since
version >= 2.6.37.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-14 13:14:07 -05:00
Andrey Vagin 2b9164771e ipv6: adapt connect for repair move
This is work the same as for ipv4.

All other hacks about tcp repair are in common code for ipv4 and ipv6,
so this patch is enough for repairing ipv6 connections.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-22 15:30:14 -05:00
David S. Miller d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Eric Dumazet e6c022a4fa tcp: better retrans tracking for defer-accept
For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 14:45:00 -04:00
stephen hemminger e0683e707c tcp: make tcp_clear_md5_list static
Trivial. Only used in one file.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-31 13:56:40 -04:00
Eric Dumazet 45f00f99d6 ipv4: tcp: clean up tcp_v4_early_demux()
Use same header helpers than tcp_v6_early_demux() because they
are a bit faster, and as they make IPv4/IPv6 versions look
the same.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-23 13:03:44 -04:00
Yuchung Cheng 6f73601efb tcp: add SYN/data info to TCP_INFO
Add a bit TCPI_OPT_SYN_DATA (32) to the socket option TCP_INFO:tcpi_options.
It's set if the data in SYN (sent or received) is acked by SYN-ACK. Server or
client application can use this information to check Fast Open success rate.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-22 15:16:06 -04:00
Alexey Kuznetsov 4c67525849 tcp: resets are misrouted
After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
sock case").. tcp resets are always lost, when routing is asymmetric.
Yes, backing out that patch will result in misrouting of resets for
dead connections which used interface binding when were alive, but we
actually cannot do anything here.  What's died that's died and correct
handling normal unbound connections is obviously a priority.

Comment to comment:
> This has few benefits:
>   1. tcp_v6_send_reset already did that.

It was done to route resets for IPv6 link local addresses. It was a
mistake to do so for global addresses. The patch fixes this as well.

Actually, the problem appears to be even more serious than guaranteed
loss of resets.  As reported by Sergey Soloviev <sol@eqv.ru>, those
misrouted resets create a lot of arp traffic and huge amount of
unresolved arp entires putting down to knees NAT firewalls which use
asymmetric routing.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
2012-10-12 13:52:40 -04:00
Eric Dumazet 861b650101 tcp: gro: add checksuming helpers
skb with CHECKSUM_NONE cant currently be handled by GRO, and
we notice this deep in GRO stack in tcp[46]_gro_receive()

But there are cases where GRO can be a benefit, even with a lack
of checksums.

This preliminary work is needed to add GRO support
to tunnels.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 17:00:27 -04:00
Christoph Paasch 5dff747b70 tcp: Remove unused parameter from tcp_v4_save_options
struct sock *sk is not used inside tcp_v4_save_options. Thus it can be
removed.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 19:20:26 -04:00
Eric Dumazet 5640f76858 net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:31:37 -04:00
Neal Cardwell 016818d076 tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS
When taking SYNACK RTT samples for servers using TCP Fast Open, fix
the code to ensure that we only call tcp_valid_rtt_meas() after we
receive the ACK that completes the 3-way handshake.

Previously we were always taking an RTT sample in
tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
to take the RTT sample.

To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
already ensures that we only take a SYNACK RTT sample if snt_synack is
non-zero. To be careful, we only take a snt_synack timestamp when
a SYNACK transmit or retransmit succeeds.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22 15:47:10 -04:00
Neal Cardwell 623df484a7 tcp: extract code to compute SYNACK RTT
In preparation for adding another spot where we compute the SYNACK
RTT, extract this code so that it can be shared.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22 15:47:10 -04:00
Jerry Chu 168a8f5805 tcp: TCP Fast Open Server - main code path
This patch adds the main processing path to complete the TFO server
patches.

A TFO request (i.e., SYN+data packet with a TFO cookie option) first
gets processed in tcp_v4_conn_request(). If it passes the various TFO
checks by tcp_fastopen_check(), a child socket will be created right
away to be accepted by applications, rather than waiting for the 3WHS
to finish.

In additon to the use of TFO cookie, a simple max_qlen based scheme
is put in place to fend off spoofed TFO attack.

When a valid ACK comes back to tcp_rcv_state_process(), it will cause
the state of the child socket to switch from either TCP_SYN_RECV to
TCP_ESTABLISHED, or TCP_FIN_WAIT1 to TCP_FIN_WAIT2. At this time
retransmission will resume for any unack'ed (data, FIN,...) segments.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Jerry Chu 8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
David S. Miller e6acb38480 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 18:54:37 -04:00
David S. Miller 1304a7343b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-08-22 14:21:38 -07:00
Eric Dumazet 144d56e910 tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added bug leading to following trace :

[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]

The bug comes from the fact that timer set in sk_reset_timer() can run
before we actually do the sock_hold(). socket refcount reaches zero and
we free the socket too soon.

timer handler is not allowed to reduce socket refcnt if socket is owned
by the user, or we need to change sk_reset_timer() implementation.

We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags

Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.

For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21 14:42:23 -07:00
Neal Cardwell fae6ef87fa net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()
This commit removes the sk_rx_dst_set calls from
tcp_create_openreq_child(), because at that point the icsk_af_ops
field of ipv6_mapped TCP sockets has not been set to its proper final
value.

Instead, to make sure we get the right sk_rx_dst_set variant
appropriate for the address family of the new connection, we have
tcp_v{4,6}_syn_recv_sock() directly call the appropriate function
shortly after the call to tcp_create_openreq_child() returns.

This also moves inet6_sk_rx_dst_set() to avoid a forward declaration
with the new approach.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 03:03:33 -07:00
Eric W. Biederman a7cb5a49bf userns: Print out socket uids in a user namespace aware fashion.
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:48:06 -07:00
Eric Dumazet 63d02d157e net: tcp: ipv6_mapped needs sk_rx_dst_set method
commit 5d299f3d3c (net: ipv6: fix TCP early demux) added a
regression for ipv6_mapped case.

[   67.422369] SELinux: initialized (dev autofs, type autofs), uses
genfs_contexts
[   67.449678] SELinux: initialized (dev autofs, type autofs), uses
genfs_contexts
[   92.631060] BUG: unable to handle kernel NULL pointer dereference at
(null)
[   92.631435] IP: [<          (null)>]           (null)
[   92.631645] PGD 0
[   92.631846] Oops: 0010 [#1] SMP
[   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
uhci_hcd
[   92.634294] CPU 0
[   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
[   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
(null)
[   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
[   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
0000000000000000
[   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
ffff88024827ad00
[   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
0000000000000000
[   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
ffff880254735380
[   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
7fffffffffff0218
[   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
knlGS:0000000000000000
[   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
00000000000007f0
[   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
task ffff880254b8cac0)
[   92.634294] Stack:
[   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
ffff880245fc7d68
[   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
ffff880245fc7d18
[   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
00000000000000ff
[   92.634294] Call Trace:
[   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
[   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
[   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
[   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
[   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
[   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
[   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
[   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
[   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
[   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
[   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
[   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
[   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
[   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
[   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
[   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
[   92.634294] Code:  Bad RIP value.
[   92.634294] RIP  [<          (null)>]           (null)
[   92.634294]  RSP <ffff880245fc7cb0>
[   92.634294] CR2: 0000000000000000
[   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
[   92.649146] Kernel panic - not syncing: Fatal exception in interrupt

Fix this using inet_sk_rx_dst_set(), and export this function in case
IPv6 is modular.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 20:56:09 -07:00
Eric Dumazet a399a80531 time: jiffies_delta_to_clock_t() helper to the rescue
Various /proc/net files sometimes report crazy timer values, expressed
in clock_t units.

This happens when an expired timer delta (expires - jiffies) is passed
to jiffies_to_clock_t().

This function has an overflow in :

return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);

commit cbbc719fcc (time: Change jiffies_to_clock_t() argument type
to unsigned long) only got around the problem.

As we cant output negative values in /proc/net/tcp without breaking
various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
that caps the negative delta to a 0 value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: hank <pyu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:17:03 -07:00
Eric Dumazet 5d299f3d3c net: ipv6: fix TCP early demux
IPv6 needs a cookie in dst_check() call.

We need to add rx_dst_cookie and provide a family independent
sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-06 13:33:21 -07:00
Linus Torvalds ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Andrew Morton c255a45805 memcg: rename config variables
Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Eric Dumazet 404e0a8b6a net: ipv4: fix RCU races on dst refcounts
commit c6cffba4ff (ipv4: Fix input route performance regression.)
added various fatal races with dst refcounts.

crashes happen on tcp workloads if routes are added/deleted at the same
time.

The dst_free() calls from free_fib_info_rcu() are clearly racy.

We need instead regular dst refcounting (dst_release()) and make
sure dst_release() is aware of RCU grace periods :

Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
before dst destruction for cached dst

Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
to make sure we dont increase a zero refcount (On a dst currently
waiting an rcu grace period before destruction)

rt_cache_route() must take a reference on the new cached route, and
release it if was not able to install it.

With this patch, my machines survive various benchmarks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-30 14:53:22 -07:00
Eric Dumazet 505fbcf035 ipv4: fix TCP early demux
commit 92101b3b2e (ipv4: Prepare for change of rt->rt_iif encoding.)
invalidated TCP early demux, because rx_dst_ifindex is not properly
initialized and checked.

Also remove the use of inet_iif(skb) in favor or skb->skb_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27 13:45:51 -07:00
Eric Dumazet 9cb429d692 tcp: early_demux fixes
1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux()
   and fix a potential bug if skb->head was reallocated
   (iph & th pointers were not reloaded)

TCP stack will pull/check headers anyway.

2) must reload iph in ip_rcv_finish() after early_demux()
 call since skb->head might have changed.

3) skb->dev->ifindex can be now replaced by skb->skb_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-24 13:54:15 -07:00
David S. Miller 92101b3b2e ipv4: Prepare for change of rt->rt_iif encoding.
Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the ipv4
specific code, and as well it should since doing the route caching for
ipv6 is pointless at the moment since it is not inspected in the ipv6
input paths yet.

Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:26 -07:00
Eric Dumazet 563d34d057 tcp: dont drop MTU reduction indications
ICMP messages generated in output path if frame length is bigger than
mtu are actually lost because socket is owned by user (doing the xmit)

One example is the ipgre_tunnel_xmit() calling
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

We had a similar case fixed in commit a34a101e1e (ipv6: disable GSO on
sockets hitting dst_allfrag).

Problem of such fix is that it relied on retransmit timers, so short tcp
sessions paid a too big latency increase price.

This patch uses the tcp_release_cb() infrastructure so that MTU
reduction messages (ICMP messages) are not lost, and no extra delay
is added in TCP transmits.

Reported-by: Maciej Żenczykowski <maze@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 00:58:46 -07:00
David S. Miller ba3f7f04ef ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:54 -07:00