linux-sg2042/net/ipv4
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
..
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-07-03 14:55:13 -07:00
Kconfig Kconfig: remove dangling references to the deleted file 2013-06-04 15:17:39 -07:00
Makefile net: gre: move GSO functions to gre_offload 2013-07-03 14:37:39 -07:00
af_inet.c gro: remove a sparse error 2013-06-12 15:03:24 -07:00
ah4.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
arp.c arp: flush arp cache on IFF_NOARP change 2013-05-28 13:11:02 -07:00
cipso_ipv4.c cipso: don't follow a NULL pointer when setsockopt() is called 2012-07-18 09:01:12 -07:00
datagram.c ipv4: Add a socket release callback for datagram sockets 2013-01-21 14:17:05 -05:00
devinet.c net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
esp4.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
fib_frontend.c netlink: fix splat in skb_clone with large messages 2013-06-27 22:44:16 -07:00
fib_lookup.h
fib_rules.c sections: fix section conflicts in net 2012-10-06 03:04:45 +09:00
fib_semantics.c ipv4: use next hop exceptions also for input routes 2013-06-28 21:27:47 -07:00
fib_trie.c fib_trie: no need to delay vfree() 2013-05-06 11:06:51 -04:00
gre_demux.c net: gre: move GSO functions to gre_offload 2013-07-03 14:37:39 -07:00
gre_offload.c gso: Update tunnel segmentation to support Tx checksum offload 2013-07-11 12:18:49 -07:00
icmp.c icmp: avoid allocating large struct on stack 2013-06-03 00:28:44 -07:00
igmp.c net: convert resend IGMP to notifier event 2013-07-23 16:52:47 -07:00
inet_connection_sock.c tcp: Remove TCPCT 2013-03-17 14:35:13 -04:00
inet_diag.c netlink: rename ssk to sk in struct netlink_skb_params 2013-04-19 14:57:56 -04:00
inet_fragment.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-07-09 18:24:39 -07:00
inet_hashtables.c inet: fix spacing in assignment 2013-07-11 12:02:39 -07:00
inet_lro.c ipv4: replace ip_fast_csum with csum_replace2 2013-03-15 09:12:25 -04:00
inet_timewait_sock.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
inetpeer.c Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2012-10-02 09:54:49 -07:00
ip_forward.c ipv4: introduce rt_uses_gateway 2012-10-08 17:42:36 -04:00
ip_fragment.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-04-22 20:32:51 -04:00
ip_gre.c gre: fix a regression in ioctl 2013-07-01 23:35:22 -07:00
ip_input.c ipv4: set transport header earlier 2013-07-16 12:59:28 -07:00
ip_options.c net/ipv4: Ensure that location of timestamp option is stored 2013-03-12 05:35:39 -04:00
ip_output.c ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functions 2013-05-11 16:12:44 -07:00
ip_sockglue.c net: prevent setting ttl=0 via IP_TTL 2013-01-08 17:57:10 -08:00
ip_tunnel.c gre: Fix MTU sizing check for gretap tunnels 2013-07-11 16:12:03 -07:00
ip_tunnel_core.c ip_tunnel: push generic protocol handling to ip_tunnel module. 2013-06-19 18:07:41 -07:00
ip_vti.c vti: switch to new ip tunnel code 2013-07-23 17:27:32 -07:00
ipcomp.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
ipconfig.c ipconfig: add informative timeout messages while waiting for carrier 2013-04-02 14:35:33 -04:00
ipip.c ipip: fix a regression in ioctl 2013-07-02 01:13:09 -07:00
ipmr.c ipmr: change the prototype of ip_mr_forward(). 2013-07-23 17:01:05 -07:00
netfilter.c netfilter: add my copyright statements 2013-04-18 20:27:55 +02:00
ping.c net: ping_check_bind_addr() etc. can be static 2013-06-13 01:36:41 -07:00
proc.c net: add low latency socket poll 2013-06-10 21:22:35 -07:00
protocol.c ipv4: Disallow non-namespace aware protocols to register. 2013-02-05 14:42:23 -05:00
raw.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
route.c ipv4: use next hop exceptions also for input routes 2013-06-28 21:27:47 -07:00
syncookies.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-04-22 20:32:51 -04:00
sysctl_net_ipv4.c tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
tcp.c tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
tcp_bic.c tcp: fix undo after RTO for BIC 2012-01-20 14:17:26 -05:00
tcp_cong.c tcp: remove Appropriate Byte Count support 2013-02-05 14:51:16 -05:00
tcp_cubic.c tcp: fix undo after RTO for CUBIC 2012-01-20 14:17:26 -05:00
tcp_diag.c inet_diag: Rename inet_diag_req into inet_diag_req_v2 2012-01-11 12:56:06 -08:00
tcp_fastopen.c tcp: TCP Fast Open Server - header & support functions 2012-08-31 20:02:18 -04:00
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c tcp: bool conversions 2012-05-17 14:59:59 -04:00
tcp_illinois.c net: fix divide by zero in tcp algorithm illinois 2012-11-01 11:55:59 -04:00
tcp_input.c tcp: use RTT from SACK for RTO 2013-07-22 17:53:42 -07:00
tcp_ipv4.c tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
tcp_lp.c Fix common misspellings 2011-03-31 11:26:23 -03:00
tcp_memcontrol.c net: tcp_memcontrol: minor: remove unused variable 2013-04-14 15:41:49 -04:00
tcp_metrics.c tcp: do not expire TCP fastopen cookies 2013-05-05 16:58:02 -04:00
tcp_minisocks.c tcp: consolidate SYNACK RTT sampling 2013-07-22 17:53:42 -07:00
tcp_offload.c net: tcp: move GRO/GSO functions to tcp_offload 2013-06-07 14:39:05 -07:00
tcp_output.c tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
tcp_probe.c net: proc: change proc_net_remove to remove_proc_entry 2013-02-18 14:53:08 -05:00
tcp_scalable.c
tcp_timer.c tcp: refactor F-RTO 2013-03-21 11:47:50 -04:00
tcp_vegas.c
tcp_vegas.h
tcp_veno.c
tcp_westwood.c tcp: refactor F-RTO 2013-03-21 11:47:50 -04:00
tcp_yeah.c Fix common misspellings 2011-03-31 11:26:23 -03:00
tunnel4.c net: Convert printks to pr_<level> 2012-03-11 23:42:51 -07:00
udp.c gso: Update tunnel segmentation to support Tx checksum offload 2013-07-11 12:18:49 -07:00
udp_diag.c netlink: rename ssk to sk in struct netlink_skb_params 2013-04-19 14:57:56 -04:00
udp_impl.h ipv4: fix checkpatch errors 2012-04-15 12:37:19 -04:00
udp_offload.c net: udp4: move GSO functions to udp_offload 2013-06-12 00:47:25 -07:00
udplite.c net: ipv4: Standardize prefixes for message logging 2012-03-12 17:05:21 -07:00
xfrm4_input.c net: Add skb_unclone() helper function. 2013-02-15 15:10:37 -05:00
xfrm4_mode_beet.c ipsec: be careful of non existing mac headers 2012-02-23 16:50:45 -05:00
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c xfrm: allow to avoid copying DSCP during encapsulation 2013-03-06 07:02:45 +01:00
xfrm4_output.c xfrm4: Don't call icmp_send on local error 2011-07-01 17:33:19 -07:00
xfrm4_policy.c xfrm: make gc_thresh configurable in all namespaces 2013-02-06 11:36:29 +01:00
xfrm4_state.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
xfrm4_tunnel.c sit: add IPv4 over IPv4 support 2013-05-31 17:19:05 -07:00