OpenCloudOS-Kernel/net
Eric Dumazet 218af599fa tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:43:31 -04:00
..
6lowpan 6lowpan: Don't set IFF_NO_QUEUE 2017-04-12 22:02:40 +02:00
9p xen: fixes and featrues for 4.12 2017-05-04 11:37:09 -07:00
802 Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
8021q vlan: Keep NETIF_F_HW_CSUM similar to other software devices 2017-05-08 14:39:19 -04:00
appletalk lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
atm neighbour: fix nlmsg_pid in notifications 2017-03-22 10:48:49 -07:00
ax25 net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
batman-adv This feature/cleanup patchset includes the following patches: 2017-04-06 14:37:50 -07:00
bluetooth net: socket: mark socket protocol handler structs as const 2017-05-16 11:54:07 -04:00
bpf bpf: Align packet data properly in program testing framework. 2017-05-02 11:46:28 -04:00
bridge bridge: netlink: account for IFLA_BRPORT_{B, M}CAST_FLOOD size and policy 2017-05-05 11:21:03 -04:00
caif net: socket: mark socket protocol handler structs as const 2017-05-16 11:54:07 -04:00
can can: fix CAN BCM build with CONFIG_PROC_FS disabled 2017-04-27 09:34:13 +02:00
ceph The two main items are support for disabling automatic rbd exclusive 2017-05-10 08:42:33 -07:00
core tcp: internal implementation for pacing 2017-05-16 15:43:31 -04:00
dcb net: rtnetlink: plumb extended ack to doit function 2017-04-17 15:35:38 -04:00
dccp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-15 15:50:49 -07:00
decnet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-09 15:42:31 -07:00
dns_resolver Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
dsa net: dsa: Remove redundant NULL dst check 2017-04-21 10:41:24 -04:00
ethernet Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2017-02-16 21:25:49 -05:00
hsr netlink: extended ACK reporting 2017-04-13 13:58:20 -04:00
ieee802154 netlink: pass extended ACK struct where available 2017-04-13 13:58:22 -04:00
ife net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
ipv4 tcp: internal implementation for pacing 2017-05-16 15:43:31 -04:00
ipv6 udp: use a separate rx queue for packet reception 2017-05-16 15:41:29 -04:00
ipx ipx: call ipxitf_put() in ioctl error path 2017-05-02 15:34:53 -04:00
irda net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
iucv net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
kcm net: socket: mark socket protocol handler structs as const 2017-05-16 11:54:07 -04:00
key Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-04-21 20:23:53 -07:00
l2tp l2tp: remove useless device duplication test in l2tp_eth_create() 2017-04-27 16:32:13 -04:00
l3mdev
lapb Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
llc Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu 2017-04-23 11:12:44 +02:00
mac80211 mac80211: fix IBSS presp allocation size 2017-05-08 11:25:04 +02:00
mac802154 drivers: add explicit interrupt.h includes 2017-03-30 11:05:34 -07:00
mpls treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00
ncsi
netfilter Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-05-10 10:30:46 -07:00
netlabel netlink: pass extended ACK struct to parsing functions 2017-04-13 13:58:22 -04:00
netlink netlink: pass extended ACK struct where available 2017-04-13 13:58:22 -04:00
netrom net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
nfc net: socket: mark socket protocol handler structs as const 2017-05-16 11:54:07 -04:00
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf 2017-05-03 10:11:26 -04:00
packet net/packet: fix missing net_device reference release 2017-05-15 14:22:12 -04:00
phonet net: rtnetlink: plumb extended ack to doit function 2017-04-17 15:35:38 -04:00
psample net: Introduce psample, a new genetlink channel for packet sampling 2017-01-24 13:44:28 -05:00
qrtr Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-04-21 20:23:53 -07:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-05-02 16:40:27 -07:00
rfkill rfkill: remove rfkill-regulator 2017-01-24 11:07:35 +01:00
rose net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
rxrpc rxrpc: Trace client call connection 2017-04-06 11:10:41 +01:00
sched tcp: internal implementation for pacing 2017-05-16 15:43:31 -04:00
sctp sctp: fix src address selection if using secondary addresses for ipv6 2017-05-12 10:50:32 -04:00
smc Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-05-10 10:30:46 -07:00
strparser strparser: destroy workqueue on module exit 2017-03-03 20:43:26 -08:00
sunrpc Another RDMA update from Chuck Lever, and a bunch of miscellaneous 2017-05-10 13:29:23 -07:00
switchdev netlink: pass extended ACK struct to parsing functions 2017-04-13 13:58:22 -04:00
tipc tipc: make macro tipc_wait_for_cond() smp safe 2017-05-11 22:19:30 -04:00
unix af_unix: Use designated initializers 2017-04-06 12:43:04 -07:00
vmw_vsock virtio: fixes, cleanups, performance 2017-05-10 11:33:08 -07:00
wimax
wireless nl80211: correctly validate MU-MIMO groups 2017-05-08 11:24:34 +02:00
x25 net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-05-02 16:40:27 -07:00
Kconfig bpf: make jited programs visible in traces 2017-02-17 13:40:05 -05:00
Makefile bpf: introduce BPF_PROG_TEST_RUN command 2017-04-01 12:45:57 -07:00
compat.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-02-22 10:15:09 -08:00
socket.c l2tp: device MTU setup, tunnel socket needs a lock 2017-04-17 13:01:48 -04:00
sysctl_net.c sysctl: Remove dead register_sysctl_root 2017-04-16 23:42:49 -05:00