Adds support for fq's Earliest Departure Time to HBM (Host Bandwidth
Manager). Includes a new BPF program supporting EDT, and also updates
corresponding programs.
It drops packets with an EDT of more than 500us in the future, unless
the packet belongs to a flow with fewer than 2 packets in flight.
This ensures each flow keeps at least 2 packets in flight, so flows
do not starve, and it also helps prevent delayed ACK timeouts.
It also works with ECN-enabled traffic, where packets are CE marked
if their EDT is more than 50us in the future.
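
The drop and mark rules above can be sketched as a small decision
function. This is a simplified Python model, not the BPF program itself:
the function and constant names are hypothetical, and checking the drop
rule before the mark rule is an assumption. Only the 500us and 50us
thresholds and the 2-packet rule come from the description.

```python
# Simplified model of the EDT policy described above.
# All names are hypothetical; thresholds are taken from the text.
DROP_THRESH_NS = 500_000   # drop when EDT is more than 500us in the future
MARK_THRESH_NS = 50_000    # CE mark when EDT is more than 50us in the future
MIN_PKTS_IN_FLIGHT = 2     # flows below this are never dropped


def edt_verdict(edt_ns, now_ns, pkts_in_flight, ecn_capable):
    """Return "drop", "mark" or "pass" for a single packet."""
    delta_ns = edt_ns - now_ns  # how far in the future the packet departs
    if delta_ns > DROP_THRESH_NS and pkts_in_flight >= MIN_PKTS_IN_FLIGHT:
        return "drop"
    # Assumption: marking is only considered when the packet is not dropped.
    if ecn_capable and delta_ns > MARK_THRESH_NS:
        return "mark"
    return "pass"


# A non-ECN flow with 3 packets in flight and an EDT 600us ahead.
print(edt_verdict(edt_ns=600_000, now_ns=0, pkts_in_flight=3,
                  ecn_capable=False))
```

In this model, a flow with only 1 packet in flight is exempt from drops
regardless of how far ahead its EDT is, which mirrors the starvation
argument above.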
The table below shows some performance numbers. The flows are
back-to-back RPCs, with one server sending to another, using either 2
or 4 flows. One flow is a 10KB RPC, the rest are 1MB RPCs. When there
is more than one flow of a given RPC size, the numbers represent averages.
The rate limit applies to all flows (they are in the same cgroup).
Tests ending with "-edt" ran with the new BPF program supporting EDT.
Tests ending with "-htb" ran on top of the HTB qdisc with the specified
rate (i.e. no HBM). The other tests ran with the HBM BPF program included
in the HBM patch-set.
EDT has limited value when using DCTCP, but it helps in many cases when
using Cubic. It usually achieves higher link utilization and lower
99th percentile latencies for the 1MB RPCs.
HTB ends up queueing a lot of packets with its default parameter values,
reducing the goodput of the 10KB RPCs and increasing their latency. Also,
the RTTs seen by the flows are quite large.

                       Aggr               10K   10K   10K   1MB   1MB   1MB
          Limit        rate drops   RTT  rate   P90   P99  rate   P90   P99
Test       rate Flows  Mbps     %    us  Mbps    us    us  Mbps    ms    ms
--------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
cubic        1G     2   904  0.02   108   257   511   539   647  13.4  24.5
cubic-edt    1G     2   982  0.01   156   239   656   967   743  14.0  17.2
dctcp        1G     2   977  0.00   105   324   408   744   653  14.5  15.9
dctcp-edt    1G     2   981  0.01   142   321   417   811   660  15.7  17.0
cubic-htb    1G     2   919  0.00  1825    40  2822  4140   879   9.7   9.9

cubic      200M     2   155  0.30   220    81   532   655    74   283   450
cubic-edt  200M     2   188  0.02   222    87  1035  1095   101    84    85
dctcp      200M     2   188  0.03   111    77   912   939   111    76   325
dctcp-edt  200M     2   188  0.03   217    74  1416  1738   114    76    79
cubic-htb  200M     2   188  0.00  5015     8  14ms  15ms   180    48    50

cubic        1G     4   952  0.03   110   165   516   546   262    38   154
cubic-edt    1G     4   973  0.01   190   111  1034  1314   287    65    79
dctcp        1G     4   951  0.00   103   180   617   905   257    37    38
dctcp-edt    1G     4   967  0.00   163   151   732  1126   272    43    55
cubic-htb    1G     4   914  0.00  3249    13   7ms   8ms   300    29    34

cubic        5G     4  4236  0.00   134   305   490   624  1310    10    17
cubic-edt    5G     4  4865  0.00   156   306   425   759  1520    10    16
dctcp        5G     4  4936  0.00   128   485   221   409  1484     7     9
dctcp-edt    5G     4  4924  0.00   148   390   392   623  1508    11    26

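
To make the link utilization comparison concrete, the aggregate rates in
the table above can be converted into a percentage of the rate limit.
This is a minimal Python sketch using the 1G, 2-flow rows; it assumes
that "1G" means 1000 Mbps, and the helper name is hypothetical.

```python
# Aggregate rate (Mbps) for the 1G, 2-flow rows of the table above.
LIMIT_MBPS = 1000  # assumption: "1G" is taken to mean 1000 Mbps
AGGR_RATE_MBPS = {
    "cubic": 904,
    "cubic-edt": 982,
    "dctcp": 977,
    "dctcp-edt": 981,
}


def utilization_pct(rate_mbps, limit_mbps):
    """Aggregate rate as a percentage of the configured rate limit."""
    return 100.0 * rate_mbps / limit_mbps


for test, rate in AGGR_RATE_MBPS.items():
    print(f"{test:10s} {utilization_pct(rate, LIMIT_MBPS):5.1f}%")
```
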
v1 -> v2: Incorporated Andrii's suggestions
v2 -> v3: Incorporated Yonghong's suggestions
v3 -> v4: Removed credit update that is not needed
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Update hbm_out_kern.c to support returning congestion notifications (cn).
Also update relevant files to allow disabling congestion notifications.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Script for testing HBM (Host Bandwidth Manager) framework.
It creates a cgroup to use for testing and loads a BPF program to limit
egress bandwidth. It then uses iperf3 or netperf to create
loads. The output is the goodput in Mbps (unless -D is used).
It can work on a single host using loopback or between two hosts (with
netperf).
When using loopback, it is recommended to also introduce a delay of at least
1ms (-d=1), otherwise the assigned bandwidth is likely to be underutilized.
USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]
             [-d=<delay>|--delay=<delay>] [--debug] [-E]
             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id>] [-l]
             [-N] [-p=<port>|--port=<port>] [-P] [-q=<qdisc>]
             [-r=<rate>|--rate=<rate>] [-R] [-s=<server>|--server=<server>]
             [--stats] [-t=<time>|--time=<time>] [-w] [cubic|dctcp]
Where:
  out               Egress (default egress)
  -b or --bpf       BPF program filename to load and attach.
                    Default is hbm_out_kern.o for egress.
  -c or --cc        TCP congestion control (cubic or dctcp)
  -d or --delay     Add a delay in ms using netem
  -D                In addition to the goodput in Mbps, it also outputs
                    other detailed information. This information is
                    test dependent (i.e. iperf3 or netperf).
  --debug           Print BPF trace buffer
  -E                Enable ECN (not required for dctcp)
  -f or --flows     Number of concurrent flows (default=1)
  -i or --id        cgroup id (an integer, default is 1)
  -l                Also limit flows using loopback
  -N                Use netperf instead of iperf3
  -h                Help
  -p or --port      iperf3 port (default is 5201)
  -P                Use an iperf3 instance for each flow
  -q                Use the specified qdisc.
  -r or --rate      Rate in Mbps (default is 1Gbps)
  -R                Use TCP_RR for netperf. 1st flow has req
                    size of 10KB, rest of 1MB. Reply in all
                    cases is 1 byte.
                    More detailed output for each flow can be found
                    in the files netperf.<cg>.<flow>, where <cg> is the
                    cgroup id as specified with the -i flag, and <flow>
                    is the flow id starting at 1 and increasing by 1 for
                    each flow (as specified by -f).
  -s or --server    Hostname of netperf server. Used to create netperf
                    test traffic between two hosts (default is within
                    host). netserver must be running on the host.
  --stats           Get HBM stats (marked, dropped, etc.)
  -t or --time      Duration of iperf3 in seconds (default=5)
  -w                Work conserving flag. cgroup can increase its
                    bandwidth beyond the rate limit specified
                    while there is available bandwidth. Current
                    implementation assumes there is only one NIC
                    (eth0), but can be extended to support multiple
                    NICs. This is just a proof of concept.
  cubic or dctcp    Specify TCP CC to use
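
As an illustration of the netperf.<cg>.<flow> naming described for -R,
the per-flow output files for a given run could be enumerated as in the
Python sketch below. The helper name is hypothetical and the sketch only
restates the naming rule from the option description; it does not read
or create any files.

```python
def netperf_output_files(cgroup_id, flows):
    """List the per-flow file names netperf.<cg>.<flow>.

    cgroup_id corresponds to -i and flows to -f; flow ids start at 1.
    """
    return [f"netperf.{cgroup_id}.{flow}" for flow in range(1, flows + 1)]


# File names expected for cgroup id 1 with 4 flows (-i=1 -f=4).
for name in netperf_output_files(cgroup_id=1, flows=4):
    print(name)
```
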
Examples:
./do_hbm_test.sh -l -d=1 -D --stats
Runs a 5 second test with a single iperf3 flow on the loopback device,
using the default rate limit of 1Gbps, a delay of 1ms (using netem) and
the default TCP congestion control. The "-l" flag enforces the bandwidth
limit on the loopback device. Since no direction is specified, it
defaults to egress. Since no TCP CC algorithm is specified, it uses the
system default (Cubic for this test).
With no -D flag, only the value of AGGREGATE_GOODPUT would show.
id refers to the cgroup id and is useful when running multi cgroup
tests (supported by a future patch).
This patchset does not support calling TCP's congestion window
reduction, even when packets are dropped by the BPF program, resulting
in a large number of dropped packets. It is recommended that the current
HBM implementation only be used with ECN enabled flows. A future patch
will add support for reducing TCP's cwnd and will increase the
performance of non-ECN enabled flows.
Output:
Details for HBM in cgroup 1
id:1
rate_mbps:493
duration:4.8 secs
packets:11355
bytes_MB:590
pkts_dropped:4497
bytes_dropped_MB:292
pkts_marked_percent: 39.60
bytes_marked_percent: 49.49
pkts_dropped_percent: 39.60
bytes_dropped_percent: 49.49
PING AVG DELAY:2.075
AGGREGATE_GOODPUT:505
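
The key:value lines in the output above are easy to post-process. The
Python sketch below parses a subset of that output and recomputes the
drop percentages from the raw counters; the function names are
hypothetical, and because the byte counters are printed in whole MB,
the recomputed byte percentage is only an approximation of what the
tool computes internally.

```python
# Subset of the "-D --stats" output shown above (first example).
SAMPLE = """\
Details for HBM in cgroup 1
id:1
rate_mbps:493
duration:4.8 secs
packets:11355
bytes_MB:590
pkts_dropped:4497
bytes_dropped_MB:292
pkts_dropped_percent: 39.60
bytes_dropped_percent: 49.49
"""


def parse_stats(text):
    """Collect "key:value" lines into a dict; lines without ':' are skipped."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def dropped_percent(stats):
    """Recompute (packet, byte) drop percentages from the raw counters."""
    pkts = 100.0 * int(stats["pkts_dropped"]) / int(stats["packets"])
    data = 100.0 * int(stats["bytes_dropped_MB"]) / int(stats["bytes_MB"])
    return round(pkts, 2), round(data, 2)


stats = parse_stats(SAMPLE)
print(dropped_percent(stats))
```
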
./do_hbm_test.sh -l -d=1 -D --stats dctcp
Same as above but using dctcp. Note that fewer bytes are dropped
(0.01% vs. 49%).
Output:
Details for HBM in cgroup 1
id:1
rate_mbps:945
duration:4.9 secs
packets:16859
bytes_MB:578
pkts_dropped:1
bytes_dropped_MB:0
pkts_marked_percent: 28.74
bytes_marked_percent: 45.15
pkts_dropped_percent: 0.01
bytes_dropped_percent: 0.01
PING AVG DELAY:2.083
AGGREGATE_GOODPUT:965
./do_hbm_test.sh -d=1 -D --stats
As first example, but without limiting loopback device (i.e. no
"-l" flag). Since there is no bandwidth limiting, no details for
HBM are printed out.
Output:
Details for HBM in cgroup 1
PING AVG DELAY:2.019
AGGREGATE_GOODPUT:42655
./do_hbm_test.sh -l -d=1 -D --stats -f=2
Uses iperf3 and does 2 flows
./do_hbm_test.sh -l -d=1 -D --stats -f=4 -P
Uses iperf3 and does 4 flows, each flow as a separate process.
./do_hbm_test.sh -l -d=1 -D --stats -f=4 -N
Uses netperf, 4 flows
./do_hbm_test.sh -f=1 -r=2000 -t=5 -N -D --stats dctcp -s=<server-name>
Uses netperf between two hosts. The remote host name is specified
with -s= and you need to start the program netserver manually on
the remote host. It will use 1 flow, a rate limit of 2Gbps and dctcp.
./do_hbm_test.sh -f=1 -r=2000 -t=5 -N -D --stats -w dctcp \
-s=<server-name>
As previous, but allows use of extra bandwidth. For this test the
rate is 8Gbps vs. 1Gbps of the previous test.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>