This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are i.e.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:
W := (1 - (alpha / 2)) * W
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. Resulting,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
CUBIC DCTCP
Total 3227 25
Mean 169.842 1.316
Median 183 1
Max 207 5
Min 123 0
Stddev 28.991 1.600
Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are drawfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
CUBIC DCTCP
Mean 521.684 521.895
Median 464 523
Max 776 527
Min 403 519
Stddev 105.891 2.601
Fairness 0.962 0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.
3) Latency (in ms):
CUBIC DCTCP
Mean 4.0088 0.04219
Median 4.055 0.0395
Max 4.2 0.085
Min 3.32 0.028
Stddev 0.1666 0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
CUBIC DCTCP
Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2
0 9.93 0 0 9.92 0
0.5 9.87 0 0.5 9.86 0
1 8.73 2.25 1 6.46 4.88
1.5 7.29 2.8 1.5 4.9 4.99
2 6.96 3.1 2 4.92 4.94
2.5 6.67 3.34 2.5 4.93 5
3 6.39 3.57 3 4.92 4.99
3.5 6.24 3.75 3.5 4.94 4.74
4 6 3.94 4 5.34 4.71
4.5 5.88 4.09 4.5 4.99 4.97
5 5.27 4.98 5 4.83 5.01
5.5 4.93 5.04 5.5 4.89 4.99
6 4.9 4.99 6 4.92 5.04
6.5 4.93 5.1 6.5 4.91 4.97
7 4.28 5.8 7 4.97 4.97
7.5 4.62 4.91 7.5 4.99 4.82
8 5.05 4.45 8 5.16 4.76
8.5 5.93 4.09 8.5 4.94 4.98
9 5.73 4.2 9 4.92 5.02
9.5 5.62 4.32 9.5 4.87 5.03
10 6.12 3.2 10 4.91 5.01
10.5 6.91 3.11 10.5 4.87 5.04
11 8.48 0 11 8.49 4.94
11.5 9.87 0 11.5 9.9 0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
Competing Flows 1 | 2 | 4 | 8 | 16
------------------------------
Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0
Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
More details however, we refer you to the paper [2] under section 3).
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and DCTCP algorithm
operates on mechanisms that are already available in most Silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.
In case DCTCP is being used and ECN support on peer site is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200
... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200
More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sb1000 is a module network device driver for the General Instrument (also known
as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card
which is used by a number of cable TV companies to provide cable modem access.
It's a one-way downstream-only cable modem, meaning that your upstream net link
is provided by your regular phone modem.
This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves
a great deal of thanks for this wonderful piece of code!
-----------------------------------------------------------------------------
Support for this device is now a part of the standard Linux kernel. The
driver source code file is drivers/net/sb1000.c. In addition to this
you will need:
1.) The "cmconfig" program. This is a utility which supplements "ifconfig"
to configure the cable modem and network interface (usually called "cm0");
and
2.) Several PPP scripts which live in /etc/ppp to make connecting via your
cable modem easy.
These utilities can be obtained from:
http://www.jacksonville.net/~fventuri/
in Franco's original source code distribution .tar.gz file. Support for
the sb1000 driver can be found at:
http://web.archive.org/web/*/http://home.adelphia.net/~siglercm/sb1000.html
http://web.archive.org/web/*/http://linuxpower.cx/~cable/
along with these utilities.
3.) The standard isapnp tools. These are necessary to configure your SB1000
card at boot time (or afterwards by hand) since it's a PnP card.
If you don't have these installed as a standard part of your Linux
distribution, you can find them at:
http://www.roestock.demon.co.uk/isapnptools/
or check your Linux distribution binary CD or their web site. For help with
isapnp, pnpdump, or /etc/isapnp.conf, go to:
http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html
-----------------------------------------------------------------------------
To make the SB1000 card work, follow these steps:
1.) Run `make config', or `make menuconfig', or `make xconfig', whichever
you prefer, in the top kernel tree directory to set up your kernel
configuration. Make sure to say "Y" to "Prompt for development drivers"
and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard
networking questions to get TCP/IP and PPP networking support.
2.) *BEFORE* you build the kernel, edit drivers/net/sb1000.c. Make sure
to redefine the value of READ_DATA_PORT to match the I/O address used
by isapnp to access your PnP cards. This is the value of READPORT in
/etc/isapnp.conf or given by the output of pnpdump.
3.) Build and install the kernel and modules as usual.
4.) Boot your new kernel following the usual procedures.
5.) Set up to configure the new SB1000 PnP card by capturing the output
of "pnpdump" to a file and editing this file to set the correct I/O ports,
IRQ, and DMA settings for all your PnP cards. Make sure none of the settings
conflict with one another. Then test this configuration by running the
"isapnp" command with your new config file as the input. Check for
errors and fix as necessary. (As an aside, I use I/O ports 0x110 and
0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.)
Then save the finished config file as /etc/isapnp.conf for proper configuration
on subsequent reboots.
6.) Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of
the others referenced above. As root, unpack it into a temporary directory and
do a `make cmconfig' and then `install -c cmconfig /usr/local/sbin'. Don't do
`make install' because it expects to find all the utilities built and ready for
installation, not just cmconfig.
7.) As root, copy all the files under the ppp/ subdirectory in Franco's
tar file into /etc/ppp, being careful not to overwrite any files that are
already in there. Then modify ppp@gi-on to set the correct login name,
phone number, and frequency for the cable modem. Also edit pap-secrets
to specify your login name and password and any site-specific information
you need.
8.) Be sure to modify /etc/ppp/firewall to use ipchains instead of
the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to
convert ipfwadm commands to ipchains commands:
http://users.dhp.com/~whisper/ipfwadm2ipchains/
You may also wish to modify the firewall script to implement a different
firewalling scheme.
9.) Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be
root to do this. It's better to use a utility like sudo to execute
frequently used commands like this with root permissions if possible. If you
connect successfully the cable modem interface will come up and you'll see a
driver message like this at the console:
cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)
The "ifconfig" command should show two new interfaces, ppp0 and cm0.
The command "cmconfig cm0" will give you information about the cable modem
interface.
10.) Try pinging a site via `ping -c 5 www.yahoo.com', for example. You should
see packets received.
11.) If you can't get site names (like www.yahoo.com) to resolve into
IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file
has no syntax errors and has the right nameserver IP addresses in it.
If this doesn't help, try something like `ping -c 5 204.71.200.67' to
see if the networking is running but the DNS resolution is where the
problem lies.
12.) If you still have problems, go to the support web sites mentioned above
and read the information and documentation there.
-----------------------------------------------------------------------------
Common problems:
1.) Packets go out on the ppp0 interface but don't come back on the cm0
interface. It looks like I'm connected but I can't even ping any
numerical IP addresses. (This happens predominantly on Debian systems due
to a default boot-time configuration script.)
Solution -- As root `echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter' so it
can share the same IP address as the ppp0 interface. Note that this
command should probably be added to the /etc/ppp/cablemodem script
*right*between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands.
You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well.
If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot
(in rc.local or some such) then any interfaces can share the same IP
addresses.
2.) I get "unresolved symbol" error messages on executing `insmod sb1000.o'.
Solution -- You probably have a non-matching kernel source tree and
/usr/include/linux and /usr/include/asm header files. Make sure you
install the correct versions of the header files in these two directories.
Then rebuild and reinstall the kernel.
3.) When isapnp runs it reports an error, and my SB1000 card isn't working.
Solution -- There's a problem with later versions of isapnp using the "(CHECK)"
option in the lines that allocate the two I/O addresses for the SB1000 card.
This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses.
Make sure they don't conflict with any other pieces of hardware first! Then
rerun isapnp and go from there.
4.) I can't execute the /etc/ppp/ppp@gi-on file.
Solution -- As root do `chmod ug+x /etc/ppp/ppp@gi-on'.
5.) The firewall script isn't working (with 2.2.x and higher kernels).
Solution -- Use the ipfwadm2ipchains script referenced above to convert the
/etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.
6.) I'm getting *tons* of firewall deny messages in the /var/kern.log,
/var/messages, and/or /var/syslog files, and they're filling up my /var
partition!!!
Solution -- First, tell your ISP that you're receiving DoS (Denial of Service)
and/or portscanning (UDP connection attempts) attacks! Look over the deny
messages to figure out what the attack is and where it's coming from. Next,
edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on
to the "cmconfig" command (uncomment that line). If you're not receiving these
denied packets on your broadcast interface (IP address xxx.yyy.zzz.255
typically), then someone is attacking your machine in particular. Be careful
out there....
7.) Everything seems to work fine but my computer locks up after a while
(and typically during a lengthy download through the cable modem)!
Solution -- You may need to add a short delay in the driver to 'slow down' the
SURFboard because your PC might not be able to keep up with the transfer rate
of the SB1000. To do this, it's probably best to download Franco's
sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll
want to edit the 'Makefile' and look for the 'SB1000_DELAY'
define. Uncomment those 'CFLAGS' lines (and comment out the default ones)
and try setting the delay to something like 60 microseconds with:
'-DSB1000_DELAY=60'. Then do `make' and as root `make install' and try
it out. If it still doesn't work or you like playing with the driver, you may
try other numbers. Remember though that the higher the delay, the slower the
driver (which slows down the rest of the PC too when it is actively
used). Thanks to Ed Daiga for this tip!
-----------------------------------------------------------------------------
Credits: This README came from Franco Venturi's original README file which is
still supplied with his driver .tar.gz archive. I and all other sb1000 users
owe Franco a tremendous "Thank you!" Additional thanks goes to Carl Patten
and Ralph Bonnell who are now managing the Linux SB1000 web site, and to
the SB1000 users who reported and helped debug the common problems listed
above.
Clemmitt Sigler
csigler@vt.edu