This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are i.e.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:
W := (1 - (alpha / 2)) * W
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. Resulting,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
CUBIC DCTCP
Total 3227 25
Mean 169.842 1.316
Median 183 1
Max 207 5
Min 123 0
Stddev 28.991 1.600
Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are drawfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
CUBIC DCTCP
Mean 521.684 521.895
Median 464 523
Max 776 527
Min 403 519
Stddev 105.891 2.601
Fairness 0.962 0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.
3) Latency (in ms):
CUBIC DCTCP
Mean 4.0088 0.04219
Median 4.055 0.0395
Max 4.2 0.085
Min 3.32 0.028
Stddev 0.1666 0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
CUBIC DCTCP
Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2
0 9.93 0 0 9.92 0
0.5 9.87 0 0.5 9.86 0
1 8.73 2.25 1 6.46 4.88
1.5 7.29 2.8 1.5 4.9 4.99
2 6.96 3.1 2 4.92 4.94
2.5 6.67 3.34 2.5 4.93 5
3 6.39 3.57 3 4.92 4.99
3.5 6.24 3.75 3.5 4.94 4.74
4 6 3.94 4 5.34 4.71
4.5 5.88 4.09 4.5 4.99 4.97
5 5.27 4.98 5 4.83 5.01
5.5 4.93 5.04 5.5 4.89 4.99
6 4.9 4.99 6 4.92 5.04
6.5 4.93 5.1 6.5 4.91 4.97
7 4.28 5.8 7 4.97 4.97
7.5 4.62 4.91 7.5 4.99 4.82
8 5.05 4.45 8 5.16 4.76
8.5 5.93 4.09 8.5 4.94 4.98
9 5.73 4.2 9 4.92 5.02
9.5 5.62 4.32 9.5 4.87 5.03
10 6.12 3.2 10 4.91 5.01
10.5 6.91 3.11 10.5 4.87 5.04
11 8.48 0 11 8.49 4.94
11.5 9.87 0 11.5 9.9 0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
Competing Flows 1 | 2 | 4 | 8 | 16
------------------------------
Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0
Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
More details however, we refer you to the paper [2] under section 3).
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and DCTCP algorithm
operates on mechanisms that are already available in most Silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.
In case DCTCP is being used and ECN support on peer site is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200
... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200
More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch adds all of eBPF verfier documentation and empty bpf_check()
The end goal for the verifier is to statically check safety of the program.
Verifier will catch:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
More details in Documentation/networking/filter.txt
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF syscall is a multiplexor for a range of different operations on eBPF.
This patch introduces syscall with single command to create a map.
Next patch adds commands to access maps.
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
Userspace example:
/* this syscall wrapper creates a map with given type and attributes
* and returns map_fd on success.
* use close(map_fd) to delete the map
*/
int bpf_create_map(enum bpf_map_type map_type, int key_size,
int value_size, int max_entries)
{
union bpf_attr attr = {
.map_type = map_type,
.key_size = key_size,
.value_size = value_size,
.max_entries = max_entries
};
return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
'union bpf_attr' is backwards compatible with future extensions.
More details in Documentation/networking/filter.txt and in manpage
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
protected by a lock, meaning that hosts can be stuck hard if all cpus
want to check ICMP limits.
When say a DNS or NTP server process is restarted, inetpeer tree grows
quick and machine comes to its knees.
iptables can not help because the bottleneck happens before ICMP
messages are even cooked and sent.
This patch adds a new global limitation, using a token bucket filter,
controlled by two new sysctl :
icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask are
controlled by this limit.
Default: 1000
icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
Default: 50
Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit 04ca6973f7
("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c801e3cc19 ("ipv4: Clarify in docs that accept_local requires rp_filter.").
It is not needed anymore since commit 1dced6a854 ("ipv4: Restore accept_local behaviour in fib_validate_source()").
Suggested-by: Julian Anastasov <ja@ssi.bg>
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.
x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn
Note that old eBPF programs are binary compatible with new interpreter.
It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)
User space generated programs will use this instruction to load constants only.
To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A buffer is incorrectly zeroed to the length of the pointer. If
cfg_payload_len < sizeof(void *) this can overwrites unrelated memory.
The buffer contents are never read, so no need to zero.
Fixes: 8fe2f761ca ("net-timestamp: expand documentation")
Reported-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As in IPv6 people might increase the igmp query robustness variable to
make sure unsolicited state change reports aren't lost on the network. Add
and document this new knob to igmp code.
RFCs allow tuning this parameter back to first IGMP RFC, so we also use
this setting for all counters, including source specific multicast.
Also take over sysctl value when upping the interface and don't reuse
the last one seen on the interface.
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
robustness variable. It specifies how many retransmit of unsolicited mld
retransmit should happen. Admins might want to tune this on lossy links.
Also reset mld state on interface down/up, so we pick up new sysctl
settings during interface up event.
IPv6 certification requests this knob to be available.
I didn't make this knob netns specific, as it is mostly a setting in a
physical environment and should be per host.
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expand Documentation/networking/timestamping.txt with new
interfaces and bytestream timestamping. Also minor
cleanup of the other text.
Import txtimestamp.c test of the new features.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Missing documentation for gc_thresh2 sysctl.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds newly added FCoE files to the build but only if FCoE module is configured.
Also, updates i40e document for added FCoE support.
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Jack Morgan<jack.morgan@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
trivial rename to indicate that this functions performs classic BPF checking
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the PHY library documentation to describe how a specific PHY
driver can use the PAL MMD register access routines or override those
routines with it's own in the event the PHY does not support the IEEE
standard for reading and writing MMD phy registers.
Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_TIMESTAMPING API defines three types of timestamps: software,
hardware in raw format (hwtstamp) and hardware converted to system
format (syststamp). The last has been deprecated in favor of combining
hwtstamp with a PTP clock driver. There are no active users in the
kernel.
The option was device driver dependent. If set, but without hardware
support, the correct behavior is to return zero in the relevant field
in the SCM_TIMESTAMPING ancillary message. Without device drivers
implementing the option, this field is effectively always zero.
Remove the internal plumbing to dissuage new drivers from implementing
the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
avoid breaking existing applications that request the timestamp.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No device driver will ever return an skb_shared_info structure with
syststamp non-zero, so remove the branch that tests for this and
optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE.
Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes
may refer to it.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes init_net's high_thresh limit to be the maximum for all
namespaces, thus introducing a global memory limit threshold equal to the
sum of the individual high_thresh limits which are capped.
It also introduces some sane minimums for low_thresh as it shouldn't be
able to drop below 0 (or > high_thresh in the unsigned case), and
overall low_thresh should not ever be above high_thresh, so we make the
following relations for a namespace:
init_net:
high_thresh - max(not capped), min(init_net low_thresh)
low_thresh - max(init_net high_thresh), min (0)
all other namespaces:
high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
low_thresh = max(namespace's high_thresh), min(0)
The major issue with having low_thresh > high_thresh is that we'll
schedule eviction but never evict anything and thus rely only on the
timers.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
merge functionality into the eviction workqueue.
Instead of rebuilding every n seconds, take advantage of the upper
hash chain length limit.
If we hit it, mark table for rebuild and schedule workqueue.
To prevent frequent rebuilds when we're completely overloaded,
don't rebuild more than once every 5 seconds.
ipfrag_secret_interval sysctl is now obsolete and has been marked as
deprecated, it still can be changed so scripts won't be broken but it
won't have any effect. A comment is left above each unused secret_timer
variable to avoid confusion.
Joint work with Nikolay Aleksandrov.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the high_thresh limit is reached we try to toss the 'oldest'
incomplete fragment queues until memory limits are below the low_thresh
value. This happens in softirq/packet processing context.
This has two drawbacks:
1) processors might evict a queue that was about to be completed
by another cpu, because they will compete wrt. resource usage and
resource reclaim.
2) LRU list maintenance is expensive.
But when constantly overloaded, even the 'least recently used' element is
recent, so removing 'lru' queue first is not 'fairer' than removing any
other fragment queue.
This moves eviction out of the fast path:
When the low threshold is reached, a work queue is scheduled
which then iterates over the table and removes the queues that exceed
the memory limits of the namespace. It sets a new flag called
INET_FRAG_EVICTED on the evicted queues so the proper counters will get
incremented when the queue is forcefully expired.
When the high threshold is reached, no more fragment queues are
created until we're below the limit again.
The LRU list is now unused and will be removed in a followup patch.
Joint work with Nikolay Aleksandrov.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Document the Layer 2 hash factors with packet type ID field.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
CC: Pan Jiafei <Jiafei.Pan@freescale.com>
Signed-off-by: Jianhua Xie <jianhua.xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_TIMESTAMPING API defines option SOF_TIMESTAMPING_SYS_HW.
This feature is deprecated. It should not be implemented by new
device drivers. Existing drivers do not implement it, either --
with one exception.
Driver developers are encouraged to expose the NIC hw clock as a
PTP HW clock source, instead, and synchronize system time to the
HW source.
The control flag cannot be removed due to being part of the ABI, nor
can the structure scm_timestamping that is returned. Due to the one
legacy driver, the internal datapath and structure are not removed.
This patch only clearly marks the interface as deprecated. Device
drivers should always return a syststamp value of zero.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
We can consider adding a WARN_ON_ONCE in__sock_recv_timestamp
if non-zero syststamp is encountered
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Automatically generate flow labels for IPv6 packets on transmit.
The flow label is computed based on skb_get_hash. The flow label will
only automatically be set when it is zero otherwise (i.e. flow label
manager hasn't set one). This supports the transmit side functionality
of RFC 6438.
Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
functionality per socket.
By default, auto flowlabels are disabled to avoid possible conflicts
with flow label manager, however if this feature proves useful we
may want to enable it by default.
It should also be noted that FreeBSD has already implemented automatic
flow labels (including the sysctl and socket option). In FreeBSD,
automatic flow labels default to enabled.
Performance impact:
Running super_netperf with 200 flows for TCP_RR and UDP_RR for
IPv6. Note that in UDP case, __skb_get_hash will be called for
every packet with explains slight regression. In the TCP case
the hash is saved in the socket so there is no regression.
Automatic flow labels disabled:
TCP_RR:
86.53% CPU utilization
127/195/322 90/95/99% latencies
1.40498e+06 tps
UDP_RR:
90.70% CPU utilization
118/168/243 90/95/99% latencies
1.50309e+06 tps
Automatic flow labels enabled:
TCP_RR:
85.90% CPU utilization
128/199/337 90/95/99% latencies
1.40051e+06
UDP_RR
92.61% CPU utilization
115/164/236 90/95/99% latencies
1.4687e+06
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
running full. Thus, the TX ring is artificially limiting pktgen.
(Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
counters.)
Using ixgbe, the real reason behind the TX ring running full, is due
to TX ring not being cleaned up fast enough. The ixgbe driver combines
TX+RX ring cleanups, and the cleanup interval is affected by the
ethtool --coalesce setting of parameter "rx-usecs".
Do not increase the default NIC TX ring buffer or default cleanup
interval. Instead simply document that pktgen needs special NIC
tuning for maximum packet per sec performance.
Performance results with pktgen with clone_skb=100000.
TX ring size 512 (default), adjusting "rx-usecs":
(Single CPU performance, E5-2630, ixgbe)
- 3935002 pps - rx-usecs: 1 (irqs: 9346)
- 5132350 pps - rx-usecs: 10 (irqs: 99157)
- 5375111 pps - rx-usecs: 20 (irqs: 50154)
- 5454050 pps - rx-usecs: 30 (irqs: 33872)
- 5496320 pps - rx-usecs: 40 (irqs: 26197)
- 5502510 pps - rx-usecs: 50 (irqs: 21527)
TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
- 3935002 pps - tx-size: 512
- 5354401 pps - tx-size: 768
- 5356847 pps - tx-size: 1024
- 5327595 pps - tx-size: 1536
- 5356779 pps - tx-size: 2048
- 5353438 pps - tx-size: 4096
Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag. This allow us to put
more pressure on the TX ring buffers.
It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respecting this in the call to netif_xmit_frozen_or_drv_stopped(txq).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This can be used in virtual networking applications, and
may have other uses as well. The option is disabled by
default.
A specific use case is setting up virtual routers, bridges, and
hosts on a single OS without the use of network namespaces or
virtual machines. With proper use of ip rules, routing tables,
veth interface pairs and/or other virtual interfaces,
and applications that can bind to interfaces and/or IP addresses,
it is possibly to create one or more virtual routers with multiple
hosts attached. The host interfaces can act as IPv6 systems,
with radvd running on the ports in the virtual routers. With the
option provided in this patch enabled, those hosts can now properly
obtain IPv6 addresses from the radvd.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
This patch adds a description of eBPFs instruction encoding in order
to bring the documentation in line with the implementation.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the term eBPF is used anyway on mailing list discussions, lets
also document that in the main BPF documentation file and replace a
couple of occurrences with eBPF terminology to be more clear.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The macro 'A' used in internal BPF interpreter:
#define A regs[insn->a_reg]
was easily confused with the name of classic BPF register 'A', since
'A' would mean two different things depending on context.
This patch is trying to clean up the naming and clarify its usage in the
following way:
- A and X are names of two classic BPF registers
- BPF_REG_A denotes internal BPF register R0 used to map classic register A
in internal BPF programs generated from classic
- BPF_REG_X denotes internal BPF register R7 used to map classic register X
in internal BPF programs generated from classic
- internal BPF instruction format:
struct sock_filter_int {
__u8 code; /* opcode */
__u8 dst_reg:4; /* dest register */
__u8 src_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};
- BPF_X/BPF_K is 1 bit used to encode source operand of instruction
In classic:
BPF_X - means use register X as source operand
BPF_K - means use 32-bit immediate as source operand
In internal:
BPF_X - means use 'src_reg' register as source operand
BPF_K - means use 32-bit immediate as source operand
Suggested-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull trivial tree changes from Jiri Kosina:
"Usual pile of patches from trivial tree that make the world go round"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
staging: go7007: remove reference to CONFIG_KMOD
aic7xxx: Remove obsolete preprocessor define
of: dma: doc fixes
doc: fix incorrect formula to calculate CommitLimit value
doc: Note need of bc in the kernel build from 3.10 onwards
mm: Fix printk typo in dmapool.c
modpost: Fix comment typo "Modules.symvers"
Kconfig.debug: Grammar s/addition/additional/
wimax: Spelling s/than/that/, wording s/destinatary/recipient/
aic7xxx: Spelling s/termnation/termination/
arm64: mm: Remove superfluous "the" in comment
of: Spelling s/anonymouns/anonymous/
dma: imx-sdma: Spelling s/determnine/determine/
ath10k: Improve grammar in comments
ath6kl: Spelling s/determnine/determine/
of: Improve grammar for of_alias_get_id() documentation
drm/exynos: Spelling s/contro/control/
radio-bcm2048.c: fix wrong overflow check
doc: printk-formats: do not mention casts for u64/s64
doc: spelling error changes
...
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Mention the recently added test suite in the documentation file.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 61b905da33 ("net: Rename skb->rxhash to skb->hash"), skb->rxhash
was renamed to skb->hash. Update references in Documentation
accordingly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To benefit from special filters for single SFF or single EFF CAN identifier
subscriptions the CAN_EFF_FLAG bit and the CAN_RTR_FLAG bit has to be set
together with the CAN_(SFF|EFF)_MASK in can_filter.mask.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
An initial attempt on describing some of the odd APIs
provided by this driver.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
In particular, this patch tries to clarify internal BPF calling
convention and adds internal BPF examples, JIT guide, use cases.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/igb/e1000_mac.c
net/core/filter.c
Both conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The aggresive load balancing causes packet re-ordering as active
flows are moved from a slave to another within the group. Sometime
this aggresive lb is not necessary if the preference is for less
re-ordering. This parameter if used with value "0" disables
this dynamic flow shuffling minimizing packet re-ordering. Of course
the side effect is that it has to live with the static load balancing
that the hashing distribution provides. This impact is less severe if
the correct xmit-hashing-policy is used for the tlb setup.
The default value of the parameter is set to "1" mimicing the earlier
behavior.
Ran the netperf test with 200 stream for 1 min between two hosts with
4x1G trunk (xmit-lb mode with xmit-policy L3+4) before and after these
changes. Following was the command used for those 200 instances -
netperf -t TCP_RR -l 60 -s 5 -H <host> -- -r81920,81920
Transactions per second:
Before change: 1,367.11
After change: 1,470.65
Change-Id: Ie3f75c77282cf602e83a6e833c6eb164e72a0990
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Re-organized the xmit function for the lb mode separating tlb xmit
from the alb mode. This will enable use of the hashing policies
like 802.3ad mode. Also extended use of xmit-hash-policy to tlb mode.
Now the tlb-mode defaults to BOND_XMIT_POLICY_LAYER2 if the xmit policy
module parameter is not set (just like 802.3ad, or Xor mode).
Change-Id: I140257403d272df75f477b380207338d0f04963e
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a new ancillary load (bpf call in eBPF parlance) that produces
a 32-bit random number. We are implementing it as an ancillary load
(instead of an ISA opcode) because (a) it is simpler, (b) allows easy
JITing, and (c) seems more in line with generic ISAs that do not have
"get a random number" as a instruction, but as an OS call.
The main use for this ancillary load is to perform random packet sampling.
Signed-off-by: Chema Gonzalez <chema@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Here is my initial pull request for the networking subsystem during
this merge window:
1) Support for ESN in AH (RFC 4302) from Fan Du.
2) Add full kernel doc for ethtool command structures, from Ben
Hutchings.
3) Add BCM7xxx PHY driver, from Florian Fainelli.
4) Export computed TCP rate information in netlink socket dumps, from
Eric Dumazet.
5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
Dichtel.
6) Convert many drivers to pci_enable_msix_range(), from Alexander
Gordeev.
7) Record SKB timestamps more efficiently, from Eric Dumazet.
8) Switch to microsecond resolution for TCP round trip times, also
from Eric Dumazet.
9) Clean up and fix 6lowpan fragmentation handling by making use of
the existing inet_frag api for it's implementation.
10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.
11) Auto size SKB lengths when composing netlink messages based upon
past message sizes used, from Eric Dumazet.
12) qdisc dumps can take a long time, add a cond_resched(), From Eric
Dumazet.
13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
Get rid of never-used-in-tree netpoll RX handling. From Eric W
Biederman.
14) Support inter-address-family and namespace changing in VTI tunnel
driver(s). From Steffen Klassert.
15) Add Altera TSE driver, from Vince Bridgers.
16) Optimizing csum_replace2() so that it doesn't adjust the checksum
by checksumming the entire header, from Eric Dumazet.
17) Expand BPF internal implementation for faster interpreting, more
direct translations into JIT'd code, and much cleaner uses of BPF
filtering in non-socket ocntexts. From Daniel Borkmann and Alexei
Starovoitov"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
net: Add a test to see if a skb is freeable in irq context
qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
net: ptp: move PTP classifier in its own file
net: sxgbe: make "core_ops" static
net: sxgbe: fix logical vs bitwise operation
net: sxgbe: sxgbe_mdio_register() frees the bus
Call efx_set_channels() before efx->type->dimension_resources()
xen-netback: disable rogue vif in kthread context
net/mlx4: Set proper build dependancy with vxlan
be2net: fix build dependency on VxLAN
mac802154: make csma/cca parameters per-wpan
mac802154: allow only one WPAN to be up at any given time
net: filter: minor: fix kdoc in __sk_run_filter
netlink: don't compare the nul-termination in nla_strcmp
can: c_can: Avoid led toggling for every packet.
can: c_can: Simplify TX interrupt cleanup
can: c_can: Store dlc private
can: c_can: Reduce register access
can: c_can: Make the code readable
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iEYEABECAAYFAlM6jfsACgkQjTAFq1RaXHOVTwCeOyOYOJ+Ze2x33WydGcGsvX8a
QPoAoJVAcekmL5MVbcMTTvc8QL1AebAm
=4aEp
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-3.15-20140401' of git://gitorious.org/linux-can/linux-can
linux-can-fixes-for-3.15-20140401
Marc Kleine-Budde says:
====================
this is a pull request of 16 patches for the 3.15 release cycle.
Bjorn Van Tilt contributes a patch which fixes a memory leak in usb_8dev's
usb_8dev_start_xmit()s error path. A patch by Robert Schwebel fixes a typo in
the can documentation. The remaining patches all target the c_can driver. Two
of them are by me; they add a missing netif_napi_del() and return value
checking. Thomas Gleixner contributes 12 patches, which address several
shortcomings in the driver like hardware initialisation, concurrency, message
ordering and poor performance.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the name of the parameter to configure the sample point used
in iproute2's ip command. The correct writing is "sample-point" not
"sample_point".
Signed-off-by: Robert Schwebel <r.schwebel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Further extend the current BPF documentation to document new BPF
engine internals. Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the MDIO bus documentation to mention that the MDIO bus reset
function is completely optional. It became optional with commit
e13934563d ("[PATCH] PHY Layer fixup")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>