This patch adds the receiver side and the (fast-path) activation part for
dynamic changes of non-negotiable (NN) parameters in (PART)OPEN state.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.uk>
In contrast to static feature negotiation at the begin of a connection, this
patch introduces support for exchange of dynamically changing options.
Such an update/exchange is necessary in at least two cases:
* CCID-2's Ack Ratio (RFC 4341, 6.1.2) which changes during the connection;
* Sequence Window values that, as per RFC 4340, 7.5.2, should be sent "as
the connection progresses".
Both are non-negotiable (NN) features, which means that no new capabilities
are negotiated, but rather that changes in known parameters are brought
up-to-date at either end.
Thse characteristics are reflected by the implementation:
* only NN options can be exchanged after connection setup;
* an ack is scheduled directly after activation to speed up the update;
* CCIDs may request changes to an NN feature even if a negotiation for that
feature is already underway: this is required by CCID-2, where changes in
cwnd necessitate Ack Ratio changes, such that the previous Ack Ratio (which
is still being negotiated) would cause irrecoverable RTO timeouts (thanks
to work by Samuel Jero).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.uk>
CCID-2's cwnd increases like TCP during slow-start, which has implications for
* the local Sequence Window value (should be > cwnd),
* the Ack Ratio value.
Hence an exponential growth, if it does not reflect the actual network
conditions, can quickly lead to instability.
This patch adds congestion-window validation (RFC2861) to CCID-2:
* cwnd is constrained if the sender is application limited;
* cwnd is reduced after a long idle period, as suggested in the '90 paper
by Van Jacobson, in RFC 2581 (sec. 4.1);
* cwnd is never reduced below the RFC 3390 initial window.
As marked in the comments, the code is actually almost a direct copy of the
TCP congestion-window-validation algorithms. By continuing this work, it may
in future be possible to use the TCP code (not possible at the moment).
The mechanism can be turned off using a module parameter. Sampling of the
currently-used window (moving-maximum) is however done constantly; this is
used to determine the expected window, which can be exploited to regulate
DCCP's Sequence Window value.
This patch also sets slow-start-after-idle (RFC 4341, 5.1), i.e. it behaves like
TCP when net.ipv4.tcp_slow_start_after_idle = 1.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This replaces a switch statement with a test, using the equivalent
function dccp_data_packet(skb). It also doubles the range of the field
`rx_num_data_pkts' by changing the type from `int' to `u32', avoiding
signed/unsigned comparison with the u16 field `dccps_r_ack_ratio'.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This moves CCID-2's initial window function into the header file, since several
parts throughout the CCID-2 code need to call it (CCID-2 still uses RFC 3390).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Leandro Melo de Sales <leandro@ic.ufal.br>
Change the CCID (de)activation message to start with the
protocol name, as 'CCID' is already in there.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Realising the following call pattern,
* first dccp_entail() is called to enqueue a new skb and
* then skb_clone() is called to transmit a clone of that skb,
this patch integrates both into the same function.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch rearranges the order of statements of the slow-path input processing
(i.e. any other state than OPEN), to resolve the following issues.
1. Dependencies: the order of statements now better matches RFC 4340, 8.5, i.e.
step 7 is before step 9 (previously 9 was before 7), and parsing options in
step 8 (which may consume resources) now comes after step 7.
2. Sequence number checks are omitted if in state LISTEN/REQUEST, due to the
note underneath the table in RFC 4340, 7.5.3.
As a result, CCID processing is now indeed confined to OPEN/PARTOPEN states,
i.e. congestion control is performed only on the flow of data packets. This
avoids pathological cases of doing congestion control on those messages
which set up and terminate the connection.
3. Packets are now passed on to Ack Vector / CCID processing only after
- step 7 (receive unexpected packets),
- step 9 (receive Reset),
- step 13 (receive CloseReq),
- step 14 (receive Close)
and only if the state is PARTOPEN. This simplifies CCID processing:
- in LISTEN/CLOSED the CCIDs are non-existent;
- in RESPOND/REQUEST the CCIDs have not yet been negotiated;
- in CLOSEREQ and active-CLOSING the node has already closed this socket;
- in passive-CLOSING the client is waiting for its Reset.
In the last case, RFC 4340, 8.3 leaves it open to ignore further incoming
data, which is the approach taken here.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This allows us to acquire the exact route keying information from the
protocol, however that might be managed.
It handles all of the possibilities, from the simplest case of storing
the key in inet->cork.fl to the more complex setup SCTP has where
individual transports determine the flow.
Signed-off-by: David S. Miller <davem@davemloft.net>
Operation order is now transposed, we first create the child
socket then we try to hook up the route.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.
Signed-off-by: David S. Miller <davem@davemloft.net>
A length of zero (after subtracting two for the type and len fields) for
the DCCPO_{CHANGE,CONFIRM}_{L,R} options will cause an underflow due to
the subtraction. The subsequent code may read past the end of the
options value buffer when parsing. I'm unsure of what the consequences
of this might be, but it's probably not good.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst
Signed-off-by: David S. Miller <davem@davemloft.net>
We lack proper synchronization to manipulate inet->opt ip_options
Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.
Another thread can change inet->opt pointer and free old one under us.
Use RCU to protect inet->opt (changed to inet->inet_opt).
Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.
We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions are used together as a unit for route resolution
during connect(). They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.
It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.
Let the callers provide the on-stack flow object. That way we only
need to initialize it once in the ip_route_connect() call.
Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.
Also, describe why this set of facilities are needed and how it works
in a big comment.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*
This will let us to create AF optimal flow instances.
It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.
Signed-off-by: David S. Miller <davem@davemloft.net>
I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.
This is the first step to move in that direction.
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug in the order of dccp_rcv_state_process() that still permitted
reception even after closing the socket. A Reset after close thus causes a NULL
pointer dereference by not preventing operations on an already torn-down socket.
dccp_v4_do_rcv()
|
| state other than OPEN
v
dccp_rcv_state_process()
|
| DCCP_PKT_RESET
v
dccp_rcv_reset()
|
v
dccp_time_wait()
WARNING: at net/ipv4/inet_timewait_sock.c:141 __inet_twsk_hashdance+0x48/0x128()
Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
[<c0038850>] (unwind_backtrace+0x0/0xec) from [<c0055364>] (warn_slowpath_common)
[<c0055364>] (warn_slowpath_common+0x4c/0x64) from [<c0055398>] (warn_slowpath_n)
[<c0055398>] (warn_slowpath_null+0x1c/0x24) from [<c02b72d0>] (__inet_twsk_hashd)
[<c02b72d0>] (__inet_twsk_hashdance+0x48/0x128) from [<c031caa0>] (dccp_time_wai)
[<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
[<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
[<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
[<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
[<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
The fix is by testing the socket state first. Receiving a packet in Closed state
now also produces the required "No connection" Reset reply of RFC 4340, 8.3.1.
Reported-and-tested-by: Johan Hovold <jhovold@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Route lookups follow a general pattern in the ipv6 code wherein
we first find the non-IPSEC route, potentially override the
flow destination address due to ipv6 options settings, and then
finally make an IPSEC search using either xfrm_lookup() or
__xfrm_lookup().
__xfrm_lookup() is used when we want to generate a blackhole route
if the key manager needs to resolve the IPSEC rules (in this case
-EREMOTE is returned and the original 'dst' is left unchanged).
Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC
resolution is necessary, we simply fail the lookup completely.
All of these cases are encapsulated into two routines,
ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow. The latter of which
handles unconnected UDP datagram sockets.
Signed-off-by: David S. Miller <davem@davemloft.net>
Declaration and assignment of newdp is removed. Usage of dccp_sk()
exhibit no side effects.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.
Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.
Rewrite ip_route_newports() such that:
1) The caller passes in the original port values, so we don't need
to use the rth->fl.fl_ip_{s,d}port values to remember them.
2) The lookup flow is constructed by hand instead of being copied
from the routing cache entry's flow.
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'seq_window' sysctl sets the initial value for the DCCP Sequence Window,
which may range from 32..2^46-1 (RFC 4340, 7.5.2). The patch sets the upper
bound consistently to 2^32-1 on both 32 and 64 bit systems, which should be
sufficient - with a RTT of 1sec and 1-byte packets, a seq_window of 2^32-1
corresponds to a link speed of 34 Gbps.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Currently dccp_check_seqno allows any valid packet to update the Greatest
Sequence Number Received, even if that packet's sequence number is less than
the current GSR. This patch adds a check to make sure that the new packet's
sequence number is greater than GSR.
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Currently dccp_check_seqno returns 0 (indicating a valid packet) if the
acknowledgment number is out of bounds and the sync that RFC 4340 mandates at
this point is currently being rate-limited. This function should return -1,
indicating an invalid packet.
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
Remove macros which have been unused since the initial implementation
(commit 7c657876b6, [DCCP]: Initial
implementation from Tue Aug 9 20:14:34 2005 -0700).
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Ensure that cmsg->cmsg_type value is valid for qpolicy
that is currently in use.
Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch adds a generic infrastructure for policy-based dequeueing of
TX packets and provides two policies:
* a simple FIFO policy (which is the default) and
* a priority based policy (set via socket options).
Both policies honour the tx_qlen sysctl for the maximum size of the write
queue (can be overridden via socket options).
The priority policy uses skb->priority internally to assign an u32 priority
identifier, using the same ranking as SO_PRIORITY. The skb->priority field
is set to 0 when the packet leaves DCCP. The priority is supplied as ancillary
data using cmsg(3), the patch also provides the requisite parsing routines.
Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a bug in updating the Greatest Acknowledgment number Received (GAR):
the current implementation does not track the greatest received value -
lower values in the range AWL..AWH (RFC 4340, 7.5.1) erase higher ones.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes whitespace noise introduced in commit "dccp ccid-2: Algorithm to
update buffer state", 5753fdfe8b, 14 Nov 2010.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the macros defined for the members of flowi to clean the code up.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the documentation refers to web pages under
the domain `osdl.org'. However, `osdl.org' now
redirects to `linuxfoundation.org'.
Rather than rely on redirections, this patch updates
the addresses appropriately; for the most part, only
documentation that is meant to be current has been
updated.
The patch should be pretty quick to scan and check;
each new web-page url was gotten by trying out the
original URL in a browser and then simply copying the
the redirected URL (formatting as necessary).
There is some conflict as to which one of these domain
names is preferred:
linuxfoundation.org
linux-foundation.org
So, I wrote:
info@linuxfoundation.org
and got this reply:
Message-ID: <4CE17EE6.9040807@linuxfoundation.org>
Date: Mon, 15 Nov 2010 10:41:42 -0800
From: David Ames <david@linuxfoundation.org>
...
linuxfoundation.org is preferred. The canonical name for our web site is
www.linuxfoundation.org. Our list site is actually
lists.linux-foundation.org.
Regarding email linuxfoundation.org is preferred there are a few people
who choose to use linux-foundation.org for their own reasons.
Consequently, I used `linuxfoundation.org' for web pages and
`lists.linux-foundation.org' for mailing-list web pages and email addresses;
the only personal email address I updated from `@osdl.org' was that of
Andrew Morton, who prefers `linux-foundation.org' according `git log'.
Signed-off-by: Michael Witten <mfwitten@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch replaces an almost identical replication of code: large parts
of dccp_parse_options() re-appeared as ccid2_ackvector() in ccid2.c.
Apart from the duplication, this caused two more problems:
1. CCIDs should not need to be concerned with parsing header options;
2. one can not assume that Ack Vectors appear as a contiguous area within an
skb, it is legal to insert other options and/or padding in between. The
current code would throw an error and stop reading in such a case.
Since Ack Vectors provide CCID-specific information, they are now processed
by the CCID directly, separating this functionality from the main DCCP code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes
* functions for which updates have been provided in the preceding patches and
* the @av_vec_len field - it is no longer necessary since the buffer length is
now always computed dynamically.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The problem with Ack Vectors is that
i) their length is variable and can in principle grow quite large,
ii) it is hard to predict exactly how large they will be.
Due to the second point it seems not a good idea to reduce the MPS; in
particular when on average there is enough room for the Ack Vector and an
increase in length is momentarily due to some burst loss, after which the
Ack Vector returns to its normal/average length.
The solution taken by this patch is to subtract a minimum-expected Ack Vector
length from the MPS, and to defer any larger Ack Vectors onto a separate
Sync - but only if indeed there is no space left on the skb.
This patch provides the infrastructure to schedule Sync-packets for transporting
(urgent) out-of-band data. Its signalling is quicker than scheduling an Ack, since
it does not need to wait for new application data.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This aggregates Ack Vector processing (handling input and clearing old state)
into one function, for the following reasons and benefits:
* all Ack Vector-specific processing is now in one place;
* duplicated code is removed;
* ensuring sanity: from an Ack Vector point of view, it is better to clear the
old state first before entering new state;
* Ack Event handling happens mostly within the CCIDs, not the main DCCP module.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch updates the code which registers new packets as received, using the
new circular buffer interface. It contributes a new algorithm which
* supports both tail/head pointers and buffer wrap-around and
* deals with overflow (head/tail move in lock-step).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This provides a routine to consistently update the buffer state when the
peer acknowledges receipt of Ack Vectors; updating state in the list of Ack
Vectors as well as in the circular buffer.
While based on RFC 4340, several additional (and necessary) precautions were
added to protect the consistency of the buffer state. These additions are
essential, since analysis and experience showed that the basic algorithm was
insufficient for this task (which lead to problems that were hard to debug).
The algorithm now
* deals with HC-sender acknowledging to HC-receiver and vice versa,
* keeps track of the last unacknowledged but received seqno in tail_ackno,
* has special cases to reset the overflow condition when appropriate,
* is protected against receiving older information (would mess up buffer state).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This completes the implementation of a circular buffer for Ack Vectors, by
extending the current (linear array-based) implementation. The changes are:
(a) An `overflow' flag to deal with the case of overflow. As before, dynamic
growth of the buffer will not be supported; but code will be added to deal
robustly with overflowing Ack Vector buffers.
(b) A `tail_seqno' field. When naively implementing the algorithm of Appendix A
in RFC 4340, problems arise whenever subsequent Ack Vector records overlap,
which can bring the entire run length calculation completely out of synch.
(This is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/\
ack_vectors/tracking_tail_ackno/ .)
(c) The buffer length is now computed dynamically (i.e. current fill level),
as the span between head to tail.
As a result, dccp_ackvec_pending() is now simpler - the #ifdef is no longer
necessary since buf_empty is always true when IP_DCCP_ACKVEC is not configured.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch
* separates Ack Vector housekeeping code from option-insertion code;
* shifts option-specific code from ackvec.c into options.c;
* introduces a dedicated routine to take care of the Ack Vector records;
* simplifies the dccp_ackvec_insert_avr() routine: the BUG_ON was redundant,
since the list is automatically arranged in descending order of ack_seqno.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch brings the Ack Vector interface up to date. Its main purpose is
to lay the basis for the subsequent patches of this set, which will use the
new data structure fields and routines.
There are no real algorithmic changes, rather an adaptation:
(1) Replaced the static Ack Vector size (2) with a #define so that it can
be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems
to be sufficient for the moment) and added a solution so that computing
the ECN nonce will continue to work - even with larger Ack Vectors.
(2) Replaced the #defines for Ack Vector states with a complete enum.
(3) Replaced #defines to compute Ack Vector length and state with general
purpose routines (inlines), and updated code to use these.
(4) Added a `tail' field (conversion to circular buffer in subsequent patch).
(5) Updated the (outdated) documentation for Ack Vector struct.
(6) All sequence number containers now trimmed to 48 bits.
(7) Removal of unused bits:
* removed dccpav_ack_nonce from struct dccp_ackvec, since this is already
redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record);
* removed Elapsed Time for Ack Vectors (it was nowhere used);
* replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since
the code needs to be able to remember the old run length;
* reduced the de-/allocation routines (redundant / duplicate tests).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates CCID-2 to use the CCID dequeuing mechanism, converting from
previous continuous-polling to a now event-driven mechanism.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends the existing wait-for-ccid routine so that it may be used with
different types of CCID, addressing the following problems:
1) The queue-drain mechanism only works with rate-based CCIDs. If CCID-2 for
example has a full TX queue and becomes network-limited just as the
application wants to close, then waiting for CCID-2 to become unblocked
could lead to an indefinite delay (i.e., application "hangs").
2) Since each TX CCID in turn uses a feedback mechanism, there may be changes
in its sending policy while the queue is being drained. This can lead to
further delays during which the application will not be able to terminate.
3) The minimum wait time for CCID-3/4 can be expected to be the queue length
times the current inter-packet delay. For example if tx_qlen=100 and a delay
of 15 ms is used for each packet, then the application would have to wait
for a minimum of 1.5 seconds before being allowed to exit.
4) There is no way for the user/application to control this behaviour. It would
be good to use the timeout argument of dccp_close() as an upper bound. Then
the maximum time that an application is willing to wait for its CCIDs to can
be set via the SO_LINGER option.
These problems are addressed by giving the CCID a grace period of up to the
`timeout' value.
The wait-for-ccid function is, as before, used when the application
(a) has read all the data in its receive buffer and
(b) if SO_LINGER was set with a non-zero linger time, or
(c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ
state (client application closes after receiving CloseReq).
In addition, there is a catch-all case of __skb_queue_purge() after waiting for
the CCID. This is necessary since the write queue may still have data when
(a) the host has been passively-closed,
(b) abnormal termination (unread data, zero linger time),
(c) wait-for-ccid could not finish within the given time limit.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends the packet dequeuing interface of dccp_write_xmit() to allow
1. CCIDs to take care of timing when the next packet may be sent;
2. delayed sending (as before, with an inter-packet gap up to 65.535 seconds).
The main purpose is to take CCID-2 out of its polling mode (when it is network-
limited, it tries every millisecond to send, without interruption).
The mode of operation for (2) is as follows:
* new packet is enqueued via dccp_sendmsg() => dccp_write_xmit(),
* ccid_hc_tx_send_packet() detects that it may not send (e.g. window full),
* it signals this condition via `CCID_PACKET_WILL_DEQUEUE_LATER',
* dccp_write_xmit() returns without further action;
* after some time the wait-condition for CCID becomes true,
* that CCID schedules the tasklet,
* tasklet function calls ccid_hc_tx_send_packet() via dccp_write_xmit(),
* since the wait-condition is now true, ccid_hc_tx_packet() returns "send now",
* packet is sent, and possibly more (since dccp_write_xmit() loops).
Code reuse: the taskled function calls dccp_write_xmit(), the timer function
reduces to a wrapper around the same code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reorganises the return value convention of the CCID TX sending
function, to permit more flexible schemes, as required by subsequent patches.
Currently the convention is
* values < 0 mean error,
* a value == 0 means "send now", and
* a value x > 0 means "send in x milliseconds".
The patch provides symbolic constants and a function to interpret return values.
In addition, it caps the maximum positive return value to 0xFFFF milliseconds,
corresponding to 65.535 seconds. This is possible since in CCID-3/4 the
maximum possible inter-packet gap is fixed at t_mbi = 64 sec.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket).
Unfortunately, if you're using the TPROXY target to redirect skbs to a
transparent proxy that assumption is not true anymore and things break.
This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.
Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
This omits the redundant "DCCP:" in warning messages, since DCCP_WARN() already
echoes the function name, avoiding messages like
kernel: [10988.766503] dccp_close: DCCP: ABORT -- 209 bytes unread
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This schedules an Ack when receiving a timestamp, exploiting the
existing inet_csk_schedule_ack() function, saving one case in the
`dccp_ack_pending()' function.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch generalises the task of determining data loss from RFC 4340, 7.7.1.
Let S_A, S_B be sequence numbers such that S_B is "after" S_A, and let
N_B be the NDP count of packet S_B. Then, using modulo-2^48 arithmetic,
D = S_B - S_A - 1 is an upper bound of the number of lost data packets,
D - N_B is an approximation of the number of lost data packets
(there are cases where this is not exact).
The patch implements this as
dccp_loss_count(S_A, S_B, N_B) := max(S_B - S_A - 1 - N_B, 0)
Signed-off-by: Ivo Calado <ivocalado@embedded.ufcg.edu.br>
Signed-off-by: Erivaldo Xavier <desadoc@gmail.com>
Signed-off-by: Leandro Sales <leandroal@gmail.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the argument `more' from ccid_hc_tx_packet_sent, since it was
nowhere used in the entire code.
(Btw, this argument was not even used in the original KAME code where the
function initially came from; compare the variable moreToSend in the
freebsd61-dccp-kame-28.08.2006.patch kept by Emmanuel Lochin.)
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
After moving the assignment of GAR/ISS from dccp_connect_init() to
dccp_transmit_skb(), the former function becomes very small, so that
a merger with dccp_connect() suggests itself.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a problem and a potential loophole with regard to seqno/ackno
validity: currently the initial adjustments to AWL/SWL are only performed
once at the begin of the connection, during the handshake.
Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.
This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.
Therefore it is better to perform that safety check each time SWL/AWL are
updated, as implemented by the patch.
A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.
During the initial handshake we have more stringent sequence number protection;
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Remove dead code and make some functions static.
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The `options_received' struct is redundant, since it re-duplicates the existing
`p' and `x_recv' fields. This patch removes the sub-struct and migrates the
format conversion operations to ccid3_hc_tx_parse_options().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This adds a function to take care of the following, separate cases occurring in
the computation of the Loss Rate p:
* 1/(2^32-1) is mapped into 0% as per RFC 4342, 8.5;
* 1/0 is mapped into 100%, the maximum;
* to avoid that p = 1/x is rounded down to 0 when x is very large, since this
means accidentally re-entering slow-start indicated by p == 0, the minimum
resolution value of p is now returned instead;
* a bug in ccid3_hc_rx_getsockopt is fixed: 1/0 was mapped into ~0U.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch is thanks to an investigation by Leandro Sales de Melo and his
colleagues. They worked out two state diagrams which highlight the fact that
the xxx_TERM states in CCID-3/4 are in fact not necessary.
And this can be confirmed by in turn looking at the code: the xxx_TERM states
are only ever set in ccid3_hc_{rx,tx}_exit(): when CCID-3 sets the state
to xxx_TERM, it is at a time where no more processing should be going on,
hence it is not necessary to introduce a dedicated exit state - this is already
implied by unloading the CCID.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The constants DCCPO_{MIN,MAX}_CCID_SPECIFIC are nowhere used in the code, but
instead for the CCID-specific options numbers are used.
This patch unifies the use of CCID-specific option numbers, by adding symbolic
names reflecting the definitions in RFC 4340, 10.3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This
1. adds packet type information to ccid_hc_{rx,tx}_parse_options(). This is
necessary, since table 3 in RFC 4340, 5.8 leaves it to the CCIDs to state
which options may (not) appear on what packet type.
2. adds such a check for CCID-3's {Loss Event, Receive} Rate as specified in
RFC 4340 8.3 ("Receive Rate options MUST NOT be sent on DCCP-Data packets")
and 8.5 ("Loss Event Rate options MUST NOT be sent on DCCP-Data packets").
3. removes an unused argument `idx' from ccid_hc_{rx,tx}_parse_options(). This
is also no longer necessary, since the CCID-specific option-parsing routines
are passed every single parameter of the type-length-value option encoding.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This simplifies and consolidates the TX option-parsing code:
1. The Loss Intervals option is not currently used, so dead code related to
this option is removed. I am aware of no plans to support the option, but
if someone wants to implement it (e.g. for inter-op tests), it is better
to start afresh than having to also update currently unused code.
2. The Loss Event and Receive Rate options have a lot of code in common (both
are 32 bit, both have same length etc.), so this is consolidated.
3. The test against GSR is not necessary, because
- on first loading CCID3, ccid_new() zeroes out all fields in the socket;
- ccid3_hc_tx_packet_recv() treats 0 and ~0U equivalently, due to
pinv = opt_recv->ccid3or_loss_event_rate;
if (pinv == ~0U || pinv == 0)
hctx->p = 0;
- as a result, the sequence number field is removed from opt_recv.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the RTT-sampling function tfrc_tx_hist_rtt(), since
1. it suffered from complex passing of return values (the return value both
indicated successful lookup while the value doubled as RTT sample);
2. when for some odd reason the sample value equalled 0, this triggered a bug
warning about "bogus Ack", due to the ambiguity of the return value;
3. on a passive host which has not sent anything the TX history is empty and
thus will lead to unwanted "bogus Ack" warnings such as
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-28197148
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-26641606.
The fix is to replace the implicit encoding by performing the steps manually.
Furthermore, the "bogus Ack" warning has been removed, since it can actually be
triggered due to several reasons (network reordering, old packet, (3) above),
hence it is not very useful.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a subtle bug in the calculation of the inter-packet gap and shows
that t_delta, as it is currently used, is not needed.
The algorithm from RFC 5348, 8.3 below continually computes a send time t_nom,
which is initialised with the current time t_now; t_gran = 1E6 / HZ specifies
the scheduling granularity, s the packet size, and X the sending rate:
t_distance = t_nom - t_now; // in microseconds
t_delta = min(t_ipi, t_gran) / 2; // `delta' parameter in microseconds
if (t_distance >= t_delta) {
reschedule after (t_distance / 1000) milliseconds;
} else {
t_ipi = s / X; // inter-packet interval in usec
t_nom += t_ipi; // compute the next send time
send packet now;
}
Problem:
--------
Rescheduling requires a conversion into milliseconds (sk_reset_timer()). The
highest jiffy resolution with HZ=1000 is 1 millisecond, so using a higher
granularity does not make much sense here.
As a consequence, values of t_distance < 1000 are truncated to 0. This issue
has so far been resolved by using instead
if (t_distance >= t_delta + 1000)
reschedule after (t_distance / 1000) milliseconds;
This is unnecessarily large, a lower bound is t_delta' = max(t_delta, 1000).
And it implies a further simplification:
a) when HZ >= 500, then t_delta <= t_gran/2 = 10^6/(2*HZ) <= 1000, so that
t_delta' = MAX(1000, t_delta) = 1000 (constant value);
b) when HZ < 500, then t_delta = 1/2*MIN(rtt, t_ipi, t_gran) <= t_gran/2,
so that 1000 <= t_delta' <= t_gran/2.
The maximum error of using a constant t_delta in (b) is less than half a jiffy.
Fix:
----
The patch replaces t_delta with a constant, whose value depends on CONFIG_HZ,
changing the above algorithm to:
if (t_distance >= t_delta')
reschedule after (t_distance / 1000) milliseconds;
where t_delta' = 10^6/(2*HZ) if HZ < 500, and t_delta' = 1000 otherwise.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This makes RTAX_RTO_MIN also available to CCID-3, replacing the compile-time
RTO lower bound with a per-route tunable value.
The original Kconfig option solved the problem that a very low RTT (in the
order of HZ) can trigger too frequent and unnecessary reductions of the
sending rate.
This tunable does not affect the initial RTO value of 2 seconds specified in
RFC 5348, section 4.2 and Appendix B. But like the hardcoded Kconfig value,
it allows to adapt to network conditions.
The same effect as the original Kconfig option of 100ms is now achieved by
> ip route replace to unicast 192.168.0.0/24 rto_min 100j dev eth0
(assuming HZ=1000).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a fixed RTO_MIN of 0.2 seconds was found to cause problems for CCID-2
over 802.11g: at least once per session there was a spurious timeout. It
helped to then increase the the value of RTO_MIN over this link.
Since the problem is the same as in TCP, this patch makes the solution from
commit "05bb1fad1cde025a864a90cfeb98dcbefe78a44a"
"[TCP]: Allow minimum RTO to be configurable via routing metrics."
available to DCCP.
This avoids reinventing the wheel, so that e.g. the following works in the
expected way now also for CCID-2:
> ip route change 10.0.0.2 rto_min 800 dev ath0
Luckily this useful rto_min function was recently moved to net/tcp.h,
which simplifies sharing code originating from TCP.
Documentation also updated (plus minor whitespace fixes).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch consolidates initial-window code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the wrappers around the sk timer functions, since not much is
gained from using them: the BUG_ON in start_rto_timer will never trigger
since that function is called only if:
* the RTO timer expires (rto_expire, and then timer_pending() is false);
* in tx_packet_sent only if !timer_pending() (BUG_ON is redundant here);
* previously in new_ack, after stopping the timer (timer_pending() false).
Removing the wrappers also clears the way for eventually replacing the
RTO timer with the icsk-retransmission-timer, as it is already part of the
DCCP socket.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since CCID-2 is de facto a mini implementation of TCP, it makes sense to share
as much code as possible.
Hence this patch aligns CCID-2 timestamping with TCP timestamping.
This also halves the space consumption (on 64-bit systems).
The necessary include file <net/tcp.h> is already included by way of
net/dccp.h. Redundant includes have been removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current CCID-2 RTT estimator code is in parts broken and lags behind the
suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR.
That code is replaced by the present patch, which reuses the Linux TCP RTT
estimator code.
Further details:
----------------
1. The minimum RTO of previously one second has been replaced with TCP's, since
RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4)
is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's
concept of a default RTT (RFC 4340, 3.4).
2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with
RFC2988, (2.5).
3. De-inlined the function ccid2_new_ack().
4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will
give the wrong estimate. It should be replaced with one sample per Ack.
However, at the moment this can not be resolved easily, since
- it depends on TX history code (which also needs some work),
- the cleanest solution is not to use the `sent' time at all (saves 4 bytes
per entry) and use DCCP timestamps / elapsed time to estimated the RTT,
which however is non-trivial to get right (but needs to be done).
Reasons for reusing the Linux TCP estimator algorithm:
------------------------------------------------------
Some time was spent to find a better alternative, using basic RFC2988 as a first
step. Further analysis and experimentation showed that the Linux TCP RTO
estimator is superior to a basic RFC2988 implementation. A summary is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/
In addition, this estimator fared well in a recent empirical evaluation:
Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith.
A Performance Study of Loss Detection/Recovery in Real-world TCP
Implementations. Proceedings of 15th IEEE International
Conference on Network Protocols (ICNP-07), 2007.
Thus there is significant benefit in reusing the existing TCP code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the dec_pipe function and improves the way the RTO timer is rearmed
when a new acknowledgment comes in.
Details and justification for removal:
--------------------------------------
1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX
history entries between tail and head, for which it had previously been
incremented in tx_packet_sent; and it is not decremented twice for the same
entry, since it is
- either decremented when a corresponding Ack Vector cell in state 0 or 1
was received (and then ccid2s_acked==1),
- or it is decremented when ccid2s_acked==0, as part of the loss detection
in tx_packet_recv (and hence it can not have been decremented earlier).
2) Restarting the RTO timer happens for every single entry in each Ack Vector
parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to
16192 times per Ack Vector).
3) The RTO timer should not be restarted when all outstanding data has been
acknowledged. This is currently done similar to (2), in dec_pipe, when
pipe has reached 0.
The patch onsolidates the code which rearms the RTO timer, combining the
segments from new_ack and dec_pipe. As a result, the code becomes clearer
(compare with tcp_rearm_rto()).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the ccid2_hc_tx_check_sanity function: it is redundant.
Details:
The tx_check_sanity function performs three tests:
1) it checks that the circular TX list is sorted
- in ascending order of sequence number (ccid2s_seq)
- and time (ccid2s_sent),
- in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh);
2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN;
3) it ensures that pipe equals the number of packets that were not
marked `acked' (ccid2s_acked) between `tail' and `head'.
The following argues that each of these tests is redundant, this can be verified
by going through the code.
(1) is not necessary, since both time and GSS increase from one packet to the
next, so that subsequent insertions in tx_packet_sent (which advance the `head'
pointer) will be in ascending order of time and sequence number.
In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN
(set to 1024) unless allocation caused an earlier failure, because:
* at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1;
* subsequent calls to tx_alloc_seq take place whenever head->next == tail in
tx_packet_sent; then a new chunk of size 1024 is inserted between head and
tail, and seqbufc is incremented by one.
To show that (3) is redundant requires looking at two cases.
The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and
decremented in tx_packet_recv. When head == tail (TX history empty) then pipe
should be 0, which is the case directly after initialisation and after a
retransmission timeout has occurred (ccid2_hc_tx_rto_expire).
The first case involves parsing Ack Vectors for packets recorded in the live
portion of the buffer, between tail and head. For each packet marked by the
receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by
one, so for all such packets the BUG_ON in tx_check_sanity will not trigger.
The second case is the loss detection in the second half of tx_packet_recv,
below the comment "Check for NUMDUPACK".
The first while-loop here ensures that the sequence number of `seqp' is either
above or equal to `high_ack', or otherwise equal to the highest sequence number
sent so far (of the entry head->prev, as head points to the next unsent entry).
The next while-loop ("while (1)") counts the number of acked packets starting
from that position of seqp, going backwards in the direction from head->prev to
tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points
to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block
is entered next.
The while-loop contained within that block in turn traverses the list backwards,
from head to tail; the position of `seqp' is saved in the variable `last_acked'.
For each packet not marked as `acked', a congestion event is triggered within
the loop, and pipe is decremented. The loop terminates when `seqp' has reached
`tail', whereupon tail is set to the position previously stored in `last_acked'.
Thus, between `last_acked' and the previous position of `tail',
- pipe has been decremented earlier if the packet was marked as state 0 or 1;
- pipe was decremented if the packet was not marked as acked.
That is, pipe has been decremented by the number of packets between `last_acked'
and the previous position of `tail'. As a consequence, pipe now again reflects
the number of packets which have not (yet) been acked between the new position
of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that
the BUG_ON condition in check_sanity will also not be triggered, hence the test
(3) is also redundant.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CCIDs are activated as last of the features, at the end of the handshake,
were the LISTEN state of the master socket is inherited into the server
state of the child socket. Thus, the only states visible to CCIDs now are
OPEN/PARTOPEN, and the closing states.
This allows to remove tests which were previously necessary to protect
against referencing a socket in the listening state (in CCID-3), but which
now have become redundant.
As a further byproduct of enabling the CCIDs only after the connection has been
fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock
can now be eliminated:
* the CCID is loaded, so it is not necessary to test if it is NULL,
* if it is possible to load a CCID and leave the private area NULL, then this
is a bug, which should crash loudly - and earlier,
* the test for state==OPEN || state==PARTOPEN now reduces only to the closing
phase (e.g. when the node has received an unexpected Reset).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch collects cosmetics-only changes to separate these from
code changes:
* update with regard to CodingStyle and whitespace changes,
* documentation:
- adding/revising comments,
- remove CCID-3 RX socket documentation which is either
duplicate or refers to fields that no longer exist,
* expand embedded tfrc_tx_info struct inline for consistency,
removing indirections via #define.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
'gap' is unsigned, so this code is wrong:
gap = -new_head;
...
if (gap > 0) { ... }
Make 'gap' signed.
The semantic patch that finds this problem (many false-positive results):
(http://coccinelle.lip6.fr/)
// <smpl>
@ r1 @
identifier f;
@@
int f(...) { ... }
@@
identifier r1.f;
type T;
unsigned T x;
@@
*x = f(...)
...
*x > 0
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for 64bit snmp counters for some mibs,
add an 'align' parameter to snmp_mib_init(), instead
of assuming mibs only contain 'unsigned long' fields.
Callers can use __alignof__(type) to provide correct
alignment.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is thanks to Andre Noll who reported the issue and helped testing.
The Syn-RTT sampled during the initial handshake currently only works for
the client sending the DCCP-Request. TFRC penalizes the absence of an RTT
sample with a very slow initial speed (1 packet per second), which delays
slow-start significantly, resulting in sluggish performance.
This patch mirrors the "Syn RTT" principle by adding a timestamp also onto
the DCCP-Response, producing an RTT sample when the (Data)Ack completing
the handshake arrives.
Also changed the documentation to 'TFRC' since Syn RTTs are also used by CCID-4.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes an unused 'sk' argument from several option-inserting functions.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove useless union keyword in rtable, rt6_info and dn_route.
Since there is only one member in a union, the union keyword isn't useful.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are more than a dozen occurrences of following code in the
IPv6 stack:
if (opt && opt->srcrt) {
struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
ipv6_addr_copy(&final, &fl.fl6_dst);
ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
final_p = &final;
}
Replace those with a helper. Note that the helper overrides final_p
in all cases. This is ok as final_p was previously initialized to
NULL when declared.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
drivers/net/usb/asix.c: Fix pointer cast.
be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
proc_dointvec: write a single value
hso: add support for new products
Phonet: fix potential use-after-free in pep_sock_close()
ath9k: remove VEOL support for ad-hoc
ath9k: change beacon allocation to prefer the first beacon slot
sock.h: fix kernel-doc warning
cls_cgroup: Fix build error when built-in
macvlan: do proper cleanup in macvlan_common_newlink() V2
be2net: Bug fix in init code in probe
net/dccp: expansion of error code size
ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
wireless: fix sta_info.h kernel-doc warnings
wireless: fix mac80211.h kernel-doc warnings
iwlwifi: testing the wrong variable in iwl_add_bssid_station()
ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
ath9k_htc: dereferencing before check in hif_usb_tx_cb()
rt2x00: Fix rt2800usb TX descriptor writing.
rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
...
- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.
- Make SHRT_MIN of type s16, not int, for consistency.
[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because MIPS's EDQUOT value is 1133(0x46d).
It's larger than u8.
Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.
RCU conversion is pretty much needed :
1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).
[Future patch will add a list anchor for wakeup coalescing]
2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().
3) Respect RCU grace period when freeing a "struct socket_wq"
4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"
5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep
6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.
7) Change all sk_has_sleeper() callers to :
- Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
- Use wq_has_sleeper() to eventually wakeup tasks.
- Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)
8) sock_wake_async() is modified to use rcu protection as well.
9) Exceptions :
macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.
Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define a new function to return the waitqueue of a "struct sock".
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
return sk->sk_sleep;
}
Change all read occurrences of sk_sleep by a call to this function.
Needed for a future RCU conversion. sk_sleep wont be a field directly
available.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.
The patch kills the ipfragok parameter of .queue_xmit().
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.
sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)
This patch converts sk_dst_lock to a spinlock, and use RCU for readers.
__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))
This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet: Remove unused send_check length argument
This patch removes the unused length argument from the send_check
function in struct inet_connection_sock_af_ops.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
There is no point to align or pad mibs to cache lines, they are per cpu
allocated with a 8 bytes alignment anyway.
This wastes space for no gain. This patch removes __SNMP_MIB_ALIGN__
Since SNMP mibs contain "unsigned long" fields only, we can relax the
allocation alignment from "unsigned long long" to "unsigned long"
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dccp: fix panic caused by failed initialisation
This fixes a kernel panic reported thanks to Andre Noll:
if DCCP is compiled into the kernel and any out of the initialisation
steps in net/dccp/proto.c:dccp_init() fail, a subsequent attempt to create
a SOCK_DCCP socket will panic, since inet{,6}_create() are not prevented
from creating DCCP sockets.
This patch fixes the problem by propagating a failure in dccp_init() to
dccp_v{4,6}_init_net(), and from there to dccp_v{4,6}_init(), so that the
DCCP protocol is not made available if its initialisation fails.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __percpu sparse annotations to net.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.
The macro and type tricks around snmp stats make things a bit
interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
snmp_mib_*() users which used to cast the argument to (void **) are
updated to cast it to (void __percpu **).
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
DCCP is datagram-oriented but lacks UDP's support for MSG_TRUNC as defined in
recvmsg(2)/recv(2). Hence the following 'Hello world\0' receiver
len = recv(fd, buf, 10, MSG_PEEK | MSG_TRUNC);
wrongly (always) returns 10, while in UDP it returns 12 as expected.
This patch adds the missing MSG_TRUNC support to recvmsg().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a problem in the DCCP getsockopt() API: currently there is no way
for a user to a priori know the number of built-in CCIDs, other than trying
DCCP_SOCKOPT_AVAILABLE_CCIDS in a loop, incrementing the option length until
EINVAL is no longer returned.
This patch truncates the array to the user-provided length. No copy is made
when the length is <= 0.
Due to the length restriction in do_dccp_getsockopt() to sizeof(int), the
minimum array length remains 4, which is a reasonable default (only 3
CCIDs, CCID-2..4, are currently defined).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes commit (38ff3e6bb9) ("dccp_probe:
Fix module load dependencies between dccp and dccp_probe", from 15 Jan).
It fixes the construction of the first argument of try_then_request_module(),
where only valid return codes from the first argument should be returned.
What we do now is assign the result of register_jprobe() to ret, without
the side effect of the comparison.
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug introduced in commit de4ef86cfc
("dccp: fix dccp rmmod when kernel configured to use slub", 17 Jan): the
vsnprintf used sizeof(slab_name_fmt), which became truncated to 4 bytes, since
slab_name_fmt is now a 4-byte pointer and no longer a 32-character array.
This lead to error messages such as
FATAL: Error inserting dccp: No buffer space available
>> kernel: [ 1456.341501] kmem_cache_create: duplicate cache cci
generated due to the truncation after the 3rd character.
Fixed for the moment by introducing a symbolic constant. Tested to fix the bug.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hey all-
I was tinkering with dccp recently and noticed that I BUG halted the
kernel when I rmmod-ed the dccp module. The bug halt occured because the page
that I passed to kfree failed the PageCompound and PageSlab test in the slub
implementation of kfree. I tracked the problem down to the following set of
events:
1) dccp, unlike all other uses of kmem_cache_create, allocates a string
dynamically when registering a slab cache. This allocated string is freed when
the cache is destroyed.
2) Normally, (1) is not an issue, but when Slub is in use, it is possible that
caches are 'merged'. This process causes multiple caches of simmilar
configuration to use the same cache data structure. When this happens, the new
name of the cache is effectively dropped.
3) (2) results in kmem_cache_name returning an ambigous value (i.e.
ccid_kmem_cache_destroy, which uses this fuction to retrieve the name pointer
for freeing), is no longer guaranteed that the string it assigned is what is
returned.
4) If such merge event occurs, ccid_kmem_cache_destroy frees the wrong pointer,
which trips over the BUG in the slub implementation of kfree (since its likely
not a slab allocation, but rather a pointer into the static string table
section.
So, what to do about this. At first blush this is pretty clearly a leak in the
information that slub owns, and as such a slub bug. Unfortunately, theres no
really good way to fix it, without exposing slub specific implementation details
to the generic slab interface. Also, even if we could fix this in slub cleanly,
I think the RCU free option would force us to do lots of string duplication, not
only in slub, but in every slab allocator. As such, I'd like to propose this
solution. Basically, I just move the storage for the kmem cache name to the
ccid_operations structure. In so doing, we don't have to do the kstrdup or
kfree when we allocate/free the various caches for dccp, and so we avoid the
problem, by storing names with static memory, rather than heap, the way all
other calls to kmem_cache_create do.
I've tested this out myself here, and it solves the problem quite well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__net_init/__net_exit are apparently not going away, so use them
to full extent.
In some cases __net_init was removed, because it was called from
__net_exit code.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was just recently reported to me. When built as modules, the
dccp_probe module has a silent dependency on the dccp module. This
stems from the fact that the module_init routine of dccp_probe
registers a jprobe on the dccp_sendmsg symbol. Since the symbol is
only referenced as a text string (the .symbol_name field in the jprobe
struct) rather than the address of the symbol itself, depmod never
picks this dependency up, and so if you load the dccp_probe module
without the dccp module loaded, the register_jprobe call fails with an
-EINVAL, and the whole module load fails.
The fix is pretty easy, we can just wrap the register_jprobe call in a
try_then_request_module call, which forces the dependency to get
satisfied prior to the probe registration.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rename kfifo_put... into kfifo_in... to prevent miss use of old non in
kernel-tree drivers
ditto for kfifo_get... -> kfifo_out...
Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc
annotations more readable.
Add mini "howto porting to the new API" in kfifo.h
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the pointer to the spinlock out of struct kfifo. Most users in
tree do not actually use a spinlock, so the few exceptions now have to
call kfifo_{get,put}_locked, which takes an extra argument to a
spinlock.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a new generic kernel FIFO implementation.
The current kernel fifo API is not very widely used, because it has to
many constrains. Only 17 files in the current 2.6.31-rc5 used it.
FIFO's are like list's a very basic thing and a kfifo API which handles
the most use case would save a lot of development time and memory
resources.
I think this are the reasons why kfifo is not in use:
- The API is to simple, important functions are missing
- A fifo can be only allocated dynamically
- There is a requirement of a spinlock whether you need it or not
- There is no support for data records inside a fifo
So I decided to extend the kfifo in a more generic way without blowing up
the API to much. The new API has the following benefits:
- Generic usage: For kernel internal use and/or device driver.
- Provide an API for the most use case.
- Slim API: The whole API provides 25 functions.
- Linux style habit.
- DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO Macros
- Direct copy_to_user from the fifo and copy_from_user into the fifo.
- The kfifo itself is an in place member of the using data structure, this save an
indirection access and does not waste the kernel allocator.
- Lockless access: if only one reader and one writer is active on the fifo,
which is the common use case, no additional locking is necessary.
- Remove spinlock - give the user the freedom of choice what kind of locking to use if
one is required.
- Ability to handle records. Three type of records are supported:
- Variable length records between 0-255 bytes, with a record size
field of 1 bytes.
- Variable length records between 0-65535 bytes, with a record size
field of 2 bytes.
- Fixed size records, which no record size field.
- Preserve memory resource.
- Performance!
- Easy to use!
This patch:
Since most users want to have the kfifo as part of another object,
reorganize the code to allow including struct kfifo in another data
structure. This requires changing the kfifo_alloc and kfifo_init
prototypes so that we pass an existing kfifo pointer into them. This
patch changes the implementation and all existing users.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at same time the new socket is inserted in hash.
This makes sure timewait socket wont be found by a concurrent
writer in __inet_check_established()
Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission. Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.
Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.
Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sys_sysctl is a compatiblity wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
revmoed.
In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and it's callers have been modified not
to pass one.
Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
struct can_proto had a capability field which wasn't ever used. It is
dropped entirely.
struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type = SOCK_RAW.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.
(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.
Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)
This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Might as well use the ipv6_addr_set_v4mapped() inline we created last
year.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This continues the previous patch, by applying the same change to CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a redundancy in the CCID half-connection (hc) naming scheme:
* instead of 'hctx->tx_...', write 'hc->tx_...';
* instead of 'hcrx->rx_...', write 'hc->rx_...';
which works because the 'type' of the half-connection is encoded in the
'rx_' / 'tx_' prefixes.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements the new naming scheme also for CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch starts a less problematic naming convention for CCID structs.
The old naming convention used 'hc{tx,rx}->ccid?hc{tx,rx}->...' as
recurring prefixes, which made the code
* hard to write (not easy to fit into 80 characters);
* hard to read (most of the space is occupied by prefixes).
The new naming scheme:
* struct entries for the TX socket are prefixed by 'tx_';
* and those for the RX socket are prefixed by 'rx_'.
The identifiers then remain distinguishable when grep-ing through the tree:
(a) RX/TX sockets are distinguished by the naming scheme,
(b) individual CCIDs are distinguished by filename (ccid{2,3,4}.{c,h}).
This first patch implements the scheme for CCID-2.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.
Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.
Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove long removed "inet_protocol_base" declaration.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No code change, cosmetical changes only:
* whitespace cleanup via scripts/cleanfile,
* remove self-references to filename at top of files,
* fix coding style (extraneous brackets),
* fix documentation style (kernel-doc-nano-HOWTO).
Thanks are due to Ivo Augusto Calado who raised these issues by
submitting good-quality patches.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function block inet_connect_sock_af_ops contains no data
make it constant.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
String literals are constant, and usually, we can also tag the array
of pointers const too, moving it to the .rodata section.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu counter dccp_orphan_count is init in dccp_init() by
percpu_counter_init() while dccp module is loaded, but the
destroy of it is missing while dccp module is unloaded. We
can get the kernel WARNING about this. Reproduct by the
following commands:
$ modprobe dccp
$ rmmod dccp
$ modprobe dccp
WARNING: at lib/list_debug.c:26 __list_add+0x27/0x5c()
Hardware name: VMware Virtual Platform
list_add corruption. next->prev should be prev (c080c0c4), but was (null). (next
=ca7188cc).
Modules linked in: dccp(+) nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc
Pid: 1956, comm: modprobe Not tainted 2.6.31-rc5 #55
Call Trace:
[<c042f8fa>] warn_slowpath_common+0x6a/0x81
[<c053a6cb>] ? __list_add+0x27/0x5c
[<c042f94f>] warn_slowpath_fmt+0x29/0x2c
[<c053a6cb>] __list_add+0x27/0x5c
[<c053c9b3>] __percpu_counter_init+0x4d/0x5d
[<ca9c90c7>] dccp_init+0x19/0x2ed [dccp]
[<c0401141>] do_one_initcall+0x4f/0x111
[<ca9c90ae>] ? dccp_init+0x0/0x2ed [dccp]
[<c06971b5>] ? notifier_call_chain+0x26/0x48
[<c0444943>] ? __blocking_notifier_call_chain+0x45/0x51
[<c04516f7>] sys_init_module+0xac/0x1bd
[<c04028e4>] sysenter_do_call+0x12/0x22
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DCCP protocol tries to allocate some large hash tables during
initialisation using the largest size possible. This can be larger than
what the page allocator can provide so it prints a warning. However, the
caller is able to handle the situation so this patch suppresses the
warning.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding memory barrier after the poll_wait function, paired with
receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
to wrap the memory barrier.
Without the memory barrier, following race can happen.
The race fires, when following code paths meet, and the tp->rcv_nxt
and __add_wait_queue updates stay in CPU caches.
CPU1 CPU2
sys_select receive packet
... ...
__add_wait_queue update tp->rcv_nxt
... ...
tp->rcv_nxt check sock_def_readable
... {
schedule ...
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep)
...
}
If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposit to each other.
Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
endup calling schedule and sleep forever if there are no more data on the
socket.
Calls to poll_wait in following modules were ommited:
net/bluetooth/af_bluetooth.c
net/irda/af_irda.c
net/irda/irnet/irnet_ppp.c
net/mac80211/rc80211_pid_debugfs.c
net/phonet/socket.c
net/rds/af_rds.c
net/rfkill/core.c
net/sunrpc/cache.c
net/sunrpc/rpc_pipe.c
net/tipc/socket.c
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change all the code that deals directly with ICMPv6 type and code
values to use u8 instead of a signed int as that's the actual data
type.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb
Delete skb->rtable field
Setting rtable is not allowed, just set dst instead as rtable is an alias.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a problem caused by the overlap of the connection-setup and
established-state phases of DCCP connections.
During connection setup, the client retransmits Confirm Feature-Negotiation
options until a response from the server signals that it can move from the
half-established PARTOPEN into the OPEN state, whereupon the connection is
fully established on both ends (RFC 4340, 8.1.5).
However, since the client may already send data while it is in the PARTOPEN
state, consequences arise for the Maximum Packet Size: the problem is that the
initial option overhead is much higher than for the subsequent established
phase, as it involves potentially many variable-length list-type options
(server-priority options, RFC 4340, 6.4).
Applying the standard MPS is insufficient here: especially with larger
payloads this can lead to annoying, counter-intuitive EMSGSIZE errors.
On the other hand, reducing the MPS available for the established phase by
the added initial overhead is highly wasteful and inefficient.
The solution chosen therefore is a two-phase strategy:
If the payload length of the DataAck in PARTOPEN is too large, an Ack is sent
to carry the options, and the feature-negotiation list is then flushed.
This means that the server gets two Acks for one Response. If both Acks get
lost, it is probably better to restart the connection anyway and devising yet
another special-case does not seem worth the extra complexity.
The result is a higher utilisation of the available packet space for the data
transmission phase (established state) of a connection.
The patch (over-)estimates the initial overhead to be 32*4 bytes -- commonly
seen values were around 90 bytes for initial feature-negotiation options.
It uses sizeof(u32) to mean "aligned units of 4 bytes".
For consistency, another use of 4-byte alignment is adapted.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch resolves a long-standing FIXME to dynamically update the Maximum
Packet Size depending on actual options usage.
It uses the flags set by the feature-negotiation infrastructure to compute
the required header option size.
Most options are fixed-size, a notable exception are Ack Vectors (required
currently only by CCID-2). These can have any length between 3 and 1020
bytes. As a result of testing, 16 bytes (2 bytes for type/length plus 14 Ack
Vector cells) have been found to be sufficient for loss-free situations.
There are currently no CCID-specific header options which may appear on data
packets, thus it is not necessary to define a corresponding CCID field as
suggested in the old comment.
Further changes:
----------------
Adjusted the type of 'cur_mps' to match the unsigned return type of the
function.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since all feature-negotiation processing now takes place in feat.c,
functions for producing verbose debugging output are concentrated
there.
New functions to print out values, entry records, and options are
provided, and also a macro is defined to not always have the function
name in the output line.
Thanks a lot to Wei Yongjun and Giuseppe Galeota for help and
discussion with an earlier revision of this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch takes care of initialising and type-checking sysctls
related to feature negotiation. Type checking is important since some
of the sysctls now directly impact the feature-negotiation process.
The sysctls are initialised with the known default values for each
feature. For the type-checking the value constraints from RFC 4340
are used:
* Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
tested and confirmed that it works up to 4294967295 - for Gbps speed;
* Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
* CCIDs are between 0 .. 255;
* request_retries, retries1, retries2 also between 0..255 for good measure;
* tx_qlen is checked to be non-negative;
* sync_ratelimit remains as before.
Notes:
------
1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.
2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in
other places, sometimes with exactly the same kind of definitions (e.g.
"static int zero;"). It may be a good idea (kernel janitors?) to consolidate
type checking. For the sake of keeping the changeset small and in order not
to affect other subsystems, I have not strived to generalise here.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.
Specifically, the following is contained in this patch:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This initialises feature negotiation from two tables, which are in
turn are initialised from sysctls.
As a novel feature, specifics of the implementation (e.g. that short
seqnos and ECN are not yet available) are advertised for robustness.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks to Wei and Arnaldo for pointing out the correct
new reference for CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed the __exit annotation of tfrc_lib_exit(), in order to suppress the following section mismatch messages:
WARNING: net/dccp/dccp.o(.text+0xd9): Section mismatch in reference from the function ccid_cleanup_builtins() to the function .exit.text:tfrc_lib_exit()
The function ccid_cleanup_builtins() references a function in an exit section.
Often the function tfrc_lib_exit() has valid usage outside the exit section
and the fix is to remove the __exit annotation of tfrc_lib_exit.
WARNING: net/dccp/dccp.o(.init.text+0x48): Section mismatch in reference from the function ccid_initialize_builtins() to the function .exit.text:tfrc_lib_exit()
The function __init ccid_initialize_builtins() references
a function __exit tfrc_lib_exit().
This is often seen when error handling in the init function
uses functionality in the exit path.
The fix is often to remove the __exit annotation of
tfrc_lib_exit() so it may be used outside an exit section.
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch integrates the TFRC library, which is a dependency of CCID-3 (and
CCID-4), with the new use of CCIDs in the DCCP module.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch cleans up after integrating the CCID modules and, in addition,
* moves the if/else cases from ccid_delete() into ccid_hc_{tx,rx}_delete();
* removes the 'gfp' argument to ccid_new() - since it is always gfp_any().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on Arnaldo's earlier patch, this patch integrates the standardised
CCID congestion control plugins (CCID-2 and CCID-3) of DCCP with dccp.ko:
* enables a faster connection path by eliminating the need to always go
through the CCID registration lock;
* updates the implementation to use only a single array whose size equals
the number of configured CCIDs instead of the maximum (256);
* since the CCIDs are now fixed array elements, synchronization is no
longer needed, simplifying use and implementation.
CCID-2 is suggested as minimum for a basic DCCP implementation (RFC 4340, 10);
CCID-3 is a standards-track CCID supported by RFC 4342 and RFC 5348.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we converted the protocol atomic counters such as the orphan
count and the total socket count deadlocks were introduced due to
the mismatch in BH status of the spots that used the percpu counter
operations.
Based on the diagnosis and patch by Peter Zijlstra, this patch
fixes these issues by disabling BH where we may be in process
context.
Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
And thus when we try to use 'ss -danemi' on these sockets that have no
ccid blocks (data collected using systemtap after I fixed the problem):
dccp_diag_get_info sk=0xffff8801220a3100, dp->dccps_hc_rx_ccid=0x0000000000000000, dp->dccps_hc_tx_ccid=0x0000000000000000
We get an OOPS:
mica.ghostprotocols.net login: BUG: unable to handle kernel NULL pointer
dereferenc0
IP: [<ffffffffa0136082>] dccp_diag_get_info+0x82/0xc0 [dccp_diag]
PGD 12106f067 PUD 122488067 PMD 0
Oops: 0000 [#1] PREEMPT
Fix is trivial, and 'ss -d' is working again:
[root@mica ~]# ss -danemi
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 0 *:5001 *:*
ino:7288 sk:220a3100ffff8801
mem:(r0,w0,f0,t0) cwnd:0 ssthresh:0
[root@mica ~]#
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, as it now is handled fully dynamically via feature negotiation
(i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
RFC 4341, 4.).
Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type
if (dccp_msk(sk)->dccpms_send_ack_vector)
/* ... */
with
if (dp->dccps_hc_rx_ackvec != NULL)
/* ... */
The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
negotiation concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.
The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.
Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it can not be used for anything useful - to detect loss for instance, only
packets received after the loss can serve as pseudo-dupacks.
There was a FIXME to change the error code when dccp_ackvec_add() fails.
I removed this after finding out that:
* the check whether ackno < ISN is already made earlier,
* this Response is likely the 1st packet with an Ackno that the client gets,
* so when dccp_ackvec_add() fails, the reason is likely not a packet error.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updating the NDP count feature is handled automatically now:
* for CCID-2 it is disabled, since the code does not use NDP counts;
* for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.
Allowing the user to change NDP values leads to unpredictable and failing
behaviour, since it is then possible to disable NDP counts even when they
are needed (e.g. in CCID-3).
This means that only those user settings are sensible that agree with the
values for Send NDP Count implied by the choice of CCID. But those settings
are already activated by the feature negotiation (CCID dependency tracking),
hence this form of support is redundant.
At startup the initialisation of the NDP count feature uses the default
value of 0, which is done implicitly by the zeroing-out of the socket when
it is allocated. If the choice of CCID or feature negotiation enables NDP
count, this will then be updated via the NDP activation handler.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
case, their value equals initially that of the sysctl, but at the end of
feature negotiation may be something different.
The old interface removed by this patch thus has been replaced by the newer
interface to dynamically query the currently loaded CCIDs.
Also removed are the constructors for the TX CCID and the RX CCID, since the
switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler
is the only place in the code where CCIDs are loaded).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code removed by this patch is no longer referenced or used, the added
lines update documentation and copyrights.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This integrates feature-activation in the client:
1. When dccp_parse_options() fails, the reset code is already set; request_sent\
_state_process() currently overrides this with `Packet Error', which is not
intended - changed to use the reset code supplied by dccp_parse_options().
2. When feature negotiation fails, the socket should be marked as not usable,
so that the application is notified that an error occurred. This is achieved
by a new label 'unable_to_proceed': generating an error code of `Aborted',
setting the socket state to CLOSED, returning with ECOMM in sk_err.
3. Avoids parsing the Ack twice in Respond state by not doing option processing
again in dccp_rcv_respond_partopen_state_process (as option processing has
already been done on the request_sock in dccp_check_req).
Since this addresses congestion-control initialisation, a corresponding
FIXME has been removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch integrates the activation of features at the end of negotiation
into the server-side code.
Note regarding the removal of 'const':
--------------------------------------
The 'const' attribute has been removed from 'dreq' since dccp_activate_values()
needs to operate on dreq's feature list. Part of the activation is to remove
those options from the list that have already been confirmed, hence it is not
purely read-only.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This first patch out of three replaces the hardcoded default settings with
initialisation code for the dynamic feature negotiation.
The patch also ensures that the client feature-negotiation queue is flushed
only when entering the OPEN state.
Since confirmed Change options are removed as soon as they are confirmed
(in the DCCP-Response), this ensures that Confirm options are retransmitted.
Note on retransmitting Confirm options:
---------------------------------------
Implementation experience showed that it is necessary to retransmit Confirm
options. Thanks to Leandro Melo de Sales who reported a bug in an earlier
revision of the patch set, resulting from not retransmitting these options.
As long as the client is in PARTOPEN, it needs to retransmit the Confirm
options for the Change options received on the DCCP-Response from the server.
Otherwise, if the packet containing the Confirm options gets dropped in the
network, the connection aborts due to undefined feature negotiation state.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides the post-processing of feature negotiation state, after
the negotiation has completed.
To this purpose, handlers are used and added to the dccp_feat_table. Each
handler is passed a boolean flag whether the RX or TX side of the feature
is meant.
Several handlers are provided already, new handlers can easily be added.
The initialisation is now fully dynamic, i.e. CCIDs are activated only
after the feature negotiation. The integration of this dynamic activation
is done in the subsequent patches.
Thanks to Wei Yongjun for pointing out the necessity of skipping over empty
Confirm options while copying the negotiated feature values.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Analogous to the previous patch, this adds code to interpret incoming Confirm
feature-negotiation options. Both functions operate on the feature-negotiation
list of either the request_sock (server) or the dccp_sock (client).
Thanks to Wei Yongjun for pointing out that it is overly restrictive to check
the entire list of confirmed SP values.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds/replaces code for processing incoming ChangeL/R options.
The main difference is that:
* mandatory FN options are now interpreted inside the function
(there are too many individual cases to do this externally);
* the function returns an appropriate Reset code or 0,
which is then used to fill in the data for the Reset packet.
Old code, which is no longer used or referenced, has been removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides two functions to
* reconcile preference lists (with appropriate return codes) and
* reorder the preference list if successful reconciliation changed the
preferred value.
The patch also removes the old code for processing SP/NN Change options, since
new code to process these is mostly there already; related references have been
commented out.
The code for processing Change options follows in the next patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch implements insertion of feature negotiation at the server (listening
and request socket) and the client (connecting socket).
In dccp_insert_options(), several statements have been grouped together now
to achieve (it is hoped) better efficiency by reducing the number of tests
each packet has to go through:
- Ack Vectors are sent if the packet is neither a Data or a Request packet;
- a previous issue is corrected - feature negotiation options are allowed
on DataAck packets (5.8).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces the earlier insertion routine from options.c, so that
code specific to feature negotiation can remain in feat.c. This is possible
by calling a function already existing in options.c.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using one atomic_t per protocol, use a percpu_counter
for "orphan_count", to reduce cache line contention on
heavy duty network servers.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns
to flow_cache_lookup() and resolver callback.
Take it from socket or netdevice. Stub DECnet to init_net.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this warning:
net/dccp/options.c: In function ‘dccp_parse_options’:
net/dccp/options.c:67: warning: ‘value’ may be used uninitialized in this function
is a bogus GCC warning. The compiler does not recognize the relation
between "value" and "mandatory" variables: the code flow can ever reach
the "out_invalid_option:" label if 'mandatory' is set to 1, and when
'mandatory' is non-zero, we'll always have 'value' initialized.
Help out the compiler by annotating the variable.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch extends existing code:
* Confirm options divide into the confirmed value plus an optional preference
list for SP values. Previously only the preference list was echoed for SP
values, now the confirmed value is added as per RFC 4340, 6.1;
* length and sanity checks are added to avoid illegal memory (or NULL) access.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for Mandatory options is provided by this patch, which will
be used by subsequent feature-negotiation patches.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends the scope of two available functions,
encode|decode_value_var, to work up to 6 (8) bytes, to match maximum
requirements in the RFC.
These functions are going to be used both by general option processing
and feature negotiation code, hence declarations have been put into
feat.h.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides function to query the current TX/RX CCID dynamically,
without reliance on the minisock value, using dynamic information
available in the currently loaded CCID module.
This query function is then used to
(a) provide the getsockopt part for getting/setting CCIDs via sockopts;
(b) replace the current test for "which CCID is in use" in probe.c.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, TX/RX CCIDs can now be changed on a per-connection
basis, which overrides the defaults set by the global sysctl variables
for TX/RX CCIDs.
To make full use of this facility, the remaining patches of this patch
set are needed, which track dependencies and activate negotiated
feature values.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch prepares RCU migration of listening_hash table for
TCP/DCCP protocols.
listening_hash table being small (32 slots per protocol), we add
a spinlock for each slot, instead of a single rwlock for whole table.
This should reduce hold time of readers, and writers concurrency.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This splits the setsockopt calls into two groups, depending on whether an
integer argument (val) is required and whether routines being called do
their own locking.
Some options (such as setting the CCID) use u8 rather than int, so that for
these the test with regard to integer-sizeof can not be used.
The second switch-case statement now only has those statements which need
locking and which make use of `val'.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it would work in CCID-2, there is no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since waiting for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.
The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.
With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.
Thanks to Tomasz Grobelny for discussion leading up to this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides feature negotiation for server minimum checksum coverage
which so far has been missing.
Since sender/receiver coverage values range only from 0...15, their
type has also been reduced in size from u16 to u4.
Feature-negotiation options are now generated for both sender and receiver
coverage, i.e. when the peer has `forgotten' to enable partial coverage
then feature negotiation will automatically enable (negotiate) the partial
coverage value for this connection.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous setsockopt interface, which passed socket options via struct
dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to
ugly code since the old approach did not distinguish between NN and SP values.
This patch removes the old setsockopt interface and replaces it with two new
functions to register NN/SP values for feature negotiation.
These are essentially wrappers around the internal __feat_register functions,
with checking added to avoid
* wrong usage (type);
* changing values while the connection is in progress.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a hook to resolve features whose value depends on the choice of
CCID. It is done at the server since it can only be done after the CCID
values have been negotiated; i.e. the client will add its CCID preference
list on the Change options sent in the Request, which will be reconciled
with the local preference list of the server.
The concept is documented on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
implementation_notes.html#ccid_dependencies
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.
This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.
Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.
__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)
Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides a missing link in the code chain, as several features implicitly
depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector
feature, but also Ack Ratio and Send Loss Event Rate (also taken care of).
For Send Ack Vector, the situation is as follows:
* since CCID2 mandates the use of Ack Vectors, there is no point in allowing
endpoints which use CCID2 to disable Ack Vector features such a connection;
* a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer
with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4);
* for all other CCIDs, the use of (Send) Ack Vector is optional and thus
negotiable. However, this implies that the code negotiating the use of Ack
Vectors also supports it (i.e. is able to supply and to either parse or
ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack
Vector support), the use of Ack Vectors is here disabled, with a comment
in the source code.
An analogous consideration arises for the Send Loss Event Rate feature,
since the CCID-3 implementation does not support the loss interval options
of RFC 4342. To make such use explicit, corresponding feature-negotiation
options are inserted which signal the use of the loss event rate option,
as it is used by the CCID3 code.
Lastly, the values of the Ack Ratio feature are matched to the choice of CCID.
The patch implements this as a function which is called after the user has
made all other registrations for changing default values of features.
The table is variable-length, the reserved (and hence for feature-negotiation
invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0'
is used to mark the end of the table.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides a data structure to record which CCIDs are locally supported
and three accessor functions:
- a test function for internal use which is used to validate CCID requests
made by the user;
- a copy function so that the list can be used for feature-negotiation;
- documented getsockopt() support so that the user can query capabilities.
The data structure is a table which is filled in at compile-time with the
list of available CCIDs (which in turn depends on the Kconfig choices).
Using the copy function for cloning the list of supported CCIDs is useful for
feature negotiation, since the negotiation is now with the full list of available
CCIDs (e.g. {2, 3}) instead of the default value {2}. This means negotiation
will not fail if the peer requests to use CCID3 instead of CCID2.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two registration routines, for SP and NN features, are provided by this patch,
replacing a previous routine which was used for both feature types.
These are internal-only routines and therefore start with `__feat_register'.
It further exports the known limits of Sequence Window and Ack Ratio as symbolic
constants.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch limits feature (capability) negotation to the connection setup phase:
1. Although it is theoretically possible to perform feature negotiation at any
time (and RFC 4340 supports this), in practice this is prohibitively complex,
as it requires to put traffic on hold for each new negotiation.
2. As a byproduct of restricting feature negotiation to connection setup, the
feature-negotiation retransmit timer is no longer required. This part is now
mapped onto the protocol-level retransmission.
Details indicating why timers are no longer needed can be found on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
implementation_notes.html
This patch disables anytime negotiation, subsequent patches work out full
feature negotiation support for connection setup.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This inserts the required de-allocation routines for memory allocated
by feature negotiation in the socket destructors, replacing
dccp_feat_clean() in one instance.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides feature-negotiation initialisation for both DCCP sockets
and DCCP request_sockets, to support feature negotiation during
connection setup.
It also resolves a FIXME regarding the congestion control
initialisation.
Thanks to Wei Yongjun for help with the IPv6 side of this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds list initial fields and list management functions for the
new feature negotiation implementation.
Thanks to Arnaldo for suggestions and improvements.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
A lookup table for feature-negotiation information, extracted from RFC
4340/42, is provided by this patch. All currently known features can
be found in this table, along with their feature location, their
default value, and type.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch prepares for the new and extended feature-negotiation
routines.
The following feature-negotiation data structures are provided:
* a container for the various (SP or NN) values,
* symbolic state names to track feature states,
* an entry struct which holds all current information together,
* elementary functions to fill in and process these structures.
Entry structs are arranged as FIFO for the following reason: RFC 4340
specifies that if multiple options of the same type are present, they
are processed in the order of their appearance in the packet; which
means that this order needs to be preserved in the local data
structure (the later insertion code also respects this order).
The struct list_head has been chosen for the following reasons: the most
frequent operations are
* add new entry at tail (when receiving Change or setting socket
options);
* delete entry (when Confirm has been received);
* deep copy of entire list (cloning from listening socket onto
request socket).
The NN value has been set to 64 bit, which is a currently sufficient
upper limit (Sequence Window feature has 48 bit).
Thanks to Arnaldo, who contributed the streamlined layout of the entry
struct.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a3116ac5c2 from 1st October ("tcp: Port
redirection support for TCP") broke DCCP skb lookup by changing inet_csk_clone,
which is used by DCCP to generate the child socket after the handshake.
This patch updates DCCP to use 'loc_port' instead of 'sport', which fixes the
problem, and thus inheriting port redirection support via the new interface.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements [RFC 3448, 4.5], which performs congestion avoidance behaviour
by reducing the transmit rate as the queueing delay (measured in terms of
long-term RTT) increases.
Oscillation can be turned on/off via a module option (do_osc_prev) and via sysfs
(using mode 0644), the default is off.
Overflow analysis:
------------------
* oscillation prevention is done after update_x(), so that t_ipi <= 64000;
* hence the multiplication "t_ipi * sqrt(R_sample)" needs 64 bits;
* done using u64 for sqrt_sample and explicit typecast of t_ipi;
* the divisor, R_sqmean, is non-zero because oscillation prevention is first
called when receiving the second feedback packet, and tfrc_scaled_rtt() > 0.
A detailed discussion of the algorithm (with plots) is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid3/sender_notes/oscillation_prevention/
The algorithm has negative side effects:
* when allowing to decrease t_ipi (leads to a large RTT) and
* when using it during slow-start;
both uses are therefore disabled.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch simplifies the computation of t_ipi, avoiding expensive computations
to enforce the minimum sending rate.
Both RFC 3448 and rfc3448bis (revision #06), as well as RFC 4342 sec 5., require
at various stages that at least one packet must be sent per t_mbi = 64 seconds.
This requires frequent divisions of the type X_min = s/t_mbi, which are later
converted back into an inter-packet-interval t_ipi_max = s/X_min = t_mbi.
The patch removes the expensive indirection; in the unlikely case of having
a sending rate less than one packet per 64 seconds, it also re-adjusts X.
The following cases document conformance with RFC 3448 / rfc3448bis-06:
1) Time until receiving the first feedback packet:
* if the sender has no initial RTT sample then X = s/1 Bps > s/t_mbi;
* if the sender has an initial RTT sample or when the first feedback
packet is received, X = W_init/R > s/t_mbi.
2) Slow-start (p == 0 and feedback packets come in):
* RFC 3448 (current code) enforces a minimum of s/R > s/t_mbi;
* rfc3448bis (future code) enforces an even higher minimum of W_init/R.
3) Congestion avoidance with no absence of feedback (p > 0):
* when X_calc or X_recv/2 are too low, the minimum of X_min = s/t_mbi
is enforced in update_x() when calling update_send_interval();
* update_send_interval() is, as before, only called when X changes
(i.e. either when increasing or decreasing, not when in equilibrium).
4) Reduction of X without prior feedback or during slow-start (p==0):
* both RFC 3448 and rfc3448bis here halve X directly;
* the associated constraint X >= s/t_mbi is nforced here by send_interval().
5) Reduction of X when p > 0:
* X is modified indirectly via X_recv (RFC 3448) or X_recv_set (rfc3448bis);
* in both cases, control goes back to section 4.3 (in both documents);
* since p > 0, both documents use X = max(min(...), s/t_mbi), which is
enforced in this patch by calling send_interval() from update_x().
I think that this analysis is exhaustive. Should I have forgotten a case,
the worst-case consideration arises when X sinks below s/t_mbi, and is then
increased back up to this minimum value. Even under this assumption, the
behaviour is correct, since all lower limits of X in RFC 3448 / rfc3448bis
are either equal to or greater than s/t_mbi.
Note on the condition X >= s/t_mbi <==> t_ipi = s/X <= t_mbi: since X is
scaled by 64, and all time units are in microseconds, the coded condition is:
t_ipi = s * 64 * 10^6 usec / X <= 64 * 10^6 usec
This simplifies to s / X <= 1 second <==> X * 1 second >= s > 0.
(A zero `s' is not allowed by the CCID-3 code).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
rfc3448bis allows three different ways of tracking the packet size `s':
1. using the MSS/MPS (at initialisation, 4.2, and in 4.1 (1));
2. using the average of `s' (in 4.1);
3. using the maximum of `s' (in 4.2).
Instead of hard-coding a single interpretation of rfc3448bis, this implements
a choice of all three alternatives and suggests the first as default, since it
is the option which is most consistent with other parts of the specification.
The patch further deprecates the update of t_ipi whenever `s' changes. The
gains of doing this are only small since a change of s takes effect at the
next instant X is updated:
* when the next feedback comes in (within one RTT or less);
* when the nofeedback timer expires (within at most 4 RTTs).
Further, there are complications caused by updating t_ipi whenever s changes:
* if t_ipi had previously been updated to effect oscillation prevention (4.5),
then it is impossible to make the same adjustment to t_ipi again, thus
counter-acting the algorithm;
* s may be updated any time and a modification of t_ipi depends on the current
state (e.g. no oscillation prevention is done in the absence of feedback);
* in rev-06 of rfc3448bis, there are more possible cases, depending on whether
the sender is in slow-start (t_ipi <= R/W_init), or in congestion-avoidance,
limited by X_recv or the throughput equation (t_ipi <= t_mbi).
Thus there are side effects of always updating t_ipi as s changes. These may not
be desirable. The only case I can think of where such an update makes sense is
to recompute X_calc when p > 0 and when s changes (not done by this patch).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The per-CCID menu has several dependencies on EXPERIMENTAL. These are redundant,
since net/dccp/ccids/Kconfig is sourced by net/dccp/Kconfig and since the
latter menu in turn asserts a dependency on EXPERIMENTAL.
The patch removes the redundant dependencies as well as the repeated reference
within the sub-menu.
Further changes:
----------------
Two single dependencies on CCID-3 are replaced with a single enclosing `if'.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The patch updates CCID-3 with regard to the latest rfc3448bis-06:
* in the first revisions of the draft, MSS was used for the RFC 3390 window;
* then (from revision #1 to revision #2), it used the packet size `s';
* now, in this revision (and apparently final), the value is back to MSS.
This change has an implication for the case when no RTT sample is available,
at the time of sending the first packet:
* with RTT sample, 2*MSS/RTT <= initial_rate <= 4*MSS/RTT;
* without RTT sample, the initial rate is one packet (s bytes) per second
(sec. 4.2), but using s instead of MSS here creates an imbalance, since
this would further reduce the initial sending rate.
Hence the patch uses MSS (called MPS in RFC 4340) in all places.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch is a requirement for enabling ECN support later on. With that change
in mind, the following preparations are done:
* renamed handle_loss() into congestion_event() since it returns true when a
congestion event happens (it will eventually also take care of ECN packets);
* lets tfrc_rx_congestion_event() always update the RX history records, since
this routine needs to be called for each non-duplicate packet anyway;
* made all involved boolean-type functions to have return type `bool';
Updating the RX history records is now only necessary for the packets received
up to sending the first feedback. The receiver code becomes again simpler.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates the computation of X_recv with regard to Errata 610/611 for
RFC 4342 and draft rfc3448bis-06, ensuring that at least an interval of 1
RTT is used to compute X_recv. The change is wrapped into a new function
ccid3_hc_rx_x_recv().
Further changes:
----------------
* feedback is not sent when no data packets arrived (bytes_recv == 0), as per
rfc3448bis-06, 6.2;
* take the timestamp for the feedback /after/ dccp_send_ack() returns, to avoid
taking the transmission time into account (in case layer-2 is busy);
* clearer handling of failure in ccid3_first_li().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This improves the receiver RTT sampling algorithm so that it tries harder to get
as many RTT samples as possible.
The algorithm is based the concepts presented in RFC 4340, 8.1, using timestamps
and the CCVal window counter. There exist 4 cases for the CCVal difference:
* == 0: less than RTT/4 passed since last packet -- unusable;
* > 4: (much) more than 1 RTT has passed since last packet -- also unusable;
* == 4: perfect sample (exactly one RTT has passed since last packet);
* 1..3: sub-optimal sample (between RTT/4 and 3*RTT/4 has passed).
In the last case the algorithm tried to optimise by storing away the candidate
and then re-trying next time. The problem is that
* a large number of samples is needed to smooth out the inaccuracies of the
algorithm;
* the sender may not be sending enough packets to warrant a "next time";
* hence it is better to use suboptimal samples whenever possible.
The algorithm now stores away the current sample only if the difference is 0.
Applicability and background
----------------------------
A realistic example is MP3 streaming where packets are sent at a rate of less
than one packet per RTT, which means that suitable samples are absent for a
very long time.
The effectiveness of using suboptimal samples (with a delta between 1 and 4) was
confirmed by instrumenting the algorithm with counters. The results of two 20
second test runs were:
* With the old algorithm and a total of 38442 function calls, only 394 of these
calls resulted in usable RTT samples (about 1%), and 378 out of these were
"perfect" samples and 28013 (unused) samples had a delta of 1..3.
* With the new algorithm and a total of 37057 function calls, 1702 usable RTT
samples were retrieved (about 4.6%), 5 out of these were "perfect" samples.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This extracts the clamping part of dccp_sample_rtt() and makes it available
to other parts of the code (as e.g. used in the next patch).
Note: The function dccp_sample_rtt() now reduces to subtracting the elapsed
time. This could be eliminated but would require shorter prefixes and thus
is not done by this patch - maybe an idea for later.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates the CCID-3 receiver in part with regard to errata 610 and 611
(http://www.rfc-editor.org/errata_list.php), which change RFC 4342 to use the
Receive Rate as specified in rfc3448bis, requiring to constantly sample the
RTT (or use a sender RTT).
Doing this requires reusing the RX history structure after dealing with a loss.
The patch does not resolve how to compute X_recv if the interval is less
than 1 RTT. A FIXME has been added (and is resolved in subsequent patch).
Furthermore, since this is all TFRC-based functionality, the RTT estimation
is now also performed by the dccp_tfrc_lib module. This further simplifies
the CCID-3 code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The only state information that the CCID-3 receiver keeps is whether initial
feedback has been sent or not. Further, this overlaps with use of feedback:
* state == TFRC_RSTATE_NO_DATA as long as no feedback has been sent;
* state == TFRC_RSTATE_DATA as soon as the first feedback has been sent.
This patch reduces the duplication, by memorising the type of the last feedback.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This migrates more TFRC-related code into the dccp_tfrc_lib:
* sampling of the packet size `s' (which is only needed until the first
loss interval is computed (ccid3_first_li));
* updating the byte-counter `bytes_recvd' in between sending feedbacks.
The result is a better separation of CCID-3 specific and TFRC specific
code, which aids future integration with ECN and e.g. CCID-4.
Further changes:
----------------
* replaced magic number of 536 with equivalent constant TCP_MIN_RCVMSS;
(this constant is also used when no estimate for `s' is available).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This changes the return type of tfrc_lh_update_i_mean() to void, since that
function returns always `false'. This is due to
len = dccp_delta_seqno(cur->li_seqno, DCCP_SKB_CB(skb)->dccpd_seq) + 1;
if (len - (s64)cur->li_length <= 0) /* duplicate or reordered */
return 0;
which means that update_i_mean can only increase the length of the open loss
interval I_0, and hence the value of I_tot0 (RFC 3448, 5.4). Consequently the
test `i_mean < old_i_mean' at the end of the function always evaluates to false.
There is no known way by which a loss interval can suddenly become shorter,
therefore the return type of the function is changed to void. (That is, under
the given circumstances step (3) in RFC 3448, 6.1 will not occur.)
Further changes:
----------------
* the function is now called from tfrc_rx_handle_loss, which is equivalent
to the previous way of calling from rx_packet_recv (it was called whenever
there was no new or pending loss, now it is also updated when there is
a pending loss - this increases the accuracy a bit);
* added a FIXME to possibly consider NDP counting as per RFC 4342 (this is
not implemented yet).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This enables the TFRC code to begin loss detection (as soon as the module
is loaded), using the latest updates from rfc3448bis-06, 6.3.1:
* when the first data packet(s) are lost or marked, set
* X_target = s/(2*R) => f(p) = s/(R * X_target) = 2,
* corresponding to a loss rate of ~ 20.64%.
The handle_loss() function is now called right at the begin of rx_packet_recv()
and thus no longer protected against duplicates: hence a call to rx_duplicate()
has been added. Such a call makes sense now, as the previous patch initialises
the first entry with a sequence number of GSR.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch
1) separates history allocation and initialisation, to facilitate early
loss detection (implemented by a subsequent patch);
2) removes duplication by using the existing tfrc_rx_hist_purge() if the
allocation fails. This is now possible, since the initialisation routine
3) zeroes out the entire history before using it.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
In the congestion-avoidance phase a decay of p towards 0 is natural once fewer
losses are encountered. Hence the warning message "p is below resolution" is
not necessary, and thus turned into a debug message by this patch.
The TFRC_SMALLEST_P is needed since in theory p never actually reaches 0. When
no further losses are encountered, the loss interval I_0 grows in length,
causing p to decrease towards 0, causing X_calc = s/(RTT * f(p)) to increase.
With the given minimum-resolution this congestion avoidance phase stops at some
fixed value, an approximation formula has been added to the documentation.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Since CCIDs are only used during the established phase of a connection,
they have very little internal state; this specifically reduces to:
* "no packet sent" if and only if s == 0, for the TX packet size s;
* when the first packet has been sent (i.e. `s' > 0), the question is whether
or not feedback has been received:
- if a feedback packet is received, "feedback = yes" is set,
- if the nofeedback timer expires, "feedback = no" is set.
Thus the CCID only needs to remember state about whether or not feedback
has been received. This is now implemented using a boolean flag, which is
toggled when a feedback packet arrives or the nofeedback timer expires.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The DCCP base time resolution is 10 microseconds (RFC 4340, 13.1 ... 13.3).
Using a timer with a lower resolution was found to trigger the following
bug warnings/problems on high-speed networks (e.g. local loopback):
* RTT samples are rounded down to 0 if below resolution;
* in some cases, negative RTT samples were observed;
* the CCID-3 feedback timer complains that the feedback interval is 0,
since the feedback interval is in the order of 1 RTT or less and RTT
measurement rounded this down to 0;
On an Intel computer this will for instance happen when using a
boot-time parameter of "clocksource=jiffies".
The following system log messages were observed:
11:24:00 kernel: BUG: delta (0) <= 0 at ccid3_hc_rx_send_feedback()
11:26:12 kernel: BUG: delta (0) <= 0 at ccid3_hc_rx_send_feedback()
11:26:30 kernel: dccp_sample_rtt: unusable RTT sample 0, using min
11:26:30 last message repeated 5 times
This patch defines a global constant for the time resolution, adds this in
timer.c, and checks the available clock resolution at CCID-3 module load time.
When the resolution is worse than 10 microseconds, module loading exits with
a message "socket type not supported".
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Ensure that cmsg->cmsg_type value is valid for qpolicy
that is currently in use.
Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch adds a generic infrastructure for policy-based dequeueing of
TX packets and provides two policies:
* a simple FIFO policy (which is the default) and
* a priority based policy (set via socket options).
Both policies honour the tx_qlen sysctl for the maximum size of the write
queue (can be overridden via socket options).
The priority policy uses skb->priority internally to assign an u32 priority
identifier, using the same ranking as SO_PRIORITY. The skb->priority field
is set to 0 when the packet leaves DCCP. The priority is supplied as ancillary
data using cmsg(3), the patch also provides the requisite parsing routines.
Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch rearranges the order of statements of the slow-path input processing
(i.e. any other state than OPEN), to resolve the following issues.
1. Dependencies: the order of statements now better matches RFC 4340, 8.5, i.e.
step 7 is before step 9 (previously 9 was before 7), and parsing options in
step 8 (which can consume resources) now comes after step 7.
2. Bug-fix: in state CLOSED, there should not be any sequence number checking
or option processing. This is why the test for CLOSED has been moved after
the test for LISTEN.
3. As before sequence number checks are omitted if in state LISTEN/REQUEST, due
to the note underneath the table in RFC 4340, 7.5.3.
4. Packets are now passed on to Ack Vector / CCID processing only after
- step 7 (receive unexpected packets),
- step 9 (receive Reset),
- step 13 (receive CloseReq),
- step 14 (receive Close)
and only if the state is PARTOPEN. This simplifies CCID processing:
- in LISTEN/CLOSED the CCIDs are non-existent;
- in RESPOND/REQUEST the CCIDs have not yet been negotiated;
- in CLOSEREQ and active-CLOSING the node has already closed this socket;
- in passive-CLOSING the client is waiting for its Reset.
In the last case, RFC 4340, 8.3 leaves it open to ignore further incoming
data, which is the approach taken here.
As a result of (3), CCID processing is now indeed confined to OPEN/PARTOPEN
states, i.e. congestion control is performed only on the flow of data packets.
This avoids pathological cases of doing congestion control on those messages
which set up and terminate the connection.
I have done a few checks to see if this creates a problem in other parts of
the code. This seems not to be the case; even if there were one, it would be
better to fix it than to perform congestion control on Close/Request/Response
messages. Similarly for Ack Vectors (as they depend on the negotiated CCID).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch consolidates the code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Realising the following call pattern,
* first dccp_entail() is called to enqueue a new skb and
* then skb_clone() is called to transmit a clone of that skb,
this patch integrates both interrelated steps into dccp_entail().
Note: the return value of skb_clone is not checked. It may be an idea to add a
warning if this occurs. In both instances, however, a timer is set for
retransmission, so that cloning is re-tried via dccp_retransmit_skb().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the wrappers around the sk timer functions as it makes the code
clearer and not much is gained from using wrappers: the BUG_ON in
start_rto_timer will never trigger since that function was called only when
* the RTO timer expired (rto_expire, and then timer_pending() is false);
* in tx_packet_sent only if !timer_pending() (BUG_ON is redundant here);
* previously in new_ack, after stopping the timer (timer_pending() false).
One further motive behind this patch is to replace the RTO timer with the
icsk retransmission timer, as it is already part of the DCCP socket.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The current CCID-2 RTT estimator code is in parts broken and lags behind the
suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR.
That code is replaced by the present patch, which reuses the Linux TCP RTT
estimator code - reasons for this code duplication are given below.
Further details:
----------------
1. The minimum RTO of previously one second has been replaced with TCP's, since
RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4)
is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's
concept of a default RTT (RFC 4340, 3.4).
2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with
RFC2988, (2.5).
3. De-inlined the function ccid2_new_ack().
4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will
give the wrong estimate. It should be replaced with one sample per Ack.
However, at the moment this can not be resolved easily, since
- it depends on TX history code (which also needs some work),
- the cleanest solution is not to use the `sent' time at all (saves 4 bytes
per entry) and use DCCP timestamps / elapsed time to estimated the RTT,
which however is non-trivial to get right (but needs to be done).
Reasons for reusing the Linux TCP estimator algorithm:
------------------------------------------------------
Some time was spent to find a better alternative, using basic RFC2988 as a first
step. Further analysis and experimentation showed that the Linux TCP RTO
estimator is superior to a basic RFC2988 implementation. A summary is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/
In addition, this estimator fared well in a recent empirical evaluation:
Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith.
A Performance Study of Loss Detection/Recovery in Real-world TCP
Implementations. Proceedings of 15th IEEE International
Conference on Network Protocols (ICNP-07). 2007.
Thus there is significant benefit in reusing the existing TCP code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the dec_pipe function and improves the way the RTO timer is rearmed
when a new acknowledgment comes in.
Details and justification for removal:
--------------------------------------
1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX
history entries between tail and head, for which it had previously been
incremented in tx_packet_sent; and it is not decremented twice for the same
entry, since it is
- either decremented when a corresponding Ack Vector cell in state 0 or 1
was received (and then ccid2s_acked==1),
- or it is decremented when ccid2s_acked==0, as part of the loss detection
in tx_packet_recv (and hence it can not have been decremented earlier).
2) Restarting the RTO timer happens for every single entry in each Ack Vector
parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to
16192 times per Ack Vector).
3) The RTO timer should not be restarted when all outstanding data has been
acknowledged. This is currently done similar to (2), in dec_pipe, when
pipe has reached 0.
The patch onsolidates the code which rearms the RTO timer, combining the
segments from new_ack and dec_pipe. As a result, the code becomes clearer
(compare with tcp_rearm_rto()).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the ccid2_hc_tx_check_sanity function: it is redundant.
Details:
========
The tx_check_sanity function performs three tests:
1) it checks that the circular TX list is sorted
- in ascending order of sequence number (ccid2s_seq)
- and time (ccid2s_sent),
- in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh);
2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN;
3) it ensures that pipe equals the number of packets that were not
marked `acked' (ccid2s_acked) between `tail' and `head'.
The following argues that each of these tests is redundant, this can be verified
by going through the code.
(1) is not necessary, since both time and GSS increase from one packet to the
next, so that subsequent insertions in tx_packet_sent (which advance the `head'
pointer) will be in ascending order of time and sequence number.
In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN
(set to 1024) unless allocation caused an earlier failure, because:
* at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1;
* subsequent calls to tx_alloc_seq take place whenever head->next == tail in
tx_packet_sent; then a new chunk of size 1024 is inserted between head and
tail, and seqbufc is incremented by one.
To show that (3) is redundant requires looking at two cases.
The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and
decremented in tx_packet_recv. When head == tail (TX history empty) then pipe
should be 0, which is the case directly after initialisation and after a
retransmission timeout has occurred (ccid2_hc_tx_rto_expire).
The first case involves parsing Ack Vectors for packets recorded in the live
portion of the buffer, between tail and head. For each packet marked by the
receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by
one, so for all such packets the BUG_ON in tx_check_sanity will not trigger.
The second case is the loss detection in the second half of tx_packet_recv,
below the comment "Check for NUMDUPACK".
The first while-loop here ensures that the sequence number of `seqp' is either
above or equal to `high_ack', or otherwise equal to the highest sequence number
sent so far (of the entry head->prev, as head points to the next unsent entry).
The next while-loop ("while (1)") counts the number of acked packets starting
from that position of seqp, going backwards in the direction from head->prev to
tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points
to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block
is entered next.
The while-loop contained within that block in turn traverses the list backwards,
from head to tail; the position of `seqp' is saved in the variable `last_acked'.
For each packet not marked as `acked', a congestion event is triggered within
the loop, and pipe is decremented. The loop terminates when `seqp' has reached
`tail', whereupon tail is set to the position previously stored in `last_acked'.
Thus, between `last_acked' and the previous position of `tail',
- pipe has been decremented earlier if the packet was marked as state 0 or 1;
- pipe was decremented if the packet was not marked as acked.
That is, pipe has been decremented by the number of packets between `last_acked'
and the previous position of `tail'. As a consequence, pipe now again reflects
the number of packets which have not (yet) been acked between the new position
of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that
the BUG_ON condition in check_sanity will also not be triggered, hence the test
(3) is also redundant.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates CCID2 to use the CCID dequeuing mechanism, converting from
previous constant-polling to a now event-driven mechanism.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This extends the existing wait-for-ccid routine so that it may be used with
different types of CCID. It further addresses the problems listed below.
The code looks if the write queue is non-empty and grants the TX CCID up to
`timeout' jiffies to drain the queue. It will instead purge that queue if
* the delay suggested by the CCID exceeds the time budget;
* a socket error occurred while waiting for the CCID;
* there is a signal pending (eg. annoyed user pressed Control-C);
* the CCID does not support delays (we don't know how long it will take).
D e t a i l s [can be removed]
-------------------------------
DCCP's sending mechanism functions a bit like non-blocking I/O: dccp_sendmsg()
will enqueue up to net.dccp.default.tx_qlen packets (default=5), without waiting
for them to be released to the network.
Rate-based CCIDs, such as CCID3/4, can impose sending delays of up to maximally
64 seconds (t_mbi in RFC 3448). Hence the write queue may still contain packets
when the application closes. Since the write queue is congestion-controlled by
the CCID, draining the queue is also under control of the CCID.
There are several problems that needed to be addressed:
1) The queue-drain mechanism only works with rate-based CCIDs. If CCID2 for
example has a full TX queue and becomes network-limited just as the
application wants to close, then waiting for CCID2 to become unblocked could
lead to an indefinite delay (i.e., application "hangs").
2) Since each TX CCID in turn uses a feedback mechanism, there may be changes
in its sending policy while the queue is being drained. This can lead to
further delays during which the application will not be able to terminate.
3) The minimum wait time for CCID3/4 can be expected to be the queue length
times the current inter-packet delay. For example if tx_qlen=100 and a delay
of 15 ms is used for each packet, then the application would have to wait
for a minimum of 1.5 seconds before being allowed to exit.
4) There is no way for the user/application to control this behaviour. It would
be good to use the timeout argument of dccp_close() as an upper bound. Then
the maximum time that an application is willing to wait for its CCIDs to can
be set via the SO_LINGER option.
These problems are addressed by giving the CCID a grace period of up to the
`timeout' value.
The wait-for-ccid function is, as before, used when the application
(a) has read all the data in its receive buffer and
(b) if SO_LINGER was set with a non-zero linger time, or
(c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ
state (client application closes after receiving CloseReq).
In addition, there is a catch-all case by calling __skb_queue_purge() after
waiting for the CCID. This is necessary since the write queue may still have
data when
(a) the host has been passively-closed,
(b) abnormal termination (unread data, zero linger time),
(c) wait-for-ccid could not finish within the given time limit.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This extends the packet dequeuing interface of dccp_write_xmit() to allow
1. CCIDs to take care of timing when the next packet may be sent;
2. delayed sending (as before, with an inter-packet gap up to 65.535 seconds).
The main purpose is to take CCID2 out of its polling mode (when it is network-
limited, it tries every millisecond to send, without interruption).
The interface can also be used to support other CCIDs.
The mode of operation for (2) is as follows:
* new packet is enqueued via dccp_sendmsg() => dccp_write_xmit(),
* ccid_hc_tx_send_packet() detects that it may not send (e.g. window full),
* it signals this condition via `CCID_PACKET_WILL_DEQUEUE_LATER',
* dccp_write_xmit() returns without further action;
* after some time the wait-condition for CCID becomes true,
* that CCID schedules the tasklet,
* tasklet function calls ccid_hc_tx_send_packet() via dccp_write_xmit(),
* since the wait-condition is now true, ccid_hc_tx_packet() returns "send now",
* packet is sent, and possibly more (since dccp_write_xmit() loops).
Code reuse: the taskled function calls dccp_write_xmit(), the timer function
reduces to a wrapper around the same code.
If the tasklet finds that the socket is locked, it re-schedules the tasklet
function (not the tasklet) after one jiffy.
Changed DCCP_BUG to dccp_pr_debug when transmit_skb returns an error (e.g. when a
local qdisc is used, NET_XMIT_DROP=1 can be returned for many packets).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch reorganises the return value convention of the CCID TX sending
function, to permit more flexible schemes, as required by subsequent patches.
Currently the convention is
* values < 0 mean error,
* a value == 0 means "send now", and
* a value x > 0 means "send in x milliseconds".
The patch provides symbolic constants and a function to interpret return values.
In addition, it caps the maximum positive return value to 0xFFFF milliseconds,
corresponding to 65.535 seconds.
This is possible since in CCID-3 the maximum inter-packet gap is t_mbi = 64 sec.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch replaces an almost identical replication of code: large parts
of dccp_parse_options() re-appeared as ccid2_ackvector() in ccid2.c.
Apart from the duplication, this caused two more problems:
1. CCIDs should not need to be concerned with parsing header options;
2. one can not assume that Ack Vectors appear as a contiguous area within an
skb, it is legal to insert other options and/or padding in between. The
current code would throw an error and stop reading in such a case.
The patch provides a new data structure and associated list housekeeping.
Only small changes were necessary to integrate with CCID-2: data structure
initialisation, adapt list traversal routine, and add call to the provided
cleanup routine.
The latter also lead to fixing the following BUG: CCID-2 so far ignored
Ack Vectors on all packets other than Ack/DataAck, which is incorrect,
since Ack Vectors can be present on any packet that has an Ack field.
Details:
--------
* received Ack Vectors are parsed by dccp_parse_options() alone, which passes
the result on to the CCID-specific routine ccid_hc_tx_parse_options();
* CCIDs interested in using/decoding Ack Vector information will add code
to fetch parsed Ack Vectors via this interface;
* a data structure, `struct dccp_ackvec_parsed' is provided as interface;
* this structure arranges Ack Vectors of the same skb into a FIFO order;
* a doubly-linked list is used to keep the required FIFO code small.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes
* functions for which updates have been provided in the preceding patches and
* the @av_vec_len field - it is no longer necessary since the buffer length is
now always computed dynamically;
* conditional debugging code (CONFIG_IP_DCCP_ACKVEC).
The reason for removing the conditional debugging code is that Ack Vectors are
an almost inevitable necessity - RFC 4341 says that for CCID-2, Ack Vectors must
be used. Furthermore, the code would be only interesting for coding - after some
extensive testing with this patch set, having the debug code around is no longer
of real help.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The problem with Ack Vectors is that
i) their length is variable and can in principle grow quite large,
ii) it is hard to predict exactly how large they will be.
Due to the second point it seems not a good idea to reduce the MPS; in
particular when on average there is enough room for the Ack Vector and an
increase in length is momentarily due to some burst loss, after which the
Ack Vector returns to its normal/average length.
The solution taken by this patch is to subtract a minimum-expected Ack Vector
length from the MPS (previous patch), and to defer any larger Ack Vectors onto
a separate Sync - but only if indeed there is no space left on the skb.
This patch provides the infrastructure to schedule Sync-packets for transporting
(urgent) out-of-band data. Its signalling is quicker than scheduling an Ack, since
it does not need to wait for new application data.
It can thus serve other parts of the DCCP code as well.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>