One of my test machine got a deadlock during "tc" sessions,
adding/deleting classes & filters, using traffic estimators.
After some analysis, I believe we have a potential use after free case
in est_timer() :
spin_lock(e->stats_lock); << HERE >>
read_lock(&est_lock);
if (e->bstats == NULL) << TEST >>
goto skip;
Test is done a bit late, because after estimator is killed, and before
rcu grace period elapsed, we might already have freed/reuse memory where
e->stats_locks points to (some qdisc->q.lock)
A possible fix is to respect a rcu grace period at Qdisc dismantle time.
On 64bit, sizeof(struct Qdisc) is exactly 192 bytes. Adding 16 bytes to
it (for struct rcu_head) is a problem because it might change
performance, given QDISC_ALIGNTO is 32 bytes.
This is why I also change QDISC_ALIGNTO to 64 bytes, to satisfy most
current alignment requirements.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an additional queuing strategy, called pfifo_head_drop,
to remove the oldest skb in the case of an overflow within the queue -
the head element - instead of the last skb (tail). To remove the oldest
skb in congested situations is useful for sensor network environments
where newer packets reflect the superior information.
Reviewed-by: Florian Westphal <fw@strlen.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This cleanup patch puts struct/union/enum opening braces,
in first line to ease grep games.
struct something
{
becomes :
struct something {
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_queue_xmit enqueue's a skb and calls qdisc_run which
dequeue's the skb and xmits it. In most cases, the skb that
is enqueue'd is the same one that is dequeue'd (unless the
queue gets stopped or multiple cpu's write to the same queue
and ends in a race with qdisc_run). For default qdiscs, we
can remove the redundant enqueue/dequeue and simply xmit the
skb since the default qdisc is work-conserving.
The patch uses a new flag - TCQ_F_CAN_BYPASS to identify the
default fast queue. The controversial part of the patch is
incrementing qlen when a skb is requeued - this is to avoid
checks like the second line below:
+ } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> !q->gso_skb &&
+ !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
Results of a 2 hour testing for multiple netperf sessions (1,
2, 4, 8, 12 sessions on a 4 cpu system-X). The BW numbers are
aggregate Mb/s across iterations tested with this version on
System-X boxes with Chelsio 10gbps cards:
----------------------------------
Size | ORG BW NEW BW |
----------------------------------
128K | 156964 159381 |
256K | 158650 162042 |
----------------------------------
Changes from ver1:
1. Move sch_direct_xmit declaration from sch_generic.h to
pkt_sched.h
2. Update qdisc basic statistics for direct xmit path.
3. Set qlen to zero in qdisc_reset.
4. Changed some function names to more meaningful ones.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's use TICKS instead of US, so PSCHED_TICKS2NS and PSCHED_NS2TICKS
(like in PSCHED_TICKS_PER_SEC already) to avoid misleading.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change PSCHED_SHIFT from 10 to 6 to increase schedulers time
resolution. This will increase 16x a number of (internal) ticks per
nanosecond, and is needed to improve accuracy of schedulers based on
rate tables, like HTB, TBF or CBQ, with rates above 100Mbit. It is
assumed this change is safe for 32bit accounting of time diffs up
to 2 minutes, which should be enough for common use (extremely low
rate values may overflow, so get inaccurate instead). To make full
use of this change an updated iproute2 will be needed. (But using
older iproute2 should be safe too.)
This change breaks ticks - microseconds similarity, so some minor code
fixes might be needed. It is also planned to change naming adequately
eg. to PSCHED_TICKS2NS() etc. in the near future.
Reported-by: Antonio Almeida <vexwek@gmail.com>
Tested-by: Antonio Almeida <vexwek@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use PSCHED_SHIFT constant instead of '10' in PSCHED_US2NS() and
PSCHED_NS2US() macros to enable changing this value later.
Additionally use PSCHED_SHIFT in sch_hfsc SM_SHIFT and ISM_SHIFT
definitions. This part of the patch is based on feedback from
Patrick McHardy <kaber@trash.net>.
Reported-by: Antonio Almeida <vexwek@gmail.com>
Tested-by: Antonio Almeida <vexwek@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> suggested:
> How about making this flag and the warning message (in a out-of-line
> function) globally available? Other qdiscs (f.i. HFSC) can't deal with
> inner non-work-conserving qdiscs as well.
This patch uses qdisc->flags field of "suspected" child qdisc.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current check wrongly uses the state of one (currently the first)
tx queue for all tx queues in case of non-default qdiscs. This check
mainly prevented requeuing loop with __netif_schedule(), but now it's
controlled inside __qdisc_run(), while dequeuing. The wrongness of
this check was first noticed by Herbert Xu.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
without rtnl_lock(), adding and deleting from a qdisc list needs
additional locking. This patch adds global spinlock qdisc_list_lock
and wrapper functions for modifying the list. It is considered as a
temporary solution until hfsc_dequeue(), netem_dequeue() and
tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
With feedback from Herbert Xu and David S. Miller.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a bug report by Andrew Gallatin on netdev
with subject "CPU utilization increased in 2.6.27rc"
In commit 37437bb2e1
("pkt_sched: Schedule qdiscs instead of netdev_queue.")
the test of the queue being stopped was erroneously
removed from qdisc_run().
When the TX queue of the device fills up, this omission
causes lots of extraneous useless work to be queued up
to softirq context, where we'll just return immediately
because the device is still stuffed up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add size table functions for qdiscs and calculate packet size in
qdisc_enqueue().
Based on patch by Patrick McHardy
http://marc.info/?l=linux-netdev&m=115201979221729&w=2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have shared qdiscs, packets come out of the qdiscs
for multiple transmit queues.
Therefore it doesn't make any sense to schedule the transmit
queue when logically we cannot know ahead of time the TX
queue of the SKB that the qdisc->dequeue() will give us.
Just for sanity I added a BUG check to make sure we never
get into a state where the noop_qdisc is scheduled.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is associated with a netdev_queue, but when we have
qdisc sharing that no longer makes any sense.
Signed-off-by: David S. Miller <davem@davemloft.net>
This effectively "flips the switch" by making the core networking
and multiqueue-aware drivers use the new TX multiqueue structures.
Non-multiqueue drivers need no changes. The interfaces they use such
as netif_stop_queue() degenerate into an operation on TX queue zero.
So everything "just works" for them.
Code that really wants to do "X" to all TX queues now invokes a
routine that does so, such as netif_tx_wake_all_queues(),
netif_tx_stop_all_queues(), etc.
pktgen and netpoll required a little bit more surgery than the others.
In particular the pktgen changes, whilst functional, could be largely
improved. The initial check in pktgen_xmit() will sometimes check the
wrong queue, which is mostly harmless. The thing to do is probably to
invoke fill_packet() earlier.
The bulk of the netpoll changes is to make the code operate solely on
the TX queue indicated by by the SKB queue mapping.
Setting of the SKB queue mapping is entirely confined inside of
net/core/dev.c:dev_pick_tx(). If we end up needing any kind of
special semantics (drops, for example) it will be implemented here.
Finally, we now have a "real_num_tx_queues" which is where the driver
indicates how many TX queues are actually active.
With IGB changes from Jeff Kirsher.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert packet schedulers to use the netlink API. Unfortunately a gradual
conversion is not possible without breaking compilation in the middle or
adding lots of casts, so this patch converts them all in one step. The
patch has been mostly generated automatically with some minor edits to
at least allow seperate conversion of classifiers and actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since hardware header operations are part of the protocol class
not the device instance, make them into a separate object and
save memory.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The behaviour of NET_CLS_POLICE for TC_POLICE_RECLASSIFY was to return
it to the qdisc, which could handle it internally or ignore it. With
NET_CLS_ACT however, tc_classify starts over at the first classifier
and never returns it to the qdisc. This makes it impossible to support
qdisc-internal reclassification, which in turn makes it impossible to
remove the old NET_CLS_POLICE code without breaking compatibility since
we have two qdiscs (CBQ and ATM) that support this.
This patch adds a tc_classify_compat function that handles
reclassification the old way and changes CBQ and ATM to use it.
This again is of course not fully backwards compatible with the previous
NET_CLS_ACT behaviour. Unfortunately there is no way to fully maintain
compatibility *and* support qdisc internal reclassification with
NET_CLS_ACT, but this seems like the better choice over keeping the two
incompatible options around forever.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're now holding the rtnl during the entire dump operation, we
can remove qdisc_tree_lock, whose only purpose is to protect dump
callbacks from concurrent changes to the qdisc tree.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all packet schedulers have been converted to hrtimers most users
of PSCHED_JIFFIE2US and PSCHED_US2JIFFIE are gone. The remaining users use
it to convert external time units to packet scheduler clock ticks, so use
PSCHED_TICKS_PER_SEC instead.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of the manual clock source selection mess and use ktime. Also
use a scalar representation, which allows to clean up pkt_sched.h a bit
more and results in less ktime_to_ns() calls in most cases.
The PSCHED_US2JIFFIE/PSCHED_JIFFIE2US macros are implemented quite
inefficient by this patch, following patches will convert all qdiscs
to hrtimers and get rid of them entirely.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In PSCHED_TADD and PSCHED_TADD2, if delta is less than tv.tv_usec (so,
less than USEC_PER_SEC too) then tv_res will be smaller than tv. The
affectation "(tv_res).tv_usec = __delta;" is wrong. The fix is to
revert to the original code before
4ee303dfea and change the 'if' in
'while'.
[Shuya MAEDA: "while (__delta >= USEC_PER_SEC){ ... }" instead of
"while (__delta > USEC_PER_SEC){ ... }"]
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having two or more qdisc_run's contend against each other is bad because
it can induce packet reordering if the packets have to be requeued. It
appears that this is an unintended consequence of relinquinshing the queue
lock while transmitting. That in turn is needed for devices that spend a
lot of time in their transmit routine.
There are no advantages to be had as devices with queues are inherently
single-threaded (the loopback device is not but then it doesn't have a
queue).
Even if you were to add a queue to a parallel virtual device (e.g., bolt
a tbf filter in front of an ipip tunnel device), you would still want to
process the queue in sequence to ensure that the packets are ordered
correctly.
The solution here is to steal a bit from net_device to prevent this.
BTW, as qdisc_restart is no longer used by anyone as a module inside the
kernel (IIRC it used to with netif_wake_queue), I have not exported the
new __qdisc_run function.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds qdisc_alloc() to share code between qdisc_create()
and qdisc_create_dflt(). Hides the qdisc alignment behind
macros and makes use of them.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!