Commit Graph

1777 Commits

Author SHA1 Message Date
Jiri Pirko db50514f9a net: sched: add termination action to allow goto chain
Introduce new type of termination action called "goto_chain". This allows
user to specify a chain to be processed. This action type is
then processed as a return value in tcf_classify loop in similar
way as "reclassify" is, only it does not reset to the first filter
in chain but rather reset to the first filter of the desired chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 9fb9f251d2 net: sched: push tp down to action init
Tp pointer will be needed by the next patch in order to get the chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 5bc1701881 net: sched: introduce multichain support for filters
Instead of having only one filter per block, introduce a list of chains
for every block. Create chain 0 by default. UAPI is extended so the user
can specify which chain he wants to change. If the new attribute is not
specified, chain 0 is used. That allows to maintain backward
compatibility. If chain does not exist and user wants to manipulate with
it, new chain is created with specified index. Also, when last filter is
removed from the chain, the chain is destroyed.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko acb31fae3b net: sched: push chain dump to a separate function
Since there will be multiple chains to dump, push chain dumping code to
a separate function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 2190d1d094 net: sched: introduce helpers to work with filter chains
Introduce struct tcf_chain object and set of helpers around it. Wraps up
insertion, deletion and search in the filter chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 7961973a00 net: sched: move TC_H_MAJ macro call into tcf_auto_prio
Call the helper from the function rather than to always adjust the
return value of the function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 9d36d9e545 net: sched: replace nprio by a bool to make the function more readable
The use of "nprio" variable in tc_ctl_tfilter is a bit cryptic and makes
a reader wonder what is going on for a while. So help him to understand
this priority allocation dance a litte bit better.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko fbe9c5b01f net: sched: rename tcf_destroy_chain helper
Make the name consistent with the rest of the helpers around.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 6529eaba33 net: sched: introduce tcf block infractructure
Currently, the filter chains are direcly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chains sharing among qdiscs, there is a need for common
object that would hold the chains. This introduces such object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 87d83093bf net: sched: move tc_classify function to cls_api.c
Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Eric Dumazet 218af599fa tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:43:31 -04:00
Eric Dumazet cb395b2010 net: sched: optimize class dumps
In commit 59cc1f61f0 ("net: sched: convert qdisc linked list to
hashtable") we missed the opportunity to considerably speed up
tc_dump_tclass_root() if a qdisc handle is provided by user.

Instead of iterating all the qdiscs, use qdisc_match_from_root()
to directly get the one we look for.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 21:37:40 -04:00
Michal Hocko da6bc57a8f net: use kvmalloc with __GFP_REPEAT rather than open coded variant
fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
__GFP_REPEAT flag based on explanation from Eric:

 "At the time, tests on the hardware I had in my labs showed that
  vmalloc() could deliver pages spread all over the memory and that was
  a small penalty (once memory is fragmented enough, not at boot time)"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code.  This is rather disruptive
for something that can be achived with the fallback.  On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests.  So
the effect of this patch is that requests which fit into 32kB will fall
back to vmalloc easier now.

Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko 752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Colin Ian King 985538eee0 net/sched: remove redundant null check on head
head is previously null checked and so the 2nd null check on head
is redundant and therefore can be removed.

Detected by CoverityScan, CID#1399505 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04 11:01:30 -04:00
Jiri Pirko 9da3242e6a net: sched: add helpers to handle extended actions
Jump is now the only one using value action opcode. This is going to
change soon. So introduce helpers to work with this. Convert TC_ACT_JUMP.

This also fixes the TC_ACT_JUMP check, which is incorrectly done as a
bit check, not a value check.

Fixes: e0ee84ded7 ("net sched actions: Complete the JUMPX opcode")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:33:54 -04:00
Benjamin LaHaise 1a7fca63cd flower: check unused bits in MPLS fields
Since several of the the netlink attributes used to configure the flower
classifier's MPLS TC, BOS and Label fields have additional bits which are
unused, check those bits to ensure that they are actually 0 as suggested
by Jamal.

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Cc: David Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jakub Kicinski <kubakici@wp.pl>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 11:12:21 -04:00
Jamal Hadi Salim e0ee84ded7 net sched actions: Complete the JUMPX opcode
per discussion at netconf/netdev:
When we have an action that is capable of branching (example a policer),
we can achieve a continuation of the action graph by programming a
"continue" where we find an exact replica of the same filter rule with a lower
priority and the remainder of the action graph. When you have 100s of thousands
of filters which require such a feature it gets very inefficient to do two
lookups.

This patch completes a leftover feature of action codes. Its time has come.

Example below where a user labels packets with a different skbmark on ingress
of a port depending on whether they have/not exceeded the configured rate.
This mark is then used to make further decisions on some egress port.

 #rate control, very low so we can easily see the effect
sudo $TC actions add action police rate 1kbit burst 90k \
conform-exceed pipe/jump 2 index 10
 # skbedit index 11 will be used if the user conforms
sudo $TC actions add action skbedit mark 11 ok index 11
 # skbedit index 12 will be used if the user does not conform
sudo $TC actions add action skbedit mark 12 ok index 12

 #lets bind the user ..
sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \
match ip dst 127.0.0.8/32 flowid 1:10 \
action police index 10 \
action skbedit index 11 \
action skbedit index 12

 #run a ping -f and see what happens..
 #
jhs@foobar:~$ sudo $TC -s filter ls dev $ETH parent ffff: protocol ip
filter pref 8 u32
filter pref 8 u32 fh 800: ht divisor 1
filter pref 8 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2800 success 1005)
  match 7f000008/ffffffff at 16 (success 1005 )
	action order 1:  police 0xa rate 1Kbit burst 23440b mtu 2Kb action pipe/jump 2 overhead 0b
	ref 2 bind 1 installed 207 sec used 122 sec
	Action statistics:
	Sent 84420 bytes 1005 pkt (dropped 0, overlimits 721 requeues 0)
	backlog 0b 0p requeues 0

	action order 2:  skbedit mark 11 pass
	 index 11 ref 2 bind 1 installed 204 sec used 122 sec
 	Action statistics:
	Sent 60564 bytes 721 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

	action order 3:  skbedit mark 12 pass
	 index 12 ref 2 bind 1 installed 201 sec used 122 sec
 	Action statistics:
	Sent 23856 bytes 284 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

Not bad, about 28% non-conforming packets..

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:30:06 -04:00
Benjamin LaHaise a577d8f793 cls_flower: add support for matching MPLS fields (v2)
Add support to the tc flower classifier to match based on fields in MPLS
labels (TTL, Bottom of Stack, TC field, Label).

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:30:46 -04:00
David S. Miller fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
WANG Cong 4392053879 net_sched: remove useless NULL to tp->root
There is no need to NULL tp->root in ->destroy(), since tp is
going to be freed very soon, and existing readers are still
safe to read them.

For cls_route, we always init its tp->root, so it can't be NULL,
we can drop more useless code.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:58:15 -04:00
WANG Cong 763dbf6328 net_sched: move the empty tp check from ->destroy() to ->delete()
We could have a race condition where in ->classify() path we
dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL. Daniel cured this bug in commit d936377414
("net, sched: respect rcu grace period on cls destruction").

This happens when ->destroy() is called for deleting a filter to
check if we are the last one in tp, this tp is still linked and
visible at that time. The root cause of this problem is the semantic
of ->destroy(), it does two things (for non-force case):

1) check if tp is empty
2) if tp is empty we could really destroy it

and its caller, if cares, needs to check its return value to see if it
is really destroyed. Therefore we can't unlink tp unless we know it is
empty.

As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().

Fixes: 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
Cc: Roi Dayan <roid@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:58:15 -04:00
Wolfgang Bumiller e0535ce58b net sched actions: allocate act cookie early
Policing filters do not use the TCA_ACT_* enum and the tb[]
nlattr array in tcf_action_init_1() doesn't get filled for
them so we should not try to look for a TCA_ACT_COOKIE
attribute in the then uninitialized array.
The error handling in cookie allocation then calls
tcf_hash_release() leading to invalid memory access later
on.
Additionally, if cookie allocation fails after an already
existing non-policing filter has successfully been changed,
tcf_action_release() should not be called, also we would
have to roll back the changes in the error handling, so
instead we now allocate the cookie early and assign it on
success at the end.

CVE-2017-7979
Fixes: 1045ba77a5 ("net sched actions: Add support for user cookies")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 16:32:07 -04:00
David Ahern c21ef3e343 net: rtnetlink: plumb extended ack to doit function
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.

This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:35:38 -04:00
stephen hemminger 8ea3e43911 Subject: net: allow configuring default qdisc
Since 3.12 it has been possible to configure the default queuing
discipline via sysctl. This patch adds ability to configure the
default queue discipline in kernel configuration. This is useful for
environments where configuring the value from userspace is difficult
to manage.

The default is still the same as before (pfifo_fast) and it is
possible to change after kernel init with sysctl. This is similar
to how TCP congestion control works.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:23:06 -04:00
David S. Miller 6b6cbc1471 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simply overlapping changes.  In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15 21:16:30 -04:00
Johannes Berg fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
WANG Cong 92f9170621 net_sched: check noop_qdisc before qdisc_hash_add()
Dmitry reported a crash when injecting faults in
attach_one_default_qdisc() and dev->qdisc is still
a noop_disc, the check before qdisc_hash_add() fails
to catch it because it tests NULL. We should test
against noop_qdisc since it is the default qdisc
at this point.

Fixes: 59cc1f61f0 ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:28:39 -07:00
Dan Carpenter 2c099ccbd0 net: sched: choke: remove some dead code
We accidentally left this dead code behind after commit 5952fde10c
("net: sched: choke: remove dead filter classify code").

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 06:53:42 -07:00
Geliang Tang 3b1af93cf1 net_sched: use setup_deferrable_timer
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 14:42:52 -07:00
Jiri Pirko 5952fde10c net: sched: choke: remove dead filter classify code
sch_choke is classless qdisc so it does not define cl_ops. Therefore
filter_list cannot be ever changed, being NULL all the time.
Reason is this check in tc_ctl_tfilter:

	/* Is it classful? */
	cops = q->ops->cl_ops;
	if (!cops)
		return -EINVAL;

So remove this dead code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 12:47:10 -07:00
Davide Caratti add641e7de sched: act_csum: don't mangle TCP and UDP GSO packets
after act_csum computes the checksum on skbs carrying GSO TCP/UDP packets,
subsequent segmentation fails because skb_needs_check(skb, true) returns
true. Because of that, skb_warn_bad_offload() is invoked and the following
message is displayed:

WARNING: CPU: 3 PID: 28 at net/core/dev.c:2553 skb_warn_bad_offload+0xf0/0xfd
<...>

  [<ffffffff8171f486>] skb_warn_bad_offload+0xf0/0xfd
  [<ffffffff8161304c>] __skb_gso_segment+0xec/0x110
  [<ffffffff8161340d>] validate_xmit_skb+0x12d/0x2b0
  [<ffffffff816135d2>] validate_xmit_skb_list+0x42/0x70
  [<ffffffff8163c560>] sch_direct_xmit+0xd0/0x1b0
  [<ffffffff8163c760>] __qdisc_run+0x120/0x270
  [<ffffffff81613b3d>] __dev_queue_xmit+0x23d/0x690
  [<ffffffff81613fa0>] dev_queue_xmit+0x10/0x20

Since GSO is able to compute checksum on individual segments of such skbs,
we can simply skip mangling the packet.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 17:40:33 -07:00
David S. Miller 16ae1f2236 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmmii.c
	drivers/net/hyperv/netvsc.c
	kernel/bpf/hashtab.c

Almost entirely overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 16:41:27 -07:00
Eric Dumazet aea92fb2e0 sch_dsmark: fix invalid skb_cow() usage
skb_cow(skb, sizeof(ip header)) is not very helpful in this context.

First we need to use pskb_may_pull() to make sure the ip header
is in skb linear part, then use skb_try_make_writable() to
address clones issues.

Fixes: 4c30719f4f ("[PKT_SCHED] dsmark: handle cloned and non-linear skb's")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 17:21:27 -07:00
Nik Unger 5080f39e8c netem: apply correct delay when rate throttling
I recently reported on the netem list that iperf network benchmarks
show unexpected results when a bandwidth throttling rate has been
configured for netem. Specifically:

1) The measured link bandwidth *increases* when a higher delay is added
2) The measured link bandwidth appears higher than the specified limit
3) The measured link bandwidth for the same very slow settings varies significantly across
  machines

The issue can be reproduced by using tc to configure netem with a
512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
veth pair between network namespaces, and then using iperf (or any
other network benchmarking tool) to test throughput. Complete detailed
instructions are in the original email chain here:
https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html

There appear to be two underlying bugs causing these effects:

- The first issue causes long delays when the rate is slow and no
  delay is configured (e.g., "rate 512kbit"). This is because SKBs are
  not orphaned when no delay is configured, so orphaning does not
  occur until *after* the rate-induced delay has been applied. For
  this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
  dramatically increases the measured bandwidth.

- The second issue is that rate-induced delays are not correctly
  applied, allowing SKB delays to occur in parallel. The indended
  approach is to compute the delay for an SKB and to add this delay to
  the end of the current queue. However, the code does not detect
  existing SKBs in the queue due to improperly testing sch->q.qlen,
  which is nonzero even when packets exist only in the
  rbtree. Consequently, new SKBs do not wait for the current queue to
  empty. When packet delays vary significantly (e.g., if packet sizes
  are different), then this also causes unintended reordering.

I modified the code to expect a delay (and orphan the SKB) when a rate
is configured. I also added some defensive tests that correctly find
the latest scheduled delivery time, even if it is (unexpectedly) for a
packet in sch->q. I have tested these changes on the latest kernel
(4.11.0-rc1+) and the iperf / ping test results are as expected.

Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:14:06 -07:00
Or Gerlitz a5e6a3b022 net/sched: fq_codel: Avoid set-but-unused variable
The code introduced by commit 2ccccf5fb4 ("net_sched: update
hierarchical backlog too") only sets prev_backlog in fq_codel_dequeue()
but not using that anywhere, remove that setting.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 12:02:14 -07:00
Or Gerlitz 4dba87b073 net/sched: act_ife: Staticfy find_decode_metaid()
As it's used only on that file.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 12:02:14 -07:00
Amritha Nambiar 56f36acd21 mqprio: Modify mqprio to pass user parameters via ndo_setup_tc.
The configurable priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.

This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.

Finally it also provides a means for the device driver to return the level
supported for the offload type via the qopt->hw value. Previously we were
just always assuming the value to be 1, in the future values beyond just 1
may be supported.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:20:27 -07:00
Alexander Duyck 2026fecf51 mqprio: Change handling of hw u8 to allow for multiple hardware offload modes
This patch is meant to allow for support of multiple hardware offload type
for a single device. There is currently no bounds checking for the hw
member of the mqprio_qopt structure.  This results in us being able to pass
values from 1 to 255 with all being treated the same.  On retreiving the
value it is returned as 1 for anything 1 or greater being set.

With this change we are currently adding limited bounds checking by
defining an enum and using those values to limit the reported hardware
offloads.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:20:27 -07:00
David S. Miller 101c431492 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmgenet.c
	net/core/sock.c

Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 11:59:10 -07:00
David S. Miller e33cc31630 sch_tbf: Remove bogus semicolon in if() conditional.
Fixes: 49b499718f ("net: sched: make default fifo qdiscs appear in the dump")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 00:00:03 -07:00
Etienne Noss 52491c7607 act_connmark: avoid crashing on malformed nlattrs with null parms
tcf_connmark_init does not check in its configuration if TCA_CONNMARK_PARMS
is set, resulting in a null pointer dereference when trying to access it.

[501099.043007] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[501099.043039] IP: [<ffffffffc10c60fb>] tcf_connmark_init+0x8b/0x180 [act_connmark]
...
[501099.044334] Call Trace:
[501099.044345]  [<ffffffffa47270e8>] ? tcf_action_init_1+0x198/0x1b0
[501099.044363]  [<ffffffffa47271b0>] ? tcf_action_init+0xb0/0x120
[501099.044380]  [<ffffffffa47250a4>] ? tcf_exts_validate+0xc4/0x110
[501099.044398]  [<ffffffffc0f5fa97>] ? u32_set_parms+0xa7/0x270 [cls_u32]
[501099.044417]  [<ffffffffc0f60bf0>] ? u32_change+0x680/0x87b [cls_u32]
[501099.044436]  [<ffffffffa4725d1d>] ? tc_ctl_tfilter+0x4dd/0x8a0
[501099.044454]  [<ffffffffa44a23a1>] ? security_capable+0x41/0x60
[501099.044471]  [<ffffffffa470ca01>] ? rtnetlink_rcv_msg+0xe1/0x220
[501099.044490]  [<ffffffffa470c920>] ? rtnl_newlink+0x870/0x870
[501099.044507]  [<ffffffffa472cc61>] ? netlink_rcv_skb+0xa1/0xc0
[501099.044524]  [<ffffffffa47073f4>] ? rtnetlink_rcv+0x24/0x30
[501099.044541]  [<ffffffffa472c634>] ? netlink_unicast+0x184/0x230
[501099.044558]  [<ffffffffa472c9d8>] ? netlink_sendmsg+0x2f8/0x3b0
[501099.044576]  [<ffffffffa46d8880>] ? sock_sendmsg+0x30/0x40
[501099.044592]  [<ffffffffa46d8e03>] ? SYSC_sendto+0xd3/0x150
[501099.044608]  [<ffffffffa425fda1>] ? __do_page_fault+0x2d1/0x510
[501099.044626]  [<ffffffffa47fbd7b>] ? system_call_fast_compare_end+0xc/0x9b

Fixes: 22a5dc0e5e ("net: sched: Introduce connmark action")
Signed-off-by: Étienne Noss <etienne.noss@wifirst.fr>
Signed-off-by: Victorien Molle <victorien.molle@wifirst.fr>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:32:41 -07:00
Jiri Kosina 49b499718f net: sched: make default fifo qdiscs appear in the dump
The original reason [1] for having hidden qdiscs (potential scalability
issues in qdisc_match_from_root() with single linked list in case of large
amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
qdisc linked list to hashtable").

This allows us for bringing more clarity and determinism into the dump by
making default pfifo qdiscs visible.

We're not turning this on by default though, at it was deemed [2] too
intrusive / unnecessary change of default behavior towards userspace.
Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
applications to request complete qdisc hierarchy dump, including the
ones that have always been implicit/invisible.

Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
about singletons would require quite some surgery with very little gain
(seeing no qdisc or seeing noop qdisc in the dump is probably setting
the same user expectation).

[1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
[2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 22:53:02 -07:00
Alexey Khoroshilov 6c4dc75c25 net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump
Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:13:03 -08:00
Ingo Molnar 4f17722c72 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Roman Mashak 37f1c63e3e net sched actions: do not overwrite status of action creation.
nla_memdup_cookie was overwriting err value, declared at function
scope and earlier initialized with result of ->init(). At success
nla_memdup_cookie() returns 0, and thus module refcnt decremented,
although the action was installed.

$ sudo tc actions add action pass index 1 cookie 1234
$ sudo tc actions ls action gact

        action order 0: gact action pass
         random type none pass val 0
         index 1 ref 1 bind 0
$
$ lsmod
Module                  Size  Used by
act_gact               16384  0
...
$
$ sudo rmmod act_gact
[   52.310283] ------------[ cut here ]------------
[   52.312551] WARNING: CPU: 1 PID: 455 at kernel/module.c:1113
module_put+0x99/0xa0
[   52.316278] Modules linked in: act_gact(-) crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel psmouse pcbc evbug aesni_intel aes_x86_64 crypto_simd
serio_raw glue_helper pcspkr cryptd
[   52.322285] CPU: 1 PID: 455 Comm: rmmod Not tainted 4.10.0+ #11
[   52.324261] Call Trace:
[   52.325132]  dump_stack+0x63/0x87
[   52.326236]  __warn+0xd1/0xf0
[   52.326260]  warn_slowpath_null+0x1d/0x20
[   52.326260]  module_put+0x99/0xa0
[   52.326260]  tcf_hashinfo_destroy+0x7f/0x90
[   52.326260]  gact_exit_net+0x27/0x40 [act_gact]
[   52.326260]  ops_exit_list.isra.6+0x38/0x60
[   52.326260]  unregister_pernet_operations+0x90/0xe0
[   52.326260]  unregister_pernet_subsys+0x21/0x30
[   52.326260]  tcf_unregister_action+0x68/0xa0
[   52.326260]  gact_cleanup_module+0x17/0xa0f [act_gact]
[   52.326260]  SyS_delete_module+0x1ba/0x220
[   52.326260]  entry_SYSCALL_64_fastpath+0x1e/0xad
[   52.326260] RIP: 0033:0x7f527ffae367
[   52.326260] RSP: 002b:00007ffeb402a598 EFLAGS: 00000202 ORIG_RAX:
00000000000000b0
[   52.326260] RAX: ffffffffffffffda RBX: 0000559b069912a0 RCX: 00007f527ffae367
[   52.326260] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559b06991308
[   52.326260] RBP: 0000000000000003 R08: 00007f5280264420 R09: 00007ffeb4029511
[   52.326260] R10: 000000000000087b R11: 0000000000000202 R12: 00007ffeb4029580
[   52.326260] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b069912a0
[   52.354856] ---[ end trace 90d89401542b0db6 ]---
$

With the fix:

$ sudo modprobe act_gact
$ lsmod
Module                  Size  Used by
act_gact               16384  0
...
$ sudo tc actions add action pass index 1 cookie 1234
$ sudo tc actions ls action gact

        action order 0: gact action pass
         random type none pass val 0
         index 1 ref 1 bind 0
$
$ lsmod
Module                  Size  Used by
act_gact               16384  1
...
$ sudo rmmod act_gact
rmmod: ERROR: Module act_gact is in use
$
$ sudo /home/mrv/bin/tc actions del action gact index 1
$ sudo rmmod act_gact
$ lsmod
Module                  Size  Used by
$

Fixes: 1045ba77a ("net sched actions: Add support for user cookies")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 21:31:32 -05:00
Roman Mashak edb9d1bff4 net sched actions: decrement module reference count after table flush.
When tc actions are loaded as a module and no actions have been installed,
flushing them would result in actions removed from the memory, but modules
reference count not being decremented, so that the modules would not be
unloaded.

Following is example with GACT action:

% sudo modprobe act_gact
% lsmod
Module                  Size  Used by
act_gact               16384  0
%
% sudo tc actions ls action gact
%
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  1
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  2
% sudo rmmod act_gact
rmmod: ERROR: Module act_gact is in use
....

After the fix:
% lsmod
Module                  Size  Used by
act_gact               16384  0
%
% sudo tc actions add action pass index 1
% sudo tc actions add action pass index 2
% sudo tc actions add action pass index 3
% lsmod
Module                  Size  Used by
act_gact               16384  3
%
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  0
%
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  0
% sudo rmmod act_gact
% lsmod
Module                  Size  Used by
%

Fixes: f97017cdef ("net-sched: Fix actions flushing")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 21:28:41 -05:00
Gao Feng 806a837650 pkt_sched: Remove useless qdisc_stab_lock
The qdisc_stab_lock is used in qdisc_get_stab and qdisc_put_stab.
These two functions are invoked in qdisc_create, qdisc_change, and
qdisc_destroy which run fully under RTNL.

So it already makes sure only one could access the qdisc_stab_list at
the same time. Then it is unnecessary to use qdisc_stab_lock now.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 15:10:18 -05:00
Or Gerlitz 5cecb6cc00 net/sched: cls_bpf: Reflect HW offload status
BPF classifier support for the "in hw" offloading flags.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Amir Vadai <amir@vadai.me>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 12:08:06 -05:00
Or Gerlitz 24d3dc6d27 net/sched: cls_u32: Reflect HW offload status
U32 support for the "in hw" offloading flags.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 12:08:06 -05:00