Commit Graph

2678 Commits

Author SHA1 Message Date
Vlad Buslov 2142236b45 net: sched: act_bpf: remove dependency on rtnl lock
Use tcf spinlock to protect bpf action private data from concurrent
modification during dump and init. Remove rtnl lock assertion that is no
longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Jiri Pirko 63cc5bcc9f net: sched: fix block->refcnt decrement
Currently the refcnt is never decremented in case the value is not 1.
Fix it by adding decrement in case the refcnt is not 1.

Reported-by: Vlad Buslov <vladbu@mellanox.com>
Fixes: f71e0ca4db ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 14:12:04 -07:00
Vlad Buslov 9ca6163005 net: sched: cls_flower: set correct offload data in fl_reoffload
fl_reoffload implementation sets following members of struct
tc_cls_flower_offload incorrectly:
 - masked key instead of mask
 - key instead of masked key

Fix fl_reoffload to provide correct data to offload callback.

Fixes: 31533cba43 ("net: sched: cls_flower: implement offload tcf_proto_op")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:35:17 -07:00
Pieter Jansen van Vuuren 0a6e77784f net/sched: allow flower to match tunnel options
Allow matching on options in Geneve tunnel headers.
This makes use of existing tunnel metadata support.

The options can be described in the form
CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.

e.g.
 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev geneve0 ingress
 # tc filter add dev geneve0 protocol ip parent ffff: \
     flower \
       enc_src_ip 10.0.99.192 \
       enc_dst_ip 10.0.99.193 \
       enc_key_id 11 \
       geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \
       ip_proto udp \
       action mirred egress redirect dev eth1

This patch adds support for matching Geneve options in the order
supplied by the user. This leads to an efficient implementation in
the software datapath (and in our opinion hardware datapaths that
offload this feature). It is also compatible with Geneve options
matching provided by the Open vSwitch kernel datapath which is
relevant here as the Flower classifier may be used as a mechanism
to program flows into hardware as a form of Open vSwitch datapath
offload (sometimes referred to as OVS-TC). The netlink
Kernel/Userspace API may be extended, for example by adding a flag,
if other matching options are desired, for example matching given
options in any order. This would require an implementation in the
TC software datapath. And be done in a way that drivers that
facilitate offload of the Flower classifier can reject or accept
such flows based on hardware datapath capabilities.

This approach was discussed and agreed on at Netconf 2017 in Seoul.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:22:15 -07:00
Dan Carpenter 1cbc36a53b net: sched: cls_flower: Fix an error code in fl_tmplt_create()
We forgot to set the error code on this path, so we return NULL instead
of an error pointer.  In the current code kzalloc() won't fail for small
allocations so this doesn't really affect runtime.

Fixes: b95ec7eb3b ("net: sched: cls_flower: implement chain templates")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:25:46 -07:00
Jiri Pirko 5ca8a25c14 net: sched: fix flush on non-existing chain
User was able to perform filter flush on chain 0 even if it didn't have
any filters in it. With the patch that avoided implicit chain 0
creation, this changed. So in case user wants filter flush on chain
which does not exist, just return success. There's no reason for non-0
chains to behave differently than chain 0, so do the same for them.

Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: f71e0ca4db ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 09:44:37 -07:00
Jiri Pirko 290b1c8b1a net: sched: make tcf_chain_{get,put}() static
These are no longer used outside of cls_api.c so make them static.
Move tcf_chain_flush() to avoid fwd declaration of tcf_chain_put().

Signed-off-by: Jiri Pirko <jiri@mellanox.com>

v1->v2:
- new patch

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:06:19 -07:00
Jiri Pirko 5368140730 net: sched: fix notifications for action-held chains
Chains that only have action references serve as placeholders.
Until a non-action reference is created, user should not be aware
of the chain. Also he should not receive any notifications about it.
So send notifications for the new chain only in case the chain gets
the first non-action reference. Symmetrically to that, when
the last non-action reference is dropped, send the notification about
deleted chain.

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

v1->v2:
- made __tcf_chain_{get,put}() static as suggested by Cong

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:06:19 -07:00
Jiri Pirko 3d32f4c548 net: sched: change name of zombie chain to "held_by_acts_only"
As mentioned by Cong and Jakub during the review process, it is a bit
odd to sometimes (act flow) create a new chain which would be
immediately a "zombie". So just rename it to "held_by_acts_only".

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:06:19 -07:00
Paolo Abeni e5cf1baf92 act_mirred: use TC_ACT_REINSERT when possible
When mirred is invoked from the ingress path, and it wants to redirect
the processed packet, it can now use the TC_ACT_REINSERT action,
filling the tcf_result accordingly, and avoiding a per packet
skb_clone().

Overall this gives a ~10% improvement in forwarding performance for the
TC S/W data path and TC S/W performances are now comparable to the
kernel openvswitch datapath.

v1 -> v2: use ACT_MIRRED instead of ACT_REDIRECT
v2 -> v3: updated after action rename, fixed typo into the commit
	message
v3 -> v4: updated again after action rename, added more comments to
	the code (JiriP), skip the optimization if the control action
	need to touch the tcf_result (Paolo)
v4 -> v5: fix sparse warning (kbuild bot)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30 09:31:14 -07:00
Paolo Abeni 7fd4b288ea tc/act: remove unneeded RCU lock in action callback
Each lockless action currently does its own RCU locking in ->act().
This allows using plain RCU accessor, even if the context
is really RCU BH.

This change drops the per action RCU lock, replace the accessors
with the _bh variant, cleans up a bit the surrounding code and
documents the RCU status in the relevant header.
No functional nor performance change is intended.

The goal of this patch is clarifying that the RCU critical section
used by the tc actions extends up to the classifier's caller.

v1 -> v2:
 - preserve rcu lock in act_bpf: it's needed by eBPF helpers,
   as pointed out by Daniel

v3 -> v4:
 - fixed some typos in the commit message (JiriP)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30 09:31:13 -07:00
Paolo Abeni 802bfb1915 net/sched: user-space can't set unknown tcfa_action values
Currently, when initializing an action, the user-space can specify
and use arbitrary values for the tcfa_action field. If the value
is unknown by the kernel, is implicitly threaded as TC_ACT_UNSPEC.

This change explicitly checks for unknown values at action creation
time, and explicitly convert them to TC_ACT_UNSPEC. No functional
changes are introduced, but this will allow introducing tcfa_action
values not exposed to user-space in a later patch.

Note: we can't use the above to hide TC_ACT_REDIRECT from user-space,
as the latter is already part of uAPI.

v3 -> v4:
 - use an helper to check for action validity (JiriP)
 - emit an extack for invalid actions (JiriP)
v4 -> v5:
 - keep messages on a single line, drop net_warn (Marcelo)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30 09:31:13 -07:00
YueHaibing 3f6bcc5162 act_bpf: Use kmemdup instead of duplicating it in tcf_bpf_init_from_ops
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29 13:20:16 -07:00
YueHaibing f9562fa4a5 cls_bpf: Use kmemdup instead of duplicating it in cls_bpf_prog_from_ops
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29 13:19:49 -07:00
YueHaibing 0a80848ec5 act_pedit: remove unnecessary semicolon
net/sched/act_pedit.c:289:2-3: Unneeded semicolon

Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29 13:19:20 -07:00
Dave Taht 2db6dc2662 sch_cake: Make gso-splitting configurable from userspace
This patch restores cake's deployed behavior at line rate to always
split gso, and makes gso splitting configurable from userspace.

running cake unlimited (unshaped) at 1gigE, local traffic:

no-split-gso bql limit: 131966
split-gso bql limit:   ~42392-45420

On this 4 stream test splitting gso apart results in halving the
observed interpacket latency at no loss in throughput.

Summary of tcp_nup test run 'gso-split' (at 2018-07-26 16:03:51.824728):

 Ping (ms) ICMP :         0.83         0.81 ms              341
 TCP upload avg :       235.43       235.39 Mbits/s         301
 TCP upload sum :       941.71       941.56 Mbits/s         301
 TCP upload::1  :       235.45       235.43 Mbits/s         271
 TCP upload::2  :       235.45       235.41 Mbits/s         289
 TCP upload::3  :       235.40       235.40 Mbits/s         288
 TCP upload::4  :       235.41       235.40 Mbits/s         291

verses

Summary of tcp_nup test run 'no-split-gso' (at 2018-07-26 16:37:23.563960):

                           avg       median          # data pts
 Ping (ms) ICMP :         1.67         1.73 ms              348
 TCP upload avg :       234.56       235.37 Mbits/s         301
 TCP upload sum :       938.24       941.49 Mbits/s         301
 TCP upload::1  :       234.55       235.38 Mbits/s         285
 TCP upload::2  :       234.57       235.37 Mbits/s         286
 TCP upload::3  :       234.58       235.37 Mbits/s         274
 TCP upload::4  :       234.54       235.42 Mbits/s         288

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-27 13:38:20 -07:00
Jiri Pirko 1f3ed383fb net: sched: don't dump chains only held by actions
In case a chain is empty and not explicitly created by a user,
such chain should not exist. The only exception is if there is
an action "goto chain" pointing to it. In that case, don't show the
chain in the dump. Track the chain references held by actions and
use them to find out if a chain should or should not be shown
in chain dump.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-27 09:38:46 -07:00
Jiri Pirko c921d7db3d net: sched: unmark chain as explicitly created on delete
Once user manually deletes the chain using "chain del", the chain cannot
be marked as explicitly created anymore.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-26 14:12:58 -07:00
Gustavo A. R. Silva 2ed9db3074 net: sched: cls_api: fix dead code in switch
Code at line 1850 is unreachable. Fix this by removing the break
statement above it, so the code for case RTM_GETCHAIN can be
properly executed.

Addresses-Coverity-ID: 1472050 ("Structurally dead code")
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-26 14:09:27 -07:00
Vinicius Costa Gomes 990e35ecba cbs: Add support for the graft function
This will allow to install a child qdisc under cbs. The main use case
is to install ETF (Earliest TxTime First) qdisc under cbs, so there's
another level of control for time-sensitive traffic.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-26 13:58:30 -07:00
Jianbo Liu 158abbf170 net/sched: cls_flower: Use correct inline function for assignment of vlan tpid
This fixes the following sparse warning:

net/sched/cls_flower.c:1356:36: warning: incorrect type in argument 3 (different base types)
net/sched/cls_flower.c:1356:36: expected unsigned short [unsigned] [usertype] value
net/sched/cls_flower.c:1356:36: got restricted __be16 [usertype] vlan_tpid

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-25 16:33:02 -07:00
Nishanth Devarajan aea5f654e6 net/sched: add skbprio scheduler
Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
according to their skb->priority field. Under congestion, already-enqueued lower
priority packets will be dropped to make space available for higher priority
packets. Skbprio was conceived as a solution for denial-of-service defenses that
need to route packets with different priorities as a means to overcome DoS
attacks.

v5
*Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
default sch->limit to 64.

v4
*Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
page for Skbprio, in iproute2.

v3
*Drop max_limit parameter in struct skbprio_sched_data and instead use
sch->limit.

*Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
qdisc (previously being referenced every time qdisc changes).

*Move qdisc's detailed description from in-code to Documentation/networking.

*When qdisc is saturated, enqueue incoming packet first before dequeueing
lowest priority packet in queue - improves usage of call stack registers.

*Introduce and use overlimit stat to keep track of number of dropped packets.

v2
*Use skb->priority field rather than DS field. Rename queueing discipline as
SKB Priority Queue (previously Gatekeeper Priority Queue).

*Queueing discipline is made classful to expose Skbprio's internal priority
queues.

Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
Reviewed-by: Cody Doucette <doucette@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24 14:44:00 -07:00
Stephen Hemminger 50f699b1f8 sched: fix trailing whitespace
Remove trailing whitespace and blank lines at EOF

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24 14:10:42 -07:00
Jiri Pirko 3473845273 net: sched: cls_flower: propagate chain teplate creation and destruction to drivers
Introduce a couple of flower offload commands in order to propagate
template creation/destruction events down to device drivers.
Drivers may use this information to prepare HW in an optimal way
for future filter insertions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko b95ec7eb3b net: sched: cls_flower: implement chain templates
Use the previously introduced template extension and implement
callback to create, destroy and dump chain template. The existing
parsing and dumping functions are re-used. Also, check if newly added
filters fit the template if it is set.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko 33fb5cba11 net: sched: cls_flower: change fl_init_dissector to accept mask and dissector
This function is going to be used for templates as well, so we need to
pass the pointer separately.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko f5749081f0 net: sched: cls_flower: move key/mask dumping into a separate function
Push key/mask dumping from fl_dump() into a separate function
fl_dump_key(), that will be reused for template dumping.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko 9f407f1768 net: sched: introduce chain templates
Allow user to set a template for newly created chains. Template lock
down the chain for particular classifier type/options combinations.
The classifier needs to support templates, otherwise kernel would
reply with error.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko 32a4f5ecd7 net: sched: introduce chain object to uapi
Allow user to create, destroy, get and dump chain objects. Do that by
extending rtnl commands by the chain-specific ones. User will now be
able to explicitly create or destroy chains (so far this was done only
automatically according the filter/act needs and refcounting). Also, the
user will receive notification about any chain creation or destuction.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko f71e0ca4db net: sched: Avoid implicit chain 0 creation
Currently, chain 0 is implicitly created during block creation. However
that does not align with chain object exposure, creation and destruction
api introduced later on. So make the chain 0 behave the same way as any
other chain and only create it when it is needed. Since chain 0 is
somehow special as the qdiscs need to hold pointer to the first chain
tp, this requires to move the chain head change callback infra to the
block structure.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Jiri Pirko f34e8bff58 net: sched: push ops lookup bits into tcf_proto_lookup_ops()
Push all bits that take care of ops lookup, including module loading
outside tcf_proto_create() function, into tcf_proto_lookup_ops()

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 20:44:12 -07:00
Gustavo A. R. Silva baa2d2b17e net: sched: use PTR_ERR_OR_ZERO macro in tcf_block_cb_register
This line makes up what macro PTR_ERR_OR_ZERO already does. So,
make use of PTR_ERR_OR_ZERO rather than an open-code version.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-21 16:17:08 -07:00
David S. Miller c4c5551df1 Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux
All conflicts were trivial overlapping changes, so reasonably
easy to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 21:17:12 -07:00
Or Gerlitz 0e2c17b64d net/sched: cls_flower: Support matching on ip tos and ttl for tunnels
Allow users to set rules matching on ipv4 tos and ttl or
ipv6 traffic-class and hoplimit of tunnel headers.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-19 23:26:01 -07:00
Or Gerlitz 07a557f47d net/sched: tunnel_key: Allow to set tos and ttl for tc based ip tunnels
Allow user-space to provide tos and ttl to be set for the tunnel headers.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-19 23:26:01 -07:00
YueHaibing 5318918390 net: sched: Using NULL instead of plain integer
Fixes the following sparse warnings:

net/sched/cls_api.c:1101:43: warning: Using plain integer as NULL pointer
net/sched/cls_api.c:1492:75: warning: Using plain integer as NULL pointer

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18 13:44:07 -07:00
Toke Høiland-Jørgensen 301f935be9 sch_cake: Fix tin order when set through skb->priority
In diffserv mode, CAKE stores tins in a different order internally than
the logical order exposed to userspace. The order remapping was missing
in the handling of 'tc filter' priority mappings through skb->priority,
resulting in bulk and best effort mappings being reversed relative to
how they are displayed.

Fix this by adding the missing mapping when reading skb->priority.

Fixes: 83f8fd69af ("sch_cake: Add DiffServ handling")
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 14:47:45 -07:00
Vlad Buslov 01683a1469 net: sched: refactor flower walk to iterate over idr
Extend struct tcf_walker with additional 'cookie' field. It is intended to
be used by classifier walk implementations to continue iteration directly
from particular filter, instead of iterating 'skip' number of times.

Change flower walk implementation to save filter handle in 'cookie'. Each
time flower walk is called, it looks up filter with saved handle directly
with idr, instead of iterating over filter linked list 'skip' number of
times. This change improves complexity of dumping flower classifier from
quadratic to linearithmic. (assuming idr lookup has logarithmic complexity)

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reported-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13 18:24:27 -07:00
Davide Caratti c749cdda90 net/sched: act_skbedit: don't use spinlock in the data path
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on
act_skbedit configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 14:54:12 -07:00
Davide Caratti 6f3dfb0dc8 net/sched: skbedit: use per-cpu counters
use per-CPU counters, instead of sharing a single set of stats with all
cores: this removes the need of spinlocks when stats are read/updated.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 14:54:12 -07:00
Jacob Keller 83fe6b8709 sch_fq_codel: zero q->flows_cnt when fq_codel_init fails
When fq_codel_init fails, qdisc_create_dflt will cleanup by using
qdisc_destroy. This function calls the ->reset() op prior to calling the
->destroy() op.

Unfortunately, during the failure flow for sch_fq_codel, the ->flows
parameter is not initialized, so the fq_codel_reset function will null
pointer dereference.

   kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
   kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
   kernel: PGD 0 P4D 0
   kernel: Oops: 0000 [#1] SMP PTI
   kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
   kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
   kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-codel-test+ #3
   kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
   kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
   kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
   kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
   kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
   kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
   kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
   kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
   kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
   kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
   kernel: Call Trace:
   kernel:  qdisc_destroy+0x56/0x140
   kernel:  qdisc_create_dflt+0x8b/0xb0
   kernel:  mq_init+0xc1/0xf0
   kernel:  qdisc_create_dflt+0x5a/0xb0
   kernel:  dev_activate+0x205/0x230
   kernel:  __dev_open+0xf5/0x160
   kernel:  __dev_change_flags+0x1a3/0x210
   kernel:  dev_change_flags+0x21/0x60
   kernel:  do_setlink+0x660/0xdf0
   kernel:  ? down_trylock+0x25/0x30
   kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
   kernel:  ? rtnl_newlink+0x816/0x990
   kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
   kernel:  ? _cond_resched+0x15/0x30
   kernel:  ? kmem_cache_alloc+0x20/0x1b0
   kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
   kernel:  ? rtnl_calcit.isra.30+0x100/0x100
   kernel:  ? netlink_rcv_skb+0x4c/0x120
   kernel:  ? netlink_unicast+0x19e/0x260
   kernel:  ? netlink_sendmsg+0x1ff/0x3c0
   kernel:  ? sock_sendmsg+0x36/0x40
   kernel:  ? ___sys_sendmsg+0x295/0x2f0
   kernel:  ? ebitmap_cmp+0x6d/0x90
   kernel:  ? dev_get_by_name_rcu+0x73/0x90
   kernel:  ? skb_dequeue+0x52/0x60
   kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
   kernel:  ? bit_waitqueue+0x30/0x30
   kernel:  ? fsnotify_grab_connector+0x3c/0x60
   kernel:  ? __sys_sendmsg+0x51/0x90
   kernel:  ? do_syscall_64+0x74/0x180
   kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
   kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
   kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
   kernel: CR2: 0000000000000008
   kernel: ---[ end trace e81a62bede66274e ]---

This is caused because flows_cnt is non-zero, but flows hasn't been
initialized. fq_codel_init has left the private data in a partially
initialized state.

To fix this, reset flows_cnt to 0 when we fail to initialize.
Additionally, to make the state more consistent, also cleanup the flows
pointer when the allocation of backlogs fails.

This fixes the NULL pointer dereference, since both the for-loop and
memset in fq_codel_reset will be no-ops when flow_cnt is zero.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 12:32:09 -07:00
Vlad Buslov e0479b670d net: sched: fix unprotected access to rcu cookie pointer
Fix action attribute size calculation function to take rcu read lock and
access act_cookie pointer with rcu dereference.

Fixes: eec94fdb04 ("net: sched: use rcu for action cookie update")
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11 23:01:02 -07:00
Vlad Buslov 01e866bf07 net: sched: act_ife: fix memory leak in ife init
Free params if tcf_idr_check_alloc() returned error.

Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11 22:53:00 -07:00
Jianbo Liu 5e9a0fe492 net/sched: flower: Fix null pointer dereference when run tc vlan command
Zahari issued tc vlan command without setting vlan_ethtype, which will
crash kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE]
is not null before use it.
Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case.

Fixes: d64efd0926 ('net/sched: flower: Add supprt for matching on QinQ vlan headers')
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reported-by: Zahari Doychev <zahari.doychev@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11 22:48:13 -07:00
Toke Høiland-Jørgensen 0c850344d3 sch_cake: Conditionally split GSO segments
At lower bandwidths, the transmission time of a single GSO segment can add
an unacceptable amount of latency due to HOL blocking. Furthermore, with a
software shaper, any tuning mechanism employed by the kernel to control the
maximum size of GSO segments is thrown off by the artificial limit on
bandwidth. For this reason, we split GSO segments into their individual
packets iff the shaper is active and configured to a bandwidth <= 1 Gbps.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen a729b7f0bd sch_cake: Add overhead compensation support to the rate shaper
This commit adds configurable overhead compensation support to the rate
shaper. With this feature, userspace can configure the actual bottleneck
link overhead and encapsulation mode used, which will be used by the shaper
to calculate the precise duration of each packet on the wire.

This feature is needed because CAKE is often deployed one or two hops
upstream of the actual bottleneck (which can be, e.g., inside a DSL or
cable modem). In this case, the link layer characteristics and overhead
reported by the kernel does not match the actual bottleneck. Being able to
set the actual values in use makes it possible to configure the shaper rate
much closer to the actual bottleneck rate (our experience shows it is
possible to get with 0.1% of the actual physical bottleneck rate), thus
keeping latency low without sacrificing bandwidth.

The overhead compensation has three tunables: A fixed per-packet overhead
size (which, if set, will be accounted from the IP packet header), a
minimum packet size (MPU) and a framing mode supporting either ATM or PTM
framing. We include a set of common keywords in TC to help users configure
the right parameters. If no overhead value is set, the value reported by
the kernel is used.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen 83f8fd69af sch_cake: Add DiffServ handling
This adds support for DiffServ-based priority queueing to CAKE. If the
shaper is in use, each priority tier gets its own virtual clock, which
limits that tier's rate to a fraction of the overall shaped rate, to
discourage trying to game the priority mechanism.

CAKE defaults to a simple, three-tier mode that interprets most code points
as "best effort", but places CS1 traffic into a low-priority "bulk" tier
which is assigned 1/16 of the total rate, and a few code points indicating
latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
The other supported DiffServ modes are a 4-tier mode matching the 802.11e
precedence rules, as well as two 8-tier modes, one of which implements
strict precedence of the eight priority levels.

This commit also adds an optional DiffServ 'wash' mode, which will zero out
the DSCP fields of any packet passing through CAKE. While this can
technically be done with other mechanisms in the kernel, having the feature
available in CAKE significantly decreases configuration complexity; and the
implementation cost is low on top of the other DiffServ-handling code.

Filters and applications can set the skb->priority field to override the
DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches
CAKE's qdisc handle, the minor number will be interpreted as a priority
tier if it is less than or equal to the number of configured priority
tiers.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen ea82511518 sch_cake: Add NAT awareness to packet classifier
When CAKE is deployed on a gateway that also performs NAT (which is a
common deployment mode), the host fairness mechanism cannot distinguish
internal hosts from each other, and so fails to work correctly.

To fix this, we add an optional NAT awareness mode, which will query the
kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
and use that in the flow and host hashing.

When the shaper is enabled and the host is already performing NAT, the cost
of this lookup is negligible. However, in unlimited mode with no NAT being
performed, there is a significant CPU cost at higher bandwidths. For this
reason, the feature is turned off by default.

Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen 8b7138814f sch_cake: Add optional ACK filter
The ACK filter is an optional feature of CAKE which is designed to improve
performance on links with very asymmetrical rate limits. On such links
(which are unfortunately quite prevalent, especially for DSL and cable
subscribers), the downstream throughput can be limited by the number of
ACKs capable of being transmitted in the *upstream* direction.

Filtering ACKs can, in general, have adverse effects on TCP performance
because it interferes with ACK clocking (especially in slow start), and it
reduces the flow's resiliency to ACKs being dropped further along the path.
To alleviate these drawbacks, the ACK filter in CAKE tries its best to
always keep enough ACKs queued to ensure forward progress in the TCP flow
being filtered. It does this by only filtering redundant ACKs. In its
default 'conservative' mode, the filter will always keep at least two
redundant ACKs in the queue, while in 'aggressive' mode, it will filter
down to a single ACK.

The ACK filter works by inspecting the per-flow queue on every packet
enqueue. Starting at the head of the queue, the filter looks for another
eligible packet to drop (so the ACK being dropped is always closer to the
head of the queue than the packet being enqueued). An ACK is eligible only
if it ACKs *fewer* bytes than the new packet being enqueued, including any
SACK options. This prevents duplicate ACKs from being filtered, to avoid
interfering with retransmission logic. In addition, we check TCP header
options and only drop those that are known to not interfere with sender
state. In particular, packets with unknown option codes are never dropped.

In aggressive mode, an eligible packet is always dropped, while in
conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
(with no data segments) are considered eligible for dropping, but when an
ACK with data segments is enqueued, this can cause another pure ACK to
become eligible for dropping.

The approach described above ensures that this ACK filter avoids most of
the drawbacks of a naive filtering mechanism that only keeps flow state but
does not inspect the queue. This is the rationale for including the ACK
filter in CAKE itself rather than as separate module (as the TC filter, for
instance).

Our performance evaluation has shown that on a 30/1 Mbps link with a
bidirectional traffic test (RRUL), turning on the ACK filter on the
upstream link improves downstream throughput by ~20% (both modes) and
upstream throughput by ~12% in conservative mode and ~40% in aggressive
mode, at the cost of ~5ms of inter-flow latency due to the increased
congestion.

In *really* pathological cases, the effect can be a lot more; for instance,
the ACK filter increases the achievable downstream throughput on a link
with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
Mbps to ~25 Mbps).

Finally, even though we consider the ACK filter to be safer than most, we
do not recommend turning it on everywhere: on more symmetrical link
bandwidths the effect is negligible at best.

Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen 7298de9cd7 sch_cake: Add ingress mode
The ingress mode is meant to be enabled when CAKE runs downlink of the
actual bottleneck (such as on an IFB device). The mode changes the shaper
to also account dropped packets to the shaped rate, as these have already
traversed the bottleneck.

Enabling ingress mode will also tune the AQM to always keep at least two
packets queued *for each flow*. This is done by scaling the minimum queue
occupancy level that will disable the AQM by the number of active bulk
flows. The rationale for this is that retransmits are more expensive in
ingress mode, since dropped packets have to traverse the bottleneck again
when they are retransmitted; thus, being more lenient and keeping a minimum
number of packets queued will improve throughput in cases where the number
of active flows are so large that they saturate the bottleneck even at
their minimum window size.

This commit also adds a separate switch to enable ingress mode rate
autoscaling. If enabled, the autoscaling code will observe the actual
traffic rate and adjust the shaper rate to match it. This can help avoid
latency increases in the case where the actual bottleneck rate decreases
below the shaped rate. The scaling filters out spikes by an EWMA filter.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen 046f6fd5da sched: Add Common Applications Kept Enhanced (cake) qdisc
sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* An deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring, loss, ecn markings, latency
  variation.

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
International Symposium on Local and Metropolitan Area Networks (LANMAN).

This patch adds the base shaper and packet scheduler, while subsequent
commits add the optional (configurable) features. The full userspace API
and most data structures are included in this commit, but options not
understood in the base version will be ignored.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
 qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
 Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
  backlog 0b 0p requeues 12
  memory used: 1053008b of 15140Kb
  capacity estimate: 970Mbit
  min/max network layer size:           28 /    1500
  min/max overhead-adjusted size:       84 /    1538
  average network hdr offset:           14
                    Bulk  Best Effort        Voice
   thresh      62500Kbit        1Gbit      250Mbit
   target          5.0ms        5.0ms        5.0ms
   interval      100.0ms      100.0ms      100.0ms
   pk_delay          5us          5us          6us
   av_delay          3us          2us          2us
   sp_delay          2us          1us          1us
   backlog            0b           0b           0b
   pkts          3164050     25030267      9530280
   bytes      3227519915  35396974782  12879808898
   way_inds            0            8            0
   way_miss           21          366           25
   way_cols            0            0            0
   drops               5            0            1
   marks               0            0            0
   ack_drop            0            0            0
   sp_flows            1            3            0
   bk_flows            0            1            1
   un_flows            0            0            0
   max_len         68130        68130        68130

Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10 20:06:34 -07:00
David S. Miller 0dbc81eab4 net: sched: Fix warnings from xchg() on RCU'd cookie pointer.
The kbuild test robot reports:

>> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@    expected struct tc_cookie [noderef] <asn:4>*__ret @@    got [noderef] <asn:4>*__ret @@
   net/sched/act_api.c:71:15:    expected struct tc_cookie [noderef] <asn:4>*__ret
   net/sched/act_api.c:71:15:    got struct tc_cookie *new_cookie
>> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@    expected struct tc_cookie *old @@    got struct tc_cookie [noderef] <struct tc_cookie *old @@
   net/sched/act_api.c:71:13:    expected struct tc_cookie *old
   net/sched/act_api.c:71:13:    got struct tc_cookie [noderef] <asn:4>*[assigned] __ret
>> net/sched/act_api.c:132:48: sparse: dereference of noderef expression

Handle this in the usual way by force casting away the __rcu annotation
when we are using xchg() on it.

Fixes: eec94fdb04 ("net: sched: use rcu for action cookie update")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 17:02:59 +09:00
Vlad Buslov 90b73b77d0 net: sched: change action API to use array of pointers to actions
Act API used linked list to pass set of actions to functions. It is
intrusive data structure that stores list nodes inside action structure
itself, which means it is not safe to modify such list concurrently.
However, action API doesn't use any linked list specific operations on this
set of actions, so it can be safely refactored into plain pointer array.

Refactor action API to use array of pointers to tc_actions instead of
linked list. Change argument 'actions' type of exported action init,
destroy and dump functions.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov 0190c1d452 net: sched: atomically check-allocate action
Implement function that atomically checks if action exists and either takes
reference to it, or allocates idr slot for action index to prevent
concurrent allocations of actions with same index. Use EBUSY error pointer
to indicate that idr slot is reserved.

Implement cleanup helper function that removes temporary error pointer from
idr. (in case of error between idr allocation and insertion of newly
created action to specified index)

Refactor all action init functions to insert new action to idr using this
API.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov cae422f379 net: sched: use reference counting action init
Change action API to assume that action init function always takes
reference to action, even when overwriting existing action. This is
necessary because action API continues to use action pointer after init
function is done. At this point action becomes accessible for concurrent
modifications, so user must always hold reference to it.

Implement helper put list function to atomically release list of actions
after action API init code is done using them.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov 4e8ddd7f17 net: sched: don't release reference on action overwrite
Return from action init function with reference to action taken,
even when overwriting existing action.

Action init API initializes its fourth argument (pointer to pointer to tc
action) to either existing action with same index or newly created action.
In case of existing index(and bind argument is zero), init function returns
without incrementing action reference counter. Caller of action init then
proceeds working with action, without actually holding reference to it.
This means that action could be deleted concurrently.

Change action init behavior to always take reference to action before
returning successfully, in order to protect from concurrent deletion.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov 16af606739 net: sched: implement reference counted action release
Implement helper delete function that uses new action ops 'delete', instead
of destroying action directly. This is required so act API could delete
actions by index, without holding any references to action that is being
deleted.

Implement function __tcf_action_put() that releases reference to action and
frees it, if necessary. Refactor action deletion code to use new put
function and not to rely on rtnl lock. Remove rtnl lock assertions that are
no longer needed.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov b409074e66 net: sched: add 'delete' function to action ops
Extend action ops with 'delete' function. Each action type to implements
its own delete function that doesn't depend on rtnl lock.

Implement delete function that is required to delete actions without
holding rtnl lock. Use action API function that atomically deletes action
only if it is still in action idr. This implementation prevents concurrent
threads from deleting same action twice.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov 2a2ea34970 net: sched: implement action API that deletes action by index
Implement new action API function that atomically finds and deletes action
from idr by index. Intended to be used by lockless actions that do not rely
on rtnl lock.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov 3f7c72bc42 net: sched: always take reference to action
Without rtnl lock protection it is no longer safe to use pointer to tc
action without holding reference to it. (it can be destroyed concurrently)

Remove unsafe action idr lookup function. Instead of it, implement safe tcf
idr check function that atomically looks up action in idr and increments
its reference and bind counters. Implement both action search and check
using new safe function

Reference taken by idr check is temporal and should not be accounted by
userspace clients (both logically and to preserver current API behavior).
Subtract temporal reference when dumping action to userspace using existing
tca_get_fill function arguments.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov 789871bb2a net: sched: implement unlocked action init API
Add additional 'rtnl_held' argument to act API init functions. It is
required to implement actions that need to release rtnl lock before loading
kernel module and reacquire if afterwards.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov 036bb44327 net: sched: change type of reference and bind counters
Change type of action reference counter to refcount_t.

Change type of action bind counter to atomic_t.
This type is used to allow decrementing bind counter without testing
for 0 result.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov eec94fdb04 net: sched: use rcu for action cookie update
Implement functions to atomically update and free action cookie
using rcu mechanism.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Davide Caratti 38230a3e0e net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used
the control action in the common member of struct tcf_tunnel_key must be a
valid value, as it can contain the chain index when 'goto chain' is used.
Ensure that the control action can be read as x->tcfa_action, when x is a
pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
prevent the following command:

 # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
 > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1

from causing a NULL dereference when a matching packet is received:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
 PGD 80000001097ac067 P4D 80000001097ac067 PUD 103b0a067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 3491 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
 RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
 R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
 R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
 FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
 Call Trace:
  <IRQ>
  fl_classify+0x1ad/0x1c0 [cls_flower]
  ? __update_load_avg_se.isra.47+0x1ca/0x1d0
  ? __update_load_avg_se.isra.47+0x1ca/0x1d0
  ? update_load_avg+0x665/0x690
  ? update_load_avg+0x665/0x690
  ? kmem_cache_alloc+0x38/0x1c0
  tcf_classify+0x89/0x140
  __netif_receive_skb_core+0x5ea/0xb70
  ? enqueue_entity+0xd0/0x270
  ? process_backlog+0x97/0x150
  process_backlog+0x97/0x150
  net_rx_action+0x14b/0x3e0
  __do_softirq+0xde/0x2b4
  do_softirq_own_stack+0x2a/0x40
  </IRQ>
  do_softirq.part.18+0x49/0x50
  __local_bh_enable_ip+0x49/0x50
  __dev_queue_xmit+0x4ab/0x8a0
  ? wait_woken+0x80/0x80
  ? packet_sendmsg+0x38f/0x810
  ? __dev_queue_xmit+0x8a0/0x8a0
  packet_sendmsg+0x38f/0x810
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  ? do_vfs_ioctl+0xa4/0x630
  ? syscall_trace_enter+0x1df/0x2e0
  ? __audit_syscall_exit+0x22a/0x290
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fd67e18dc93
 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
 RSP: 002b:00007ffe0189b748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 00000000020ca010 RCX: 00007fd67e18dc93
 RDX: 0000000000000062 RSI: 00000000020ca322 RDI: 0000000000000003
 RBP: 00007ffe0189b780 R08: 00007ffe0189b760 R09: 0000000000000014
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
 R13: 00000000020ca322 R14: 00007ffe0189b760 R15: 0000000000000003
 Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
  mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
 CR2: 0000000000000000
 ---[ end trace 1ab8b5b5d4639dfc ]---
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
 RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
 R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
 R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
 FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: 0x11400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 22:01:08 +09:00
Davide Caratti 11a245e2f7 net/sched: act_csum: fix NULL dereference when 'goto chain' is used
the control action in the common member of struct tcf_csum must be a valid
value, as it can contain the chain index when 'goto chain' is used. Ensure
that the control action can be read as x->tcfa_action, when x is a pointer
to struct tc_action and x->ops->type is TCA_ACT_CSUM, to prevent the
following command:

  # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
  > $tcflags dst_mac $h2mac action csum ip or tcp or udp or sctp goto chain 1

from triggering a NULL pointer dereference when a matching packet is
received.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
 PGD 800000010416b067 P4D 800000010416b067 PUD 1041be067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 3072 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
 RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
 RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
 R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
 R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
 FS:  00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
 Call Trace:
  <IRQ>
  fl_classify+0x1ad/0x1c0 [cls_flower]
  ? arp_rcv+0x121/0x1b0
  ? __x2apic_send_IPI_dest+0x40/0x40
  ? smp_reschedule_interrupt+0x1c/0xd0
  ? reschedule_interrupt+0xf/0x20
  ? reschedule_interrupt+0xa/0x20
  ? device_is_rmrr_locked+0xe/0x50
  ? iommu_should_identity_map+0x49/0xd0
  ? __intel_map_single+0x30/0x140
  ? e1000e_update_rdt_wa.isra.52+0x22/0xb0 [e1000e]
  ? e1000_alloc_rx_buffers+0x233/0x250 [e1000e]
  ? kmem_cache_alloc+0x38/0x1c0
  tcf_classify+0x89/0x140
  __netif_receive_skb_core+0x5ea/0xb70
  ? enqueue_task_fair+0xb6/0x7d0
  ? process_backlog+0x97/0x150
  process_backlog+0x97/0x150
  net_rx_action+0x14b/0x3e0
  __do_softirq+0xde/0x2b4
  do_softirq_own_stack+0x2a/0x40
  </IRQ>
  do_softirq.part.18+0x49/0x50
  __local_bh_enable_ip+0x49/0x50
  __dev_queue_xmit+0x4ab/0x8a0
  ? wait_woken+0x80/0x80
  ? packet_sendmsg+0x38f/0x810
  ? __dev_queue_xmit+0x8a0/0x8a0
  packet_sendmsg+0x38f/0x810
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  ? do_vfs_ioctl+0xa4/0x630
  ? syscall_trace_enter+0x1df/0x2e0
  ? __audit_syscall_exit+0x22a/0x290
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f5a45cbec93
 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
 RSP: 002b:00007ffd0ee6d748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000001161010 RCX: 00007f5a45cbec93
 RDX: 0000000000000062 RSI: 0000000001161322 RDI: 0000000000000003
 RBP: 00007ffd0ee6d780 R08: 00007ffd0ee6d760 R09: 0000000000000014
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
 R13: 0000000001161322 R14: 00007ffd0ee6d760 R15: 0000000000000003
 Modules linked in: act_csum act_gact cls_flower sch_ingress vrf veth act_tunnel_key(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic hp_wmi iTCO_wdt sparse_keymap rfkill mei_wdt iTCO_vendor_support wmi_bmof gpio_ich irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_intel crypto_simd cryptd snd_hda_codec glue_helper snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm pcspkr i2c_i801 snd_timer snd sg lpc_ich soundcore wmi mei_me
  mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ahci libahci crc32c_intel i915 ixgbe serio_raw libata video dca i2c_algo_bit sfc drm_kms_helper syscopyarea mtd sysfillrect mdio sysimgblt fb_sys_fops drm e1000e i2c_core
 CR2: 0000000000000000
 ---[ end trace 3c9e9d1a77df4026 ]---
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
 RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
 RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
 R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
 R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
 FS:  00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: 0x26400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: 9c5f69bbd7 ("net/sched: act_csum: don't use spinlock in the fast path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 22:01:08 +09:00
Jianbo Liu d64efd0926 net/sched: flower: Add supprt for matching on QinQ vlan headers
As support dissecting of QinQ inner and outer vlan headers, user can
add rules to match on QinQ vlan headers.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 20:51:53 +09:00
Jianbo Liu d30695126f net/sched: flower: Dump the ethertype encapsulated in vlan
Currently the encapsulated ethertype is not dumped as it's the same as
TCA_FLOWER_KEY_ETH_TYPE keyvalue. But the dumping result is inconsistent
with input, we add dumping it with TCA_FLOWER_KEY_VLAN_ETH_TYPE.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 20:51:53 +09:00
Jianbo Liu aaab08344d net/sched: flower: Add support for matching on vlan ethertype
As flow dissector stores vlan ethertype, tc flower now can match on that.
It is to make preparation for supporting QinQ.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 20:51:53 +09:00
Jesus Sanchez-Palencia 4b15c70753 net/sched: Make etf report drops on error_queue
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.

Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.

Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
adding both ee_data and ee_info fields as e.g.:

    ((__u64) serr->ee_data << 32) + serr->ee_info

This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia 88cab77162 net/sched: Add HW offloading capability to ETF
Add infra so etf qdisc supports HW offload of time-based transmission.

For hw offload, the time sorted list is still used, so packets are
dequeued always in order of txtime.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
	   clockid CLOCK_REALTIME

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference and packets leave the Qdisc "delta" (100000) nanoseconds
before their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Vinicius Costa Gomes 25db26a913 net/sched: Introduce the ETF Qdisc
The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packets transmission based
on absolute time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
           clockid CLOCK_TAI

In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in the
socket will be in reference to the clockid CLOCK_TAI and packets
will leave the qdisc "delta" (100000) nanoseconds before its transmission
time.

The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if it expires while being enqueued.

The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().

Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
to be configured, but it's been decided that usage of CLOCK_TAI should
be enforced until we decide to allow for other clockids to be used.
The rationale here is that PTP times are usually in the TAI scale, thus
no other clocks should be necessary. For now, the qdisc will return
EINVAL if any clocks other than CLOCK_TAI are used.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Vinicius Costa Gomes 860b642b9c net/sched: Allow creating a Qdisc watchdog with other clocks
This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Wei Yongjun 30e99ed6db net: sched: act_pedit: fix possible memory leak in tcf_pedit_init()
'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init()
but not all of the error handle path free it, this may cause memory
leak. This patch fix it.

Fixes: 71d0ed7079 ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:08:21 +09:00
Qiaobin Fu e7e3728bd7 net:sched: add action inheritdsfield to skbedit
The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

v5:
*Update the drop counter for TC_ACT_SHOT

v4:
*Not allow setting flags other than the expected ones.

*Allow dumping the pure flags.

v3:
*Use optional flags, so that it won't break old versions of tc.

*Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.

v2:
*Fix the style issue

*Move the code from skbmod to skbedit

Original idea by Jamal Hadi Salim <jhs@mojatatu.com>

Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 21:27:42 +09:00
Roman Mashak 9868c0b2eb net sched actions: add extack messages in pedit action
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-02 22:44:06 +09:00
Simon Horman 0ed5269f9e net/sched: add tunnel option support to act_tunnel_key
Allow setting tunnel options using the act_tunnel_key action.

Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            geneve_opts 0102:80:00800022,0102:80:00800022 \
    action mirred egress redirect dev geneve0

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Simon Horman 9d7298cd1d net/sched: act_tunnel_key: add extended ack support
Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
during validation of user input.

Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Simon Horman a1165b5919 net/sched: act_tunnel_key: disambiguate metadata dst error cases
Metadata may be NULL for one of two reasons:
* Missing user input
* Failure to allocate the metadata dst

Disambiguate these case by returning -EINVAL for the former and -ENOMEM
for the latter rather than -EINVAL for both cases.

This is in preparation for using extended ack to provide more information
to users when parsing their input.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Roman Mashak 4305274153 net sched actions: avoid bitwise operation on signed value in pedit
Since char can be unsigned or signed, and bitwise operators may have
implementation-dependent results when performed on signed operands,
declare 'u8 *' operand instead.

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak 95b0d2dc13 net sched actions: fix misleading text strings in pedit action
Change "tc filter pedit .." to "tc actions pedit .." in error
messages to clearly refer to pedit action.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak 6ff7586e38 net sched actions: use sizeof operator for buffer length
Replace constant integer with sizeof() to clearly indicate
the destination buffer length in skb_header_pointer() calls.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak 544377cd25 net sched actions: fix sparse warning
The variable _data in include/asm-generic/sections.h defines sections,
this causes sparse warning in pedit:

net/sched/act_pedit.c:293:35: warning: symbol '_data' shadows an earlier one
./include/asm-generic/sections.h:36:13: originally declared here

Therefore rename the variable.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak 80f0f574cc net sched actions: fix coding style in pedit action
Fix coding style issues in tc pedit action detected by the
checkpatch script.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Yousuk Seung 0a9fe5c375 netem: slotting with non-uniform distribution
Extend slotting with support for non-uniform distributions. This is
similar to netem's non-uniform distribution delay feature.

Commit f043efeae2f1 ("netem: support delivering packets in delayed
time slots") added the slotting feature to approximate the behaviors
of media with packet aggregation but only supported a uniform
distribution for delays between transmission attempts. Tests with TCP
BBR with emulated wifi links with non-uniform distributions produced
more useful results.

Syntax:
   slot dist DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
      [bytes MAX_BYTES]

The syntax and use of the distribution table is the same as in the
non-uniform distribution delay feature. A file DISTRIBUTION must be
present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
NETEM_DIST_SCALE. A random value x is selected from the table and it
takes DELAY + ( x * JITTER ) as delay. Correlation between values is not
supported.

Examples:
  Normal distribution delay with mean = 800us and stdev = 100us.
  > tc qdisc add dev eth0 root netem slot dist normal 800us 100us

  Optionally set the max slot size in bytes and/or packets.
  > tc qdisc add dev eth0 root netem slot dist normal 800us 100us \
    bytes 64k packets 42

Signed-off-by: Yousuk Seung <ysseung@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:06:24 +09:00
John Hurley 326367427c net: sched: call reoffload op on block callback reg
Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
block when a callback tries to register to a block that already has
offloaded rules. If all existing rules cannot be offloaded then the
registration is rejected. This replaces the previous policy of rejecting
such callback registration outright.

On unregistration of a callback, the rules are flushed for that given cb.
The implementation of block sharing in the NFP driver, for example,
duplicates shared rules to all devs bound to a block. This meant that
rules could still exist in hw even after a device is unbound from a block
(assuming the block still remains active).

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:33 +09:00
John Hurley 7e916b7680 net: sched: cls_bpf: implement offload tcf_proto_op
Add the offload tcf_proto_op in cls_bpf to generate an offload message for
each bpf prog in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' prog.

A prog contains a flag to indicate if it is in hardware or not. To
ensure the offload function properly maintains this flag, keep a reference
counter for the number of instances of the prog that are in hardware. Only
update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:33 +09:00
John Hurley 530d995123 net: sched: cls_u32: implement offload tcf_proto_op
Add the offload tcf_proto_op in cls_u32 to generate an offload message for
each filter and the hashtable in the given tcf_proto. Call the specified
callback with this new offload message. The function only returns an error
if the callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the offload function properly maintains this flag, keep a reference
counter for the number of instances of the filter that are in hardware.
Only update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:33 +09:00
John Hurley 0efd1b3a13 net: sched: cls_matchall: implement offload tcf_proto_op
Add the reoffload tcf_proto_op in matchall to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

Ensure matchall flags correctly report if the rule is in hw by keeping a
reference counter for the number of instances of the rule offloaded. Only
update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:33 +09:00
John Hurley 31533cba43 net: sched: cls_flower: implement offload tcf_proto_op
Add the reoffload tcf_proto_op in flower to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the reoffload function properly maintains this flag, keep a
reference counter for the number of instances of the filter that are in
hardware. Only update the flag when this counter changes from or to 0. Add
a generic helper function to implement this behaviour.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
John Hurley 60513bd82c net: sched: pass extack pointer to block binds and cb registration
Pass the extact struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of fails in the bind process.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
David S. Miller 9ff3b40e41 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-06-26 08:07:17 +09:00
Cong Wang fe0b082fed net_sched: remove unused htb drop_list
After commit a09ceb0e08 ("sched: remove qdisc->drop"),
it is no longer used.

Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-24 16:42:46 +09:00
Cong Wang 35b42da69e net_sched: remove a bogus warning in hfsc
In update_vf():

  cftree_remove(cl);
  update_cfmin(cl->cl_parent);

the cl_cfmin of cl->cl_parent is intentionally updated to 0
when that parent only has one child. And if this parent is
root qdisc, we could end up, in hfsc_schedule_watchdog(),
that we can't decide the next schedule time for qdisc watchdog.
But it seems safe that we can just skip it, as this watchdog is
not always scheduled anyway.

Thanks to Marco for testing all the cases, nothing is broken.

Reported-by: Marco Berizzi <pupilla@libero.it>
Tested-by: Marco Berizzi <pupilla@libero.it>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-23 10:58:46 +09:00
Paolo Abeni 44a5cd436e cls_flower: fix use after free in flower S/W path
If flower filter is created without the skip_sw flag, fl_mask_put()
can race with fl_classify() and we can destroy the mask rhashtable
while a lookup operation is accessing it.

 BUG: unable to handle kernel paging request at 00000000000911d1
 PGD 0 P4D 0
 SMP PTI
 CPU: 3 PID: 5582 Comm: vhost-5541 Not tainted 4.18.0-rc1.vanilla+ #1950
 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
 RIP: 0010:rht_bucket_nested+0x20/0x60
 Code: 31 c8 c1 c1 18 29 c8 c3 66 90 8b 4f 04 ba 01 00 00 00 8b 07 48 8b bf 80 00 00 0
 RSP: 0018:ffffafc5cfbb7a48 EFLAGS: 00010206
 RAX: 0000000000001978 RBX: ffff9f12dff88a00 RCX: 00000000ffff9f12
 RDX: 00000000000911d1 RSI: 0000000000000148 RDI: 0000000000000001
 RBP: ffff9f12dff88a00 R08: 000000005f1cc119 R09: 00000000a715fae2
 R10: ffffafc5cfbb7aa8 R11: ffff9f1cb4be804e R12: ffff9f1265e13000
 R13: 0000000000000000 R14: ffffafc5cfbb7b48 R15: ffff9f12dff88b68
 FS:  0000000000000000(0000) GS:ffff9f1d3f0c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000000911d1 CR3: 0000001575a94006 CR4: 00000000001626e0
 Call Trace:
  fl_lookup+0x134/0x140 [cls_flower]
  fl_classify+0xf3/0x180 [cls_flower]
  tcf_classify+0x78/0x150
  __netif_receive_skb_core+0x69e/0xa50
  netif_receive_skb_internal+0x42/0xf0
  tun_get_user+0xdd5/0xfd0 [tun]
  tun_sendmsg+0x52/0x70 [tun]
  handle_tx+0x2b3/0x5f0 [vhost_net]
  vhost_worker+0xab/0x100 [vhost]
  kthread+0xf8/0x130
  ret_from_fork+0x35/0x40
 Modules linked in: act_mirred act_gact cls_flower vhost_net vhost tap sch_ingress
 CR2: 00000000000911d1

Fix the above waiting for a RCU grace period before destroying the
rhashtable: we need to use tcf_queue_work(), as rhashtable_destroy()
must run in process context, as pointed out by Cong Wang.

v1 -> v2: use tcf_queue_work to run rhashtable_destroy().

Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 07:24:07 +09:00
Davide Caratti cbf56c2962 net/sched: act_ife: preserve the action control in case of error
in the following script

 # tc actions add action ife encode allow prio pass index 42
 # tc actions replace action ife encode allow tcindex drop index 42

the action control should remain equal to 'pass', if the kernel failed
to replace the TC action. Pospone the assignment of the action control,
to ensure it is not overwritten in the error path of tcf_ife_init().

Fixes: ef6980b6be ("introduce IFE action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-20 09:03:58 +09:00
Davide Caratti 0a889b9404 net/sched: act_ife: fix recursive lock and idr leak
a recursive lock warning [1] can be observed with the following script,

 # $TC actions add action ife encode allow prio pass index 42
 IFE type 0xED3E
 # $TC actions replace action ife encode allow tcindex pass index 42

in case the kernel was unable to run the last command (e.g. because of
the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
the kernel can leak idr in the error path of tcf_ife_init(), because
tcf_idr_release() is not called after successful idr reservation:

 # $TC actions add action ife encode allow tcindex index 47
 IFE type 0xED3E
 RTNETLINK answers: No such file or directory
 We have an error talking to the kernel
 # $TC actions add action ife encode allow tcindex index 47
 IFE type 0xED3E
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
 IFE type 0xFEFE
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel

Since tcfa_lock is already taken when the action is being edited, a call
to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
again. On the other hand, tcf_idr_release() needs to be called in the
error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
Fix both problems in tcf_ife_init().
Since the cleanup() routine can now be called when ife->params is NULL,
also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).

 [1]
 ============================================
 WARNING: possible recursive locking detected
 4.17.0-rc4.kasan+ #417 Tainted: G            E
 --------------------------------------------
 tc/3932 is trying to acquire lock:
 000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_cleanup+0x19/0x80 [act_ife]

 but task is already holding lock:
 000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_init+0xf6d/0x13c0 [act_ife]

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(&p->tcfa_lock)->rlock);
   lock(&(&p->tcfa_lock)->rlock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by tc/3932:
  #0: 000000007ca8e990 (rtnl_mutex){+.+.}, at: tcf_ife_init+0xf61/0x13c0 [act_ife]
  #1: 000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_init+0xf6d/0x13c0 [act_ife]

 stack backtrace:
 CPU: 3 PID: 3932 Comm: tc Tainted: G            E     4.17.0-rc4.kasan+ #417
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 Call Trace:
  dump_stack+0x9a/0xeb
  __lock_acquire+0xf43/0x34a0
  ? debug_check_no_locks_freed+0x2b0/0x2b0
  ? debug_check_no_locks_freed+0x2b0/0x2b0
  ? debug_check_no_locks_freed+0x2b0/0x2b0
  ? __mutex_lock+0x62f/0x1240
  ? kvm_sched_clock_read+0x1a/0x30
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1d0
  ? lock_acquire+0x10b/0x330
  lock_acquire+0x10b/0x330
  ? tcf_ife_cleanup+0x19/0x80 [act_ife]
  _raw_spin_lock_bh+0x38/0x70
  ? tcf_ife_cleanup+0x19/0x80 [act_ife]
  tcf_ife_cleanup+0x19/0x80 [act_ife]
  __tcf_idr_release+0xff/0x350
  tcf_ife_init+0xdde/0x13c0 [act_ife]
  ? ife_exit_net+0x290/0x290 [act_ife]
  ? __lock_is_held+0xb4/0x140
  tcf_action_init_1+0x67b/0xad0
  ? tcf_action_dump_old+0xa0/0xa0
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? kvm_sched_clock_read+0x1a/0x30
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? memset+0x1f/0x40
  tcf_action_init+0x30f/0x590
  ? tcf_action_init_1+0xad0/0xad0
  ? memset+0x1f/0x40
  tc_ctl_action+0x48e/0x5e0
  ? mutex_lock_io_nested+0x1160/0x1160
  ? tca_action_gd+0x990/0x990
  ? sched_clock+0x5/0x10
  ? find_held_lock+0x39/0x1d0
  rtnetlink_rcv_msg+0x4da/0x990
  ? validate_linkmsg+0x680/0x680
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1d0
  netlink_rcv_skb+0x127/0x350
  ? validate_linkmsg+0x680/0x680
  ? netlink_ack+0x970/0x970
  ? __kmalloc_node_track_caller+0x304/0x3a0
  netlink_unicast+0x40f/0x5d0
  ? netlink_attachskb+0x580/0x580
  ? _copy_from_iter_full+0x187/0x760
  ? import_iovec+0x90/0x390
  netlink_sendmsg+0x67f/0xb50
  ? netlink_unicast+0x5d0/0x5d0
  ? copy_msghdr_from_user+0x206/0x340
  ? netlink_unicast+0x5d0/0x5d0
  sock_sendmsg+0xb3/0xf0
  ___sys_sendmsg+0x60a/0x8b0
  ? copy_msghdr_from_user+0x340/0x340
  ? lock_downgrade+0x5e0/0x5e0
  ? tty_write_lock+0x18/0x50
  ? kvm_sched_clock_read+0x1a/0x30
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1d0
  ? lock_downgrade+0x5e0/0x5e0
  ? lock_acquire+0x10b/0x330
  ? __audit_syscall_entry+0x316/0x690
  ? current_kernel_time64+0x6b/0xd0
  ? __fget_light+0x55/0x1f0
  ? __sys_sendmsg+0xd2/0x170
  __sys_sendmsg+0xd2/0x170
  ? __ia32_sys_shutdown+0x70/0x70
  ? syscall_trace_enter+0x57a/0xd60
  ? rcu_read_lock_sched_held+0xdc/0x110
  ? __bpf_trace_sys_enter+0x10/0x10
  ? do_syscall_64+0x22/0x480
  do_syscall_64+0xa5/0x480
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fd646988ba0
 RSP: 002b:00007fffc9fab3c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fffc9fab4f0 RCX: 00007fd646988ba0
 RDX: 0000000000000000 RSI: 00007fffc9fab440 RDI: 0000000000000003
 RBP: 000000005b28c8b3 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fffc9faae20 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fffc9fab504 R14: 0000000000000001 R15: 000000000066c100

Fixes: 4e8c861550 ("net sched: net sched: ife action fix late binding")
Fixes: ef6980b6be ("introduce IFE action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-20 09:03:58 +09:00
Konstantin Khlebnikov 7e85dc8cb3 net_sched: blackhole: tell upper qdisc about dropped packets
When blackhole is used on top of classful qdisc like hfsc it breaks
qlen and backlog counters because packets are disappear without notice.

In HFSC non-zero qlen while all classes are inactive triggers warning:
WARNING: ... at net/sched/sch_hfsc.c:1393 hfsc_dequeue+0xba4/0xe90 [sch_hfsc]
and schedules watchdog work endlessly.

This patch return __NET_XMIT_BYPASS in addition to NET_XMIT_SUCCESS,
this flag tells upper layer: this packet is gone and isn't queued.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-17 08:42:33 +09:00
Kees Cook 778e1cdd81 treewide: kvzalloc() -> kvcalloc()
The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
patch replaces cases of:

        kvzalloc(a * b, gfp)

with:
        kvcalloc(a * b, gfp)

as well as handling cases of:

        kvzalloc(a * b * c, gfp)

with:

        kvzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kvcalloc(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kvzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kvzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kvzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kvzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kvzalloc
+ kvcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kvzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kvzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kvzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kvzalloc(C1 * C2 * C3, ...)
|
  kvzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kvzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kvzalloc(sizeof(THING) * C2, ...)
|
  kvzalloc(sizeof(TYPE) * C2, ...)
|
  kvzalloc(C1 * C2 * C3, ...)
|
  kvzalloc(C1 * C2, ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Davide Caratti 8d499533e0 net/sched: act_simple: fix parsing of TCA_DEF_DATA
use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
end with '\0' character.

v2: fix errors in the commit message, thanks Hangbin Liu

Fixes: fa1b1cff3d ("net_cls_act: Make act_simple use of netlink policy.")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-08 18:49:19 -04:00
Linus Torvalds 1c8c5a9d38 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add Maglev hashing scheduler to IPVS, from Inju Song.

 2) Lots of new TC subsystem tests from Roman Mashak.

 3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

 5) Add ttl inherit support to vxlan, from Hangbin Liu.

 6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

 8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

11) Plumb extack down into fib_rules, from Roopa Prabhu.

12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

13) Add UDP GSO support, from Willem de Bruijn.

14) Add documentation for eBPF helpers, from Quentin Monnet.

15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

24) Add TCP SACK compression, from Eric Dumazet.

25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

28) Add lots of forwarding selftests, from Petr Machata.

29) Add generic network device failover driver, from Sridhar Samudrala.

* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
  strparser: Add __strp_unpause and use it in ktls.
  rxrpc: Fix terminal retransmission connection ID to include the channel
  net: hns3: Optimize PF CMDQ interrupt switching process
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Fix for VF mailbox cannot receiving PF response
  bnx2x: use the right constant
  Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
  net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
  enic: fix UDP rss bits
  netdev-FAQ: clarify DaveM's position for stable backports
  rtnetlink: validate attributes in do_setlink()
  mlxsw: Add extack messages for port_{un, }split failures
  netdevsim: Add extack error message for devlink reload
  devlink: Add extack to reload and port_{un, }split operations
  net: metrics: add proper netlink validation
  ipmr: fix error path when ipmr_new_table fails
  ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
  net: hns3: remove unused hclgevf_cfg_func_mta_filter
  netfilter: provide udp*_lib_lookup for nf_tproxy
  qed*: Utilize FW 8.37.2.0
  ...
2018-06-06 18:39:49 -07:00
David S. Miller 9a99dc1c41 Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
This reverts commit d96a43c664.

This potentially breaks things, so reverting as per
request by Jakub Kicinski.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-06 13:55:47 -04:00
Paul Blakey d96a43c664 net: sched: cls: Fix offloading when ingress dev is vxlan
When using a vxlan device as the ingress dev, we count it as a
"no offload dev", so when such a rule comes and err stop is true,
we fail early and don't try the egdev route which can offload it
through the egress device.

Fix that by not calling the block offload if one of the devices
attached to it is not offload capable, but make sure egress on such case
is capable instead.

Fixes: caa7260156 ("net: sched: keep track of offloaded filters [..]")
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05 10:29:58 -04:00
Vlad Buslov 0e3990356d net: sched: return error code when tcf proto is not found
If requested tcf proto is not found, get and del filter netlink protocol
handlers output error message to extack, but do not return actual error
code. Add check to return ENOENT when result of tp find function is NULL
pointer.

Fixes: c431f89b18 ("net: sched: split tc_ctl_tfilter into three handlers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 17:31:44 -04:00
Paul Blakey f6521c587a cls_flower: Fix comparing of old filter mask with new filter
We incorrectly compare the mask and the result is that we can't modify
an already existing rule.

Fix that by comparing correctly.

Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Reported-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 17:03:37 -04:00
Paul Blakey de9dc650f0 cls_flower: Fix missing free of rhashtable
When destroying the instance, destroy the head rhashtable.

Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Reported-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 17:03:37 -04:00
Linus Torvalds cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
David S. Miller 9c54aeb03a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-03 09:31:58 -04:00
Vlad Buslov c431f89b18 net: sched: split tc_ctl_tfilter into three handlers
tc_ctl_tfilter handles three netlink message types: RTM_NEWTFILTER,
RTM_DELTFILTER, RTM_GETTFILTER. However, implementation of this function
involves a lot of branching on specific message type because most of the
code is message-specific. This significantly complicates adding new
functionality and doesn't provide much benefit of code reuse.

Split tc_ctl_tfilter to three standalone functions that handle filter new,
delete and get requests.

The only truly protocol independent part of tc_ctl_tfilter is code that
looks up queue, class, and block. Refactor this code to standalone
tcf_block_find function that is used by all three new handlers.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-01 11:13:50 -04:00
Paul Blakey 8258d2da9f cls_flower: Fix incorrect idr release when failing to modify rule
When we fail to modify a rule, we incorrectly release the idr handle
of the unmodified old rule.

Fix that by checking if we need to release it.

Fixes: fe2502e49b ("net_sched: remove cls_flower idr on failure")
Reported-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-31 15:18:44 -04:00
Song Liu 4341f8308d net: remove bypassed check in sch_direct_xmit()
Checking netif_xmit_frozen_or_stopped() at the end of sch_direct_xmit()
is being bypassed. This is because "ret" from sch_direct_xmit() will be
either NETDEV_TX_OK or NETDEV_TX_BUSY, and only ret == NETDEV_TX_OK == 0
will reach the condition:

    if (ret && netif_xmit_frozen_or_stopped(txq))
        return false;

This patch cleans up the code by removing the whole condition.

For more discussion about this, please refer to
   https://marc.info/?t=152727195700008

Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-31 13:26:19 -04:00
Jakub Kicinski 47c669a406 net: sched: mq: request stats from offloads
MQ doesn't hold any statistics on its own, however, statistic
from offloads are requested starting from the root, hence MQ
will read the old values for its sums.  Call into the drivers,
because of the additive nature of the stats drivers are aware
of how much "pending updates" they have to children of the MQ.
Since MQ reset its stats on every dump we can simply offset
the stats, predicting how stats of offloaded children will
change.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 09:49:16 -04:00
Jakub Kicinski f971b13230 net: sched: mq: add simple offload notification
mq offload is trivial, we just need to let the device know
that the root qdisc is mq.  Alternative approach would be
to export qdisc_lookup() and make drivers check the root
type themselves, but notification via ndo_setup_tc is more
in line with other qdiscs.

Note that mq doesn't hold any stats on it's own, it just
adds up stats of its children.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 09:49:16 -04:00
David S. Miller 5b79c2af66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of easy overlapping changes in the confict
resolutions here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26 19:46:15 -04:00
Cong Wang aaa908ffbe net_sched: switch to rcu_work
Commit 05f0fe6b74 ("RCU, workqueue: Implement rcu_work") introduces
new API's for dispatching work in a RCU callback. Now we can just
switch to the new API's for tc filters. This could get rid of a lot
of code.

Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 22:56:15 -04:00
Or Gerlitz f8f4bef322 net : sched: cls_api: deal with egdev path only if needed
When dealing with ingress rule on a netdev, if we did fine through the
conventional path, there's no need to continue into the egdev route,
and we can stop right there.

Not doing so may cause a 2nd rule to be added by the cls api layer
with the ingress being the egdev.

For example, under sriov switchdev scheme, a user rule of VFR A --> VFR B
will end up with two HW rules (1) VF A --> VF B and (2) uplink --> VF B

Fixes: 208c0f4b52 ('net: sched: use tc_setup_cb_call to call per-block callbacks')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 22:12:22 -04:00
Vlad Buslov 290aa0ad74 net: sched: don't disable bh when accessing action idr
Initial net_device implementation used ingress_lock spinlock to synchronize
ingress path of device. This lock was used in both process and bh context.
In some code paths action map lock was obtained while holding ingress_lock.
Commit e1e992e52f ("[NET_SCHED] protect action config/dump from irqs")
modified actions to always disable bh, while using action map lock, in
order to prevent deadlock on ingress_lock in softirq. This lock was removed
from net_device, so disabling bh, while accessing action map, is no longer
necessary.

Replace all action idr spinlock usage with regular calls that do not
disable bh.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-22 15:34:34 -04:00
David S. Miller 6f6e434aa2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.

TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.

The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.

Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21 16:01:54 -04:00
Paolo Abeni 44a63b137f net: sched: red: avoid hashing NULL child
Hangbin reported an Oops triggered by the syzkaller qdisc rules:

 kasan: GPF could be caused by NULL-ptr deref or user memory access
 general protection fault: 0000 [#1] SMP KASAN PTI
 Modules linked in: sch_red
 CPU: 0 PID: 28699 Comm: syz-executor5 Not tainted 4.17.0-rc4.kcov #1
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:qdisc_hash_add+0x26/0xa0
 RSP: 0018:ffff8800589cf470 EFLAGS: 00010203
 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff824ad971
 RDX: 0000000000000007 RSI: ffffc9000ce9f000 RDI: 000000000000003c
 RBP: 0000000000000001 R08: ffffed000b139ea2 R09: ffff8800589cf4f0
 R10: ffff8800589cf50f R11: ffffed000b139ea2 R12: ffff880054019fc0
 R13: ffff880054019fb4 R14: ffff88005c0af600 R15: ffff880054019fb0
 FS:  00007fa6edcb1700(0000) GS:ffff88005ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000020000740 CR3: 000000000fc16000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  red_change+0x2d2/0xed0 [sch_red]
  qdisc_create+0x57e/0xef0
  tc_modify_qdisc+0x47f/0x14e0
  rtnetlink_rcv_msg+0x6a8/0x920
  netlink_rcv_skb+0x2a2/0x3c0
  netlink_unicast+0x511/0x740
  netlink_sendmsg+0x825/0xc30
  sock_sendmsg+0xc5/0x100
  ___sys_sendmsg+0x778/0x8e0
  __sys_sendmsg+0xf5/0x1b0
  do_syscall_64+0xbd/0x3b0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x450869
 RSP: 002b:00007fa6edcb0c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fa6edcb16b4 RCX: 0000000000450869
 RDX: 0000000000000000 RSI: 00000000200000c0 RDI: 0000000000000013
 RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
 R13: 0000000000008778 R14: 0000000000702838 R15: 00007fa6edcb1700
 Code: e9 0b fe ff ff 0f 1f 44 00 00 55 53 48 89 fb 89 f5 e8 3f 07 f3 fe 48 8d 7b 3c 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 51
 RIP: qdisc_hash_add+0x26/0xa0 RSP: ffff8800589cf470

When a red qdisc is updated with a 0 limit, the child qdisc is left
unmodified, no additional scheduler is created in red_change(),
the 'child' local variable is rightfully NULL and must not add it
to the hash table.

This change addresses the above issue moving qdisc_hash_add() right
after the child qdisc creation. It additionally removes unneeded checks
for noop_qdisc.

Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Fixes: 49b499718f ("net: sched: make default fifo qdiscs appear in the dump")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 13:52:32 -04:00
Paolo Abeni 021a17ed79 pfifo_fast: drop unneeded additional lock on dequeue
After the previous patch, for NOLOCK qdiscs, q->seqlock is
always held when the dequeue() is invoked, we can drop
any additional locking to protect such operation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 12:46:54 -04:00
Paolo Abeni 96009c7d50 sched: replace __QDISC_STATE_RUNNING bit with a spin lock
So that we can use lockdep on it.
The newly introduced sequence lock has the same scope of busylock,
so it shares the same lockdep annotation, but it's only used for
NOLOCK qdiscs.

With this changeset we acquire such lock in the control path around
flushing operation (qdisc reset), to allow more NOLOCK qdisc perf
improvement in the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 12:46:54 -04:00
Davide Caratti 5a4931ae01 net/sched: fix refcnt leak in the error path of tcf_vlan_init()
Similarly to what was done with commit a52956dfc5 ("net sched actions:
fix refcnt leak in skbmod"), fix the error path of tcf_vlan_init() to avoid
refcnt leaks when wrong value of TCA_VLAN_PUSH_VLAN_PROTOCOL is given.

Fixes: 5026c9b1ba ("net sched: vlan action fix late binding")
CC: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 14:41:44 -04:00
Paolo Abeni 32f7b44d0f sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
Currently NOLOCK qdiscs pay a measurable overhead to atomically
manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per
packet in the uncontended scenario with packet rate below the
line rate: on packed dequeue and on the next, failing dequeue attempt.

This changeset moves the bit manipulation into the qdisc_run_{begin,end}
helpers, so that the bit is now flipped only once per packet, with
measurable performance improvement in the uncontended scenario.

This also allows simplifying the qdisc teardown code path - since
qdisc_is_running() is now effective for each qdisc type - and avoid a
possible race between qdisc_run() and dev_deactivate_many(), as now
the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy
dequeuing packets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 12:26:25 -04:00
Christoph Hellwig 3f3942aca6 proc: introduce proc_create_single{,_data}
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
David S. Miller b2d6cee117 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The bpf syscall and selftests conflicts were trivial
overlapping changes.

The r8169 change involved moving the added mdelay from 'net' into a
different function.

A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts.  I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 20:53:22 -04:00
Roman Mashak a52956dfc5 net sched actions: fix refcnt leak in skbmod
When application fails to pass flags in netlink TLV when replacing
existing skbmod action, the kernel will leak refcnt:

$ tc actions get action skbmod index 1
total acts 0

        action order 0: skbmod pipe set smac 00:11:22:33:44:55
         index 1 ref 1 bind 0

For example, at this point a buggy application replaces the action with
index 1 with new smac 00:aa:22:33:44:55, it fails because of zero flags,
however refcnt gets bumped:

$ tc actions get actions skbmod index 1
total acts 0

        action order 0: skbmod pipe set smac 00:11:22:33:44:55
         index 1 ref 2 bind 0
$

Tha patch fixes this by calling tcf_idr_release() on existing actions.

Fixes: 86da71b573 ("net_sched: Introduce skbmod action")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 16:37:03 -04:00
Jiri Pirko d68d75fdc3 net: sched: fix error path in tcf_proto_create() when modules are not configured
In case modules are not configured, error out when tp->ops is null
and prevent later null pointer dereference.

Fixes: 33a48927c1 ("sched: push TC filter protocol creation into a separate function")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 16:34:38 -04:00
Roman Mashak af5d01842f net sched actions: fix invalid pointer dereferencing if skbedit flags missing
When application fails to pass flags in netlink TLV for a new skbedit action,
the kernel results in the following oops:

[    8.307732] BUG: unable to handle kernel paging request at 0000000000021130
[    8.309167] PGD 80000000193d1067 P4D 80000000193d1067 PUD 180e0067 PMD 0
[    8.310595] Oops: 0000 [#1] SMP PTI
[    8.311334] Modules linked in: kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw
[    8.314190] CPU: 1 PID: 397 Comm: tc Not tainted 4.17.0-rc3+ #357
[    8.315252] RIP: 0010:__tcf_idr_release+0x33/0x140
[    8.316203] RSP: 0018:ffffa0718038f840 EFLAGS: 00010246
[    8.317123] RAX: 0000000000000001 RBX: 0000000000021100 RCX: 0000000000000000
[    8.319831] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000021100
[    8.321181] RBP: 0000000000000000 R08: 000000000004adf8 R09: 0000000000000122
[    8.322645] R10: 0000000000000000 R11: ffffffff9e5b01ed R12: 0000000000000000
[    8.324157] R13: ffffffff9e0d3cc0 R14: 0000000000000000 R15: 0000000000000000
[    8.325590] FS:  00007f591292e700(0000) GS:ffff8fcf5bc40000(0000) knlGS:0000000000000000
[    8.327001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.327987] CR2: 0000000000021130 CR3: 00000000180e6004 CR4: 00000000001606a0
[    8.329289] Call Trace:
[    8.329735]  tcf_skbedit_init+0xa7/0xb0
[    8.330423]  tcf_action_init_1+0x362/0x410
[    8.331139]  ? try_to_wake_up+0x44/0x430
[    8.331817]  tcf_action_init+0x103/0x190
[    8.332511]  tc_ctl_action+0x11a/0x220
[    8.333174]  rtnetlink_rcv_msg+0x23d/0x2e0
[    8.333902]  ? _cond_resched+0x16/0x40
[    8.334569]  ? __kmalloc_node_track_caller+0x5b/0x2c0
[    8.335440]  ? rtnl_calcit.isra.31+0xf0/0xf0
[    8.336178]  netlink_rcv_skb+0xdb/0x110
[    8.336855]  netlink_unicast+0x167/0x220
[    8.337550]  netlink_sendmsg+0x2a7/0x390
[    8.338258]  sock_sendmsg+0x30/0x40
[    8.338865]  ___sys_sendmsg+0x2c5/0x2e0
[    8.339531]  ? pagecache_get_page+0x27/0x210
[    8.340271]  ? filemap_fault+0xa2/0x630
[    8.340943]  ? page_add_file_rmap+0x108/0x200
[    8.341732]  ? alloc_set_pte+0x2aa/0x530
[    8.342573]  ? finish_fault+0x4e/0x70
[    8.343332]  ? __handle_mm_fault+0xbc1/0x10d0
[    8.344337]  ? __sys_sendmsg+0x53/0x80
[    8.345040]  __sys_sendmsg+0x53/0x80
[    8.345678]  do_syscall_64+0x4f/0x100
[    8.346339]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.347206] RIP: 0033:0x7f591191da67
[    8.347831] RSP: 002b:00007fff745abd48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[    8.349179] RAX: ffffffffffffffda RBX: 00007fff745abe70 RCX: 00007f591191da67
[    8.350431] RDX: 0000000000000000 RSI: 00007fff745abdc0 RDI: 0000000000000003
[    8.351659] RBP: 000000005af35251 R08: 0000000000000001 R09: 0000000000000000
[    8.352922] R10: 00000000000005f1 R11: 0000000000000246 R12: 0000000000000000
[    8.354183] R13: 00007fff745afed0 R14: 0000000000000001 R15: 00000000006767c0
[    8.355400] Code: 41 89 d4 53 89 f5 48 89 fb e8 aa 20 fd ff 85 c0 0f 84 ed 00
00 00 48 85 db 0f 84 cf 00 00 00 40 84 ed 0f 85 cd 00 00 00 45 84 e4 <8b> 53 30
74 0d 85 d2 b8 ff ff ff ff 0f 8f b3 00 00 00 8b 43 2c
[    8.358699] RIP: __tcf_idr_release+0x33/0x140 RSP: ffffa0718038f840
[    8.359770] CR2: 0000000000021130
[    8.360438] ---[ end trace 60c66be45dfc14f0 ]---

The caller calls action's ->init() and passes pointer to "struct tc_action *a",
which later may be initialized to point at the existing action, otherwise
"struct tc_action *a" is still invalid, and therefore dereferencing it is an
error as happens in tcf_idr_release, where refcnt is decremented.

So in case of missing flags tcf_idr_release must be called only for
existing actions.

v2:
    - prepare patch for net tree

Fixes: 5e1567aeb7 ("net sched: skbedit action fix late binding")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 15:52:43 -04:00
David S. Miller a7b15ab887 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes in selftests Makefile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 09:58:56 -04:00
Craig Dillabaugh 29e6eee192 net sched: Implemented get_fill_size routine for act_csum.
Signed-off-by: Craig Dillabaugh <cdillaba@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-03 11:15:58 -04:00
Eric Dumazet 7df40c2673 net_sched: fq: take care of throttled flows before reuse
Normally, a socket can not be freed/reused unless all its TX packets
left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
this to happen.

With commit fc59d5bdf1 ("pkt_sched: fq: clear time_next_packet for
reused flows") we cleared f->time_next_packet but took no special
action if the flow was still in the throttled rb-tree.

Since f->time_next_packet is the key used in the rb-tree searches,
blindly clearing it might break rb-tree integrity. We need to make
sure the flow is no longer in the rb-tree to avoid this problem.

Fixes: fc59d5bdf1 ("pkt_sched: fq: clear time_next_packet for reused flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 16:37:38 -04:00
Paul Blakey 05cd271fd6 cls_flower: Support multiple masks per priority
Currently flower doesn't support inserting filters with different masks
on a single priority, even if the actual flows (key + mask) inserted
aren't overlapping, as with the use case of offloading openvswitch
datapath flows. Instead one must go up one level, and assign different
priorities for each mask, which will create a different flower
instances.

This patch opens flower to support more than one mask per priority,
and a single flower instance. It does so by adding another hash table
on top of the existing one which will store the different masks,
and the filters that share it.

The user is left with the responsibility of ensuring non overlapping
flows, otherwise precedence is not guaranteed.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 14:14:15 -04:00
Alexander Aring cc74eddd0f net: sched: ife: handle malformed tlv length
There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Alexander Aring f6cd14537f net: sched: ife: signal not finding metaid
We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Cong Wang f12c643209 net_sched: fix a missing idr_remove() in u32_delete_key()
When we delete a u32 key via u32_delete_key(), we forget to
call idr_remove() to remove its handle from IDR.

Fixes: e7614370d6 ("net_sched: use idr to allocate u32 filter handles")
Reported-by: Marcin Kabiesz <admin@hostcenter.eu>
Tested-by: Marcin Kabiesz <admin@hostcenter.eu>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 12:36:45 -04:00
Davide Caratti 3239534a79 net/sched: fix NULL dereference in the error path of tcf_bpf_init()
when tcf_bpf_init_from_ops() fails (e.g. because of program having invalid
number of instructions), tcf_bpf_cfg_cleanup() calls bpf_prog_put(NULL) or
bpf_prog_destroy(NULL). Unless CONFIG_BPF_SYSCALL is unset, this causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
 PGD 800000007345a067 P4D 800000007345a067 PUD 340e1067 PMD 0
 Oops: 0000 [#1] SMP PTI
 Modules linked in: act_bpf(E) ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd glue_helper cryptd joydev snd_timer snd virtio_balloon pcspkr soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm virtio_net virtio_console i2c_core crc32c_intel serio_raw virtio_pci ata_piix libata virtio_ring floppy virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_bpf]
 CPU: 3 PID: 5654 Comm: tc Tainted: G            E    4.16.0.bpf_test+ #408
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__bpf_prog_put+0xc/0xc0
 RSP: 0018:ffff9594003ef728 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff9594003ef758 RCX: 0000000000000024
 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
 R10: 0000000000000220 R11: ffff8a7ab9f17131 R12: 0000000000000000
 R13: ffff8a7ab7c3c8e0 R14: 0000000000000001 R15: ffff8a7ab88f1054
 FS:  00007fcb2f17c740(0000) GS:ffff8a7abfd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000020 CR3: 000000007c888006 CR4: 00000000001606e0
 Call Trace:
  tcf_bpf_cfg_cleanup+0x2f/0x40 [act_bpf]
  tcf_bpf_cleanup+0x4c/0x70 [act_bpf]
  __tcf_idr_release+0x79/0x140
  tcf_bpf_init+0x125/0x330 [act_bpf]
  tcf_action_init_1+0x2cc/0x430
  ? get_page_from_freelist+0x3f0/0x11b0
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.29+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? mem_cgroup_commit_charge+0x80/0x130
  ? page_add_new_anon_rmap+0x73/0xc0
  ? do_anonymous_page+0x2a2/0x560
  ? __handle_mm_fault+0xc75/0xe20
  __sys_sendmsg+0x58/0xa0
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fcb2e58eba0
 RSP: 002b:00007ffc93c496c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffc93c497f0 RCX: 00007fcb2e58eba0
 RDX: 0000000000000000 RSI: 00007ffc93c49740 RDI: 0000000000000003
 RBP: 000000005ac6a646 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007ffc93c49120 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffc93c49804 R14: 0000000000000001 R15: 000000000066afa0
 Code: 5f 00 48 8b 43 20 48 c7 c7 70 2f 7c b8 c7 40 10 00 00 00 00 5b e9 a5 8b 61 00 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 <48> 8b 47 20 f0 ff 08 74 05 5b 5d 41 5c c3 41 89 f4 0f 1f 44 00
 RIP: __bpf_prog_put+0xc/0xc0 RSP: ffff9594003ef728
 CR2: 0000000000000020

Fix it in tcf_bpf_cfg_cleanup(), ensuring that bpf_prog_{put,destroy}(f)
is called only when f is not NULL.

Fixes: bbc09e7842 ("net/sched: fix idr leak on the error path of tcf_bpf_init()")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 22:26:06 -04:00
Li RongQing da01ec4ee5 net: sched: do not emit messages while holding spinlock
move messages emitting out of sch_tree_lock to avoid holding
this lock too long.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 20:53:50 -04:00
David S. Miller c0b458a946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:

1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
   MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE

2) In 'net-next' params->log_rq_size is renamed to be
   params->log_rq_mtu_frames.

3) In 'net-next' params->hard_mtu is added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 19:49:34 -04:00
Kirill Tkhai 2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Craig Dillabaugh 734549eb55 net sched actions: fix dumping which requires several messages to user space
Fixes a bug in the tcf_dump_walker function that can cause some actions
to not be reported when dumping a large number of actions. This issue
became more aggrevated when cookies feature was added. In particular
this issue is manifest when large cookie values are assigned to the
actions and when enough actions are created that the resulting table
must be dumped in multiple batches.

The number of actions returned in each batch is limited by the total
number of actions and the memory buffer size.  With small cookies
the numeric limit is reached before the buffer size limit, which avoids
the code path triggering this bug. When large cookies are used buffer
fills before the numeric limit, and the erroneous code path is hit.

For example after creating 32 csum actions with the cookie
aaaabbbbccccdddd

$ tc actions ls action csum
total acts 26

    action order 0: csum (tcp) action continue
    index 1 ref 1 bind 0
    cookie aaaabbbbccccdddd

    .....

    action order 25: csum (tcp) action continue
    index 26 ref 1 bind 0
    cookie aaaabbbbccccdddd
total acts 6

    action order 0: csum (tcp) action continue
    index 28 ref 1 bind 0
    cookie aaaabbbbccccdddd

    ......

    action order 5: csum (tcp) action continue
    index 32 ref 1 bind 0
    cookie aaaabbbbccccdddd

Note that the action with index 27 is omitted from the report.

Fixes: 4b3550ef53 ("[NET_SCHED]: Use nla_nest_start/nla_nest_end")"
Signed-off-by: Craig Dillabaugh <cdillaba@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:59:12 -04:00
John Fastabend eb82a99447 net: sched, fix OOO packets with pfifo_fast
After the qdisc lock was dropped in pfifo_fast we allow multiple
enqueue threads and dequeue threads to run in parallel. On the
enqueue side the skb bit ooo_okay is used to ensure all related
skbs are enqueued in-order. On the dequeue side though there is
no similar logic. What we observe is with fewer queues than CPUs
it is possible to re-order packets when two instances of
__qdisc_run() are running in parallel. Each thread will dequeue
a skb and then whichever thread calls the ndo op first will
be sent on the wire. This doesn't typically happen because
qdisc_run() is usually triggered by the same core that did the
enqueue. However, drivers will trigger __netif_schedule()
when queues are transitioning from stopped to awake using the
netif_tx_wake_* APIs. When this happens netif_schedule() calls
qdisc_run() on the same CPU that did the netif_tx_wake_* which
is usually done in the interrupt completion context. This CPU
is selected with the irq affinity which is unrelated to the
enqueue operations.

To resolve this we add a RUNNING bit to the qdisc to ensure
only a single dequeue per qdisc is running. Enqueue and dequeue
operations can still run in parallel and also on multi queue
NICs we can still have a dequeue in-flight per qdisc, which
is typically per CPU.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakob Unterwurzacher <jakob.unterwurzacher@theobroma-systems.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:36:23 -04:00
Davide Caratti 94cb549240 net/sched: act_vlan: declare push_vid with host byte order
use u16 in place of __be16 to suppress the following sparse warnings:

 net/sched/act_vlan.c:150:26: warning: incorrect type in assignment (different base types)
 net/sched/act_vlan.c:150:26: expected restricted __be16 [usertype] push_vid
 net/sched/act_vlan.c:150:26: got unsigned short
 net/sched/act_vlan.c:151:21: warning: restricted __be16 degrades to integer
 net/sched/act_vlan.c:208:26: warning: incorrect type in assignment (different base types)
 net/sched/act_vlan.c:208:26: expected unsigned short [unsigned] [usertype] tcfv_push_vid
 net/sched/act_vlan.c:208:26: got restricted __be16 [usertype] push_vid

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 21:54:07 -04:00
Davide Caratti affaa0c724 net/sched: remove tcf_idr_cleanup()
tcf_idr_cleanup() is no more used, so remove it.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 21:52:19 -04:00
David S. Miller 03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Davide Caratti f29cdfbe33 net/sched: fix idr leak in the error path of tcf_skbmod_init()
tcf_skbmod_init() can fail after the idr has been successfully reserved.
When this happens, every subsequent attempt to configure skbmod rules
using the same idr value will systematically fail with -ENOSPC, unless
the first attempt was done using the 'replace' keyword:

 # tc action add action skbmod swap mac index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action skbmod swap mac index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action skbmod swap mac index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in tcf_skbmod_init(), ensuring that tcf_idr_release() is called
on the error path when the idr has been reserved, but not yet inserted.
Also, don't test 'ovr' in the error path, to avoid a 'replace' failure
implicitly become a 'delete' that leaks refcount in act_skbmod module:

 # rmmod act_skbmod; modprobe act_skbmod
 # tc action add action skbmod swap mac index 100
 # tc action add action skbmod swap mac continue index 100
 RTNETLINK answers: File exists
 We have an error talking to the kernel
 # tc action replace action skbmod swap mac continue index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action list action skbmod
 #
 # rmmod  act_skbmod
 rmmod: ERROR: Module act_skbmod is in use

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:37 -04:00
Davide Caratti d7f2001573 net/sched: fix idr leak in the error path of tcf_vlan_init()
tcf_vlan_init() can fail after the idr has been successfully reserved.
When this happens, every subsequent attempt to configure vlan rules using
the same idr value will systematically fail with -ENOSPC, unless the first
attempt was done using the 'replace' keyword.

 # tc action add action vlan pop index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action vlan pop index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action vlan pop index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in tcf_vlan_init(), ensuring that tcf_idr_release() is called on
the error path when the idr has been reserved, but not yet inserted. Also,
don't test 'ovr' in the error path, to avoid a 'replace' failure implicitly
become a 'delete' that leaks refcount in act_vlan module:

 # rmmod act_vlan; modprobe act_vlan
 # tc action add action vlan push id 5 index 100
 # tc action replace action vlan push id 7 index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action list action vlan
 #
 # rmmod act_vlan
 rmmod: ERROR: Module act_vlan is in use

Fixes: 4c5b9d9642 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:27 -04:00
Davide Caratti 1e46ef1762 net/sched: fix idr leak in the error path of __tcf_ipt_init()
__tcf_ipt_init() can fail after the idr has been successfully reserved.
When this happens, subsequent attempts to configure xt/ipt rules using
the same idr value systematically fail with -ENOSPC:

 # tc action add action xt -j LOG --log-prefix test1 index 100
 tablename: mangle hook: NF_IP_POST_ROUTING
         target:  LOG level warning prefix "test1" index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 Command "(null)" is unknown, try "tc actions help".
 # tc action add action xt -j LOG --log-prefix test1 index 100
 tablename: mangle hook: NF_IP_POST_ROUTING
         target:  LOG level warning prefix "test1" index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 Command "(null)" is unknown, try "tc actions help".
 # tc action add action xt -j LOG --log-prefix test1 index 100
 tablename: mangle hook: NF_IP_POST_ROUTING
         target:  LOG level warning prefix "test1" index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in the error path of __tcf_ipt_init(), calling tcf_idr_release()
in place of tcf_idr_cleanup(). Since tcf_ipt_release() can now be called
when tcfi_t is NULL, we also need to protect calls to ipt_destroy_target()
to avoid NULL pointer dereference.

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:16 -04:00
Davide Caratti 94fa3f929e net/sched: fix idr leak in the error path of tcp_pedit_init()
tcf_pedit_init() can fail to allocate 'keys' after the idr has been
successfully reserved. When this happens, subsequent attempts to configure
a pedit rule using the same idr value systematically fail with -ENOSPC:

 # tc action add action pedit munge ip ttl set 63 index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action pedit munge ip ttl set 63 index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action pedit munge ip ttl set 63 index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in the error path of tcf_act_pedit_init(), calling
tcf_idr_release() in place of tcf_idr_cleanup().

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:08 -04:00
Davide Caratti 5bf7f8185f net/sched: fix idr leak in the error path of tcf_act_police_init()
tcf_act_police_init() can fail after the idr has been successfully
reserved (e.g., qdisc_get_rtab() may return NULL). When this happens,
subsequent attempts to configure a police rule using the same idr value
systematiclly fail with -ENOSPC:

 # tc action add action police rate 1000 burst 1000 drop index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action police rate 1000 burst 1000 drop index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action police rate 1000 burst 1000 drop index 100
 RTNETLINK answers: No space left on device
 ...

Fix this in the error path of tcf_act_police_init(), calling
tcf_idr_release() in place of tcf_idr_cleanup().

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:00 -04:00
Davide Caratti 60e10b3adc net/sched: fix idr leak in the error path of tcf_simp_init()
if the kernel fails to duplicate 'sdata', creation of a new action fails
with -ENOMEM. However, subsequent attempts to install the same action
using the same value of 'index' systematically fail with -ENOSPC, and
that value of 'index' will no more be usable by act_simple, until rmmod /
insmod of act_simple.ko is done:

 # tc actions add action simple sdata hello index 100
 # tc actions list action simple

        action order 0: Simple <hello>
         index 100 ref 1 bind 0
 # tc actions flush action simple
 # tc actions add action simple sdata hello index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc actions flush action simple
 # tc actions add action simple sdata hello index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc actions add action simple sdata hello index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in the error path of tcf_simp_init(), calling tcf_idr_release()
in place of tcf_idr_cleanup().

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:11:53 -04:00
Davide Caratti bbc09e7842 net/sched: fix idr leak on the error path of tcf_bpf_init()
when the following command sequence is entered

 # tc action add action bpf bytecode '4,40 0 0 12,31 0 1 2048,6 0 0 262144,6 0 0 0' index 100
 RTNETLINK answers: Invalid argument
 We have an error talking to the kernel
 # tc action add action bpf bytecode '4,40 0 0 12,21 0 1 2048,6 0 0 262144,6 0 0 0' index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel

act_bpf correctly refuses to install the first TC rule, because 31 is not
a valid instruction. However, it refuses to install the second TC rule,
even if the BPF code is correct. Furthermore, it's no more possible to
install any other rule having the same value of 'index' until act_bpf
module is unloaded/inserted again. After the idr has been reserved, call
tcf_idr_release() instead of tcf_idr_cleanup(), to fix this issue.

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:11:46 -04:00
Davide Caratti 2d43361017 net/sched: fix NULL dereference on the error path of tcf_skbmod_init()
when the following command

 # tc action replace action skbmod swap mac index 100

is run for the first time, and tcf_skbmod_init() fails to allocate struct
tcf_skbmod_params, tcf_skbmod_cleanup() calls kfree_rcu(NULL), thus
causing the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: __call_rcu+0x23/0x2b0
 PGD 8000000034057067 P4D 8000000034057067 PUD 74937067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_skbmod(E) psample ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel snd_hda_codec crct10dif_pclmul mbcache jbd2 crc32_pclmul snd_hda_core ghash_clmulni_intel snd_hwdep pcbc snd_seq snd_seq_device snd_pcm aesni_intel snd_timer crypto_simd glue_helper snd cryptd virtio_balloon joydev soundcore pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_console virtio_net virtio_blk ata_piix libata crc32c_intel virtio_pci serio_raw virtio_ring virtio i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_skbmod]
 CPU: 3 PID: 3144 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffbd2e403e7798 EFLAGS: 00010246
 RAX: ffffffffc0872080 RBX: ffff981d34bff780 RCX: 00000000ffffffff
 RDX: ffffffff922a5f00 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000021f
 R10: 000000003d003000 R11: 0000000000aaaaaa R12: 0000000000000000
 R13: ffffffff922a5f00 R14: 0000000000000001 R15: ffff981d3b698c2c
 FS:  00007f3678292740(0000) GS:ffff981d3fd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 000000007c57a006 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_skbmod_init+0x1d1/0x210 [act_skbmod]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f36776a3ba0
 RSP: 002b:00007fff4703b618 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fff4703b740 RCX: 00007f36776a3ba0
 RDX: 0000000000000000 RSI: 00007fff4703b690 RDI: 0000000000000003
 RBP: 000000005aaaba36 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fff4703b0a0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fff4703b754 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffbd2e403e7798
 CR2: 0000000000000008

Fix it in tcf_skbmod_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.

Fixes: 86da71b573 ("net_sched: Introduce skbmod action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 19:53:29 -04:00
Davide Caratti 1f110e7cae net/sched: fix NULL dereference in the error path of tcf_sample_init()
when the following command

 # tc action add action sample rate 100 group 100 index 100

is run for the first time, and psample_group_get(100) fails to create a
new group, tcf_sample_cleanup() calls psample_group_put(NULL), thus
causing the following error:

 BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
 IP: psample_group_put+0x15/0x71 [psample]
 PGD 8000000075775067 P4D 8000000075775067 PUD 7453c067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_sample(E) psample ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core mbcache jbd2 crct10dif_pclmul snd_hwdep crc32_pclmul snd_seq ghash_clmulni_intel pcbc snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper snd cryptd joydev pcspkr i2c_piix4 soundcore virtio_balloon nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_net ata_piix virtio_console virtio_blk libata serio_raw crc32c_intel virtio_pci i2c_core virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_tunnel_key]
 CPU: 2 PID: 5740 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:psample_group_put+0x15/0x71 [psample]
 RSP: 0018:ffffb8a80032f7d0 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000024
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffffc06d93c0
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
 R10: 00000000bd003000 R11: ffff979fba04aa59 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff979fbba3f22c
 FS:  00007f7638112740(0000) GS:ffff979fbfd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000001c CR3: 00000000734ea001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_sample_init+0x125/0x1d0 [act_sample]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f7637523ba0
 RSP: 002b:00007fff0473ef58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fff0473f080 RCX: 00007f7637523ba0
 RDX: 0000000000000000 RSI: 00007fff0473efd0 RDI: 0000000000000003
 RBP: 000000005aaaac80 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fff0473e9e0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fff0473f094 R14: 0000000000000001 R15: 0000000000669f60
 Code: be 02 00 00 00 48 89 df e8 a9 fe ff ff e9 7c ff ff ff 0f 1f 40 00 0f 1f 44 00 00 53 48 89 fb 48 c7 c7 c0 93 6d c0 e8 db 20 8c ef <83> 6b 1c 01 74 10 48 c7 c7 c0 93 6d c0 ff 14 25 e8 83 83 b0 5b
 RIP: psample_group_put+0x15/0x71 [psample] RSP: ffffb8a80032f7d0
 CR2: 000000000000001c

Fix it in tcf_sample_cleanup(), ensuring that calls to psample_group_put(p)
are done only when p is not NULL.

Fixes: cadb9c9fdb ("net/sched: act_sample: Fix error path in init")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 19:53:28 -04:00
Davide Caratti abdadd3cfd net/sched: fix NULL dereference in the error path of tunnel_key_init()
when the following command

 # tc action add action tunnel_key unset index 100

is run for the first time, and tunnel_key_init() fails to allocate struct
tcf_tunnel_key_params, tunnel_key_release() dereferences NULL pointers.
This causes the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
 IP: tunnel_key_release+0xd/0x40 [act_tunnel_key]
 PGD 8000000033787067 P4D 8000000033787067 PUD 74646067 PMD 0
 Oops: 0000 [#1] SMP PTI
 Modules linked in: act_tunnel_key(E) act_csum ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel snd_hda_intel pcbc snd_hda_codec snd_hda_core snd_hwdep snd_seq aesni_intel snd_seq_device crypto_simd glue_helper snd_pcm cryptd joydev snd_timer pcspkr virtio_balloon snd i2c_piix4 soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net virtio_blk drm virtio_console crc32c_intel ata_piix serio_raw i2c_core virtio_pci libata virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod
 CPU: 2 PID: 3101 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:tunnel_key_release+0xd/0x40 [act_tunnel_key]
 RSP: 0018:ffffba46803b7768 EFLAGS: 00010286
 RAX: ffffffffc09010a0 RBX: 0000000000000000 RCX: 0000000000000024
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff99ee336d7480
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
 R10: 0000000000000220 R11: ffff99ee79d73131 R12: 0000000000000000
 R13: ffff99ee32d67610 R14: ffff99ee7671dc38 R15: 00000000fffffff4
 FS:  00007febcb2cd740(0000) GS:ffff99ee7fd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000010 CR3: 000000007c8e4005 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tunnel_key_init+0xd9/0x460 [act_tunnel_key]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7febca6deba0
 RSP: 002b:00007ffe7b0dd128 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffe7b0dd250 RCX: 00007febca6deba0
 RDX: 0000000000000000 RSI: 00007ffe7b0dd1a0 RDI: 0000000000000003
 RBP: 000000005aaa90cb R08: 0000000000000002 R09: 0000000000000000
 R10: 00007ffe7b0dcba0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffe7b0dd264 R14: 0000000000000001 R15: 0000000000669f60
 Code: 44 00 00 8b 0d b5 23 00 00 48 8b 87 48 10 00 00 48 8b 3c c8 e9 a5 e5 d8 c3 0f 1f 44 00 00 0f 1f 44 00 00 53 48 8b 9f b0 00 00 00 <83> 7b 10 01 74 0b 48 89 df 31 f6 5b e9 f2 fa 7f c3 48 8b 7b 18
 RIP: tunnel_key_release+0xd/0x40 [act_tunnel_key] RSP: ffffba46803b7768
 CR2: 0000000000000010

Fix this in tunnel_key_release(), ensuring 'param' is not NULL before
dereferencing it.

Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 19:53:28 -04:00
Davide Caratti aab378a7ed net/sched: fix NULL dereference in the error path of tcf_csum_init()
when the following command

 # tc action add action csum udp continue index 100

is run for the first time, and tcf_csum_init() fails allocating struct
tcf_csum, tcf_csum_cleanup() calls kfree_rcu(NULL,...). This causes the
following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
 IP: __call_rcu+0x23/0x2b0
 PGD 80000000740b4067 P4D 80000000740b4067 PUD 32e7f067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_csum(E) act_vlan ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer aesni_intel crypto_simd glue_helper cryptd snd joydev pcspkr virtio_balloon i2c_piix4 soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm virtio_net virtio_console ata_piix crc32c_intel libata virtio_pci serio_raw i2c_core virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
 CPU: 2 PID: 5763 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffb275803e77c0 EFLAGS: 00010246
 RAX: ffffffffc057b080 RBX: ffff9674bc6f5240 RCX: 00000000ffffffff
 RDX: ffffffff928a5f00 RSI: 0000000000000008 RDI: 0000000000000008
 RBP: 0000000000000008 R08: 0000000000000001 R09: 0000000000000044
 R10: 0000000000000220 R11: ffff9674b9ab4821 R12: 0000000000000000
 R13: ffffffff928a5f00 R14: 0000000000000000 R15: 0000000000000001
 FS:  00007fa6368d8740(0000) GS:ffff9674bfd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000010 CR3: 0000000073dec001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_csum_init+0xfb/0x180 [act_csum]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fa635ce9ba0
 RSP: 002b:00007ffc185b0fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffc185b10f0 RCX: 00007fa635ce9ba0
 RDX: 0000000000000000 RSI: 00007ffc185b1040 RDI: 0000000000000003
 RBP: 000000005aaa85e0 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007ffc185b0a20 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffc185b1104 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffb275803e77c0
 CR2: 0000000000000010

fix this in tcf_csum_cleanup(), ensuring that kfree_rcu(param, ...) is
called only when param is not NULL.

Fixes: 9c5f69bbd7 ("net/sched: act_csum: don't use spinlock in the fast path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 19:53:28 -04:00
Davide Caratti 1edf8abe04 net/sched: fix NULL dereference in the error path of tcf_vlan_init()
when the following command

 # tc actions replace action vlan pop index 100

is run for the first time, and tcf_vlan_init() fails allocating struct
tcf_vlan_params, tcf_vlan_cleanup() calls kfree_rcu(NULL, ...). This causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 IP: __call_rcu+0x23/0x2b0
 PGD 80000000760a2067 P4D 80000000760a2067 PUD 742c1067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_vlan(E) ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel mbcache snd_hda_codec jbd2 snd_hda_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper snd cryptd joydev soundcore virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_console virtio_blk virtio_net ata_piix crc32c_intel libata virtio_pci i2c_core virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
 CPU: 3 PID: 3119 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffaac3005fb798 EFLAGS: 00010246
 RAX: ffffffffc0704080 RBX: ffff97f2b4bbe900 RCX: 00000000ffffffff
 RDX: ffffffffabca5f00 RSI: 0000000000000010 RDI: 0000000000000010
 RBP: 0000000000000010 R08: 0000000000000001 R09: 0000000000000044
 R10: 00000000fd003000 R11: ffff97f2faab5b91 R12: 0000000000000000
 R13: ffffffffabca5f00 R14: ffff97f2fb80202c R15: 00000000fffffff4
 FS:  00007f68f75b4740(0000) GS:ffff97f2ffd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000018 CR3: 0000000072b52001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_vlan_init+0x168/0x270 [act_vlan]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f68f69c5ba0
 RSP: 002b:00007fffd79c1118 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fffd79c1240 RCX: 00007f68f69c5ba0
 RDX: 0000000000000000 RSI: 00007fffd79c1190 RDI: 0000000000000003
 RBP: 000000005aaa708e R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fffd79c0ba0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fffd79c1254 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffaac3005fb798
 CR2: 0000000000000018

fix this in tcf_vlan_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.

Fixes: 4c5b9d9642 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Manish Kurup <manish.kurup@verizon.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 19:53:28 -04:00
Eric Dumazet cce6294cc2 net: sched: fix uses after free
syzbot reported one use-after-free in pfifo_fast_enqueue() [1]

Issue here is that we can not reuse skb after a successful skb_array_produce()
since another cpu might have consumed it already.

I believe a similar problem exists in try_bulk_dequeue_skb_slow()
in case we put an skb into qdisc_enqueue_skb_bad_txq() for lockless qdisc.

[1]
BUG: KASAN: use-after-free in qdisc_pkt_len include/net/sch_generic.h:610 [inline]
BUG: KASAN: use-after-free in qdisc_qstats_cpu_backlog_inc include/net/sch_generic.h:712 [inline]
BUG: KASAN: use-after-free in pfifo_fast_enqueue+0x4bc/0x5e0 net/sched/sch_generic.c:639
Read of size 4 at addr ffff8801cede37e8 by task syzkaller717588/5543

CPU: 1 PID: 5543 Comm: syzkaller717588 Not tainted 4.16.0-rc4+ #265
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23c/0x360 mm/kasan/report.c:412
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
 qdisc_pkt_len include/net/sch_generic.h:610 [inline]
 qdisc_qstats_cpu_backlog_inc include/net/sch_generic.h:712 [inline]
 pfifo_fast_enqueue+0x4bc/0x5e0 net/sched/sch_generic.c:639
 __dev_xmit_skb net/core/dev.c:3216 [inline]

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+ed43b6903ab968b16f54@syzkaller.appspotmail.com
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc:	Cong Wang <xiyou.wangcong@gmail.com>
Cc:	Jiri Pirko <jiri@resnulli.us>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 17:03:45 -04:00
Roman Mashak 51d4740f88 net sched actions: return explicit error when tunnel_key mode is not specified
If set/unset mode of the tunnel_key action is not provided, ->init() still
returns 0, and the caller proceeds with bogus 'struct tc_action *' object,
this results in crash:

% tc actions add action tunnel_key src_ip 1.1.1.1 dst_ip 2.2.2.1 id 7 index 1

[   35.805515] general protection fault: 0000 [#1] SMP PTI
[   35.806161] Modules linked in: act_tunnel_key kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64
crypto_simd glue_helper cryptd serio_raw
[   35.808233] CPU: 1 PID: 428 Comm: tc Not tainted 4.16.0-rc4+ #286
[   35.808929] RIP: 0010:tcf_action_init+0x90/0x190
[   35.809457] RSP: 0018:ffffb8edc068b9a0 EFLAGS: 00010206
[   35.810053] RAX: 1320c000000a0003 RBX: 0000000000000001 RCX: 0000000000000000
[   35.810866] RDX: 0000000000000070 RSI: 0000000000007965 RDI: ffffb8edc068b910
[   35.811660] RBP: ffffb8edc068b9d0 R08: 0000000000000000 R09: ffffb8edc068b808
[   35.812463] R10: ffffffffc02bf040 R11: 0000000000000040 R12: ffffb8edc068bb38
[   35.813235] R13: 0000000000000000 R14: 0000000000000000 R15: ffffb8edc068b910
[   35.814006] FS:  00007f3d0d8556c0(0000) GS:ffff91d1dbc40000(0000)
knlGS:0000000000000000
[   35.814881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   35.815540] CR2: 000000000043f720 CR3: 0000000019248001 CR4: 00000000001606a0
[   35.816457] Call Trace:
[   35.817158]  tc_ctl_action+0x11a/0x220
[   35.817795]  rtnetlink_rcv_msg+0x23d/0x2e0
[   35.818457]  ? __slab_alloc+0x1c/0x30
[   35.819079]  ? __kmalloc_node_track_caller+0xb1/0x2b0
[   35.819544]  ? rtnl_calcit.isra.30+0xe0/0xe0
[   35.820231]  netlink_rcv_skb+0xce/0x100
[   35.820744]  netlink_unicast+0x164/0x220
[   35.821500]  netlink_sendmsg+0x293/0x370
[   35.822040]  sock_sendmsg+0x30/0x40
[   35.822508]  ___sys_sendmsg+0x2c5/0x2e0
[   35.823149]  ? pagecache_get_page+0x27/0x220
[   35.823714]  ? filemap_fault+0xa2/0x640
[   35.824423]  ? page_add_file_rmap+0x108/0x200
[   35.825065]  ? alloc_set_pte+0x2aa/0x530
[   35.825585]  ? finish_fault+0x4e/0x70
[   35.826140]  ? __handle_mm_fault+0xbc1/0x10d0
[   35.826723]  ? __sys_sendmsg+0x41/0x70
[   35.827230]  __sys_sendmsg+0x41/0x70
[   35.827710]  do_syscall_64+0x68/0x120
[   35.828195]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   35.828859] RIP: 0033:0x7f3d0ca4da67
[   35.829331] RSP: 002b:00007ffc9f284338 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[   35.830304] RAX: ffffffffffffffda RBX: 00007ffc9f284460 RCX: 00007f3d0ca4da67
[   35.831247] RDX: 0000000000000000 RSI: 00007ffc9f2843b0 RDI: 0000000000000003
[   35.832167] RBP: 000000005aa6a7a9 R08: 0000000000000001 R09: 0000000000000000
[   35.833075] R10: 00000000000005f1 R11: 0000000000000246 R12: 0000000000000000
[   35.833997] R13: 00007ffc9f2884c0 R14: 0000000000000001 R15: 0000000000674640
[   35.834923] Code: 24 30 bb 01 00 00 00 45 31 f6 eb 5e 8b 50 08 83 c2 07 83 e2
fc 83 c2 70 49 8b 07 48 8b 40 70 48 85 c0 74 10 48 89 14 24 4c 89 ff <ff> d0 48
8b 14 24 48 01 c2 49 01 d6 45 85 ed 74 05 41 83 47 2c
[   35.837442] RIP: tcf_action_init+0x90/0x190 RSP: ffffb8edc068b9a0
[   35.838291] ---[ end trace a095c06ee4b97a26 ]---

Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-15 14:43:41 -04:00
Daniel Axtens 1dd27cde30 net: use skb_is_gso_sctp() instead of open-coding
As well as the basic conversion, I noticed that a lot of the
SCTP code checks gso_type without first checking skb_is_gso()
so I have added that where appropriate.

Also, document the helper.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:41:47 -05:00
Roman Mashak 9c5c9c5737 net sched actions: implement get_fill_size routine in act_gact
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:25:12 -05:00
Roman Mashak 4e76e75d6a net sched actions: calculate add/delete event message size
Introduce routines to calculate size of the shared tc netlink attributes
and the full message size including netlink header and tc service header.

Update add/delete action logic to have the size for event messages,
the size is passed to tcf_add_notify() and tcf_del_notify() where the
notification message is being allocated and constructed.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:25:11 -05:00
Roman Mashak d04e6990c9 net sched actions: update Add/Delete action API with new argument
Introduce a new function argument to carry total attributes size for
correct allocation of skb in event messages.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:25:11 -05:00
Pieter Jansen van Vuuren 459d153d99 net/sched: cls_flower: Add support to handle first frag as match field
Allow setting firstfrag as matching option in tc flower classifier.

 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_flags firstfrag
     action mirred egress redirect dev eth1

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-08 12:16:29 -05:00
Alexey Kodanev 35d889d10b sch_netem: fix skb leak in netem_enqueue()
When we exceed current packets limit and we have more than one
segment in the list returned by skb_gso_segment(), netem drops
only the first one, skipping the rest, hence kmemleak reports:

unreferenced object 0xffff880b5d23b600 (size 1024):
  comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
  hex dump (first 32 bytes):
    00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000d8a19b9d>] __alloc_skb+0xc9/0x520
    [<000000001709b32f>] skb_segment+0x8c8/0x3710
    [<00000000c7b9bb88>] tcp_gso_segment+0x331/0x1830
    [<00000000c921cba1>] inet_gso_segment+0x476/0x1370
    [<000000008b762dd4>] skb_mac_gso_segment+0x1f9/0x510
    [<000000002182660a>] __skb_gso_segment+0x1dd/0x620
    [<00000000412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
    [<0000000005d3b2a9>] __dev_queue_xmit+0x1167/0x2120
    [<00000000fc5f7327>] ip_finish_output2+0x998/0xf00
    [<00000000d309e9d3>] ip_output+0x1aa/0x2c0
    [<000000007ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
    [<0000000042d2a45f>] tcp_write_xmit+0x4d4/0x58c0
    [<0000000056a44199>] tcp_tasklet_func+0x3d9/0x540
    [<0000000013d06d02>] tasklet_action+0x1ca/0x250
    [<00000000fcde0b8b>] __do_softirq+0x1b4/0x5a3
    [<00000000e7ed027c>] irq_exit+0x1e2/0x210

Fix it by adding the rest of the segments, if any, to skb 'to_free'
list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
because they can be useful in the future if we need to drop segmented
GSO packets in other places.

Fixes: 6071bd1aa1 ("netem: Segment GSO packets on enqueue")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 11:18:14 -05:00
David S. Miller 0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Roman Mashak d143b9e305 net sched actions: corrected extack message
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-05 10:19:29 -05:00
Daniel Axtens ee78bbef8d net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
tbf_enqueue() checks the size of a packet before enqueuing it.
However, the GSO size check does not consider the GSO_BY_FRAGS
case, and so will drop GSO SCTP packets, causing a massive drop
in throughput.

Use skb_gso_validate_mac_len() instead, as it does consider that
case.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Nogah Frankel b9c7a7acc7 net: sch: prio: Add offload ability for grafting a child
Offload sch_prio graft command for capable drivers.
Warn in case of a failure, unless the graft was done as part of a destroy
operation (the new qdisc is a noop) or if all the qdiscs (the parent, the
old child, and the new one) are not offloaded.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 12:06:01 -05:00
Nogah Frankel 98ceb7b6d6 mlxsw: spectrum: qdiscs: prio: Delete child qdiscs when removing bands
When the number the bands of sch_prio is decreased, child qdiscs on the
deleted bands would get deleted as well.
This change and deletions are being done under sch_tree_lock of the
sch_prio qdisc. Part of the destruction of qdisc is unoffloading it, if
it is offloaded. Un-offloading can't be done inside this lock.
Move the offload command to be done before reducing the number of bands,
so unoffloading of the qdiscs that are about to be deleted could be done
outside of the lock.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 12:06:01 -05:00
Nogah Frankel fd5ac14a1a net: sch: Don't warn on missmatching qlen and backlog for offloaded qdiscs
Offloaded qdiscs are allowed to expose only parts of their statistics.
It means that if backlog is being exposed and qlen is not, it might trigger
a warning in qdisc_tree_reduce_backlog.
Do not warn in case the qdisc that was removed was an offloaded one.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 12:06:00 -05:00
Kirill Tkhai 685ecfb198 net: Convert tc_action_net_init() and tc_action_net_exit() based pernet_operations
These pernet_operations are from net/sched directory, and they call only
tc_action_net_init() and tc_action_net_exit():

bpf_net_ops
connmark_net_ops
csum_net_ops
gact_net_ops
ife_net_ops
ipt_net_ops
xt_net_ops
mirred_net_ops
nat_net_ops
pedit_net_ops
police_net_ops
sample_net_ops
simp_net_ops
skbedit_net_ops
skbmod_net_ops
tunnel_key_net_ops
vlan_net_ops

1)tc_action_net_init() just allocates and initializes per-net memory.
2)There should not be in-flight packets at the time of tc_action_net_exit()
call, or another pernet_operations send packets to dying net (except
netlink). So, it seems they can be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:36 -05:00
Kirill Tkhai 02df428ca2 net: Convert simple pernet_operations
These pernet_operations make pretty simple actions
like variable initialization on init, debug checks
on exit, and so on, and they obviously are able
to be executed in parallel with any others:

vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
David S. Miller f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
Eyal Birger ccc007e4a7 net: sched: add em_ipt ematch for calling xtables matches
The commit a new tc ematch for using netfilter xtable matches.

This allows early classification as well as mirroning/redirecting traffic
based on logic implemented in netfilter extensions.

Current supported use case is classification based on the incoming IPSec
state used during decpsulation using the 'policy' iptables extension
(xt_policy).

The module dynamically fetches the netfilter match module and calls
it using a fake xt_action_param structure based on validated userspace
provided parameters.

As the xt_policy match does not access skb->data, no skb modifications
are needed on match.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 13:15:33 -05:00
Roman Kapl 5ae437ad5a net: sched: report if filter is too large to dump
So far, if the filter was too large to fit in the allocated skb, the
kernel did not return any error and stopped dumping. Modify the dumper
so that it returns -EMSGSIZE when a filter fails to dump and it is the
first filter in the skb. If we are not first, we will get a next chance
with more room.

I understand this is pretty near to being an API change, but the
original design (silent truncation) can be considered a bug.

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

Signed-off-by: Roman Kapl <code@rkapl.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 21:57:17 -05:00
David S. Miller f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Alexander Aring 1d4760c75d net: sched: act: mirred: add extack support
This patch adds extack support for TC mirred action.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:05:51 -05:00
Alexander Aring b36201455a net: sched: act: handle extack in tcf_generic_walker
This patch adds extack handling for a common used TC act function
"tcf_generic_walker()" to add an extack message on failures.
The tcf_generic_walker() function can fail if get a invalid command
different than DEL and GET. The naming "action" here is wrong, the
correct naming would be command.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:05:50 -05:00
Alexander Aring 417801055b net: sched: act: add extack for walk callback
This patch adds extack support for act walker callback api. This
prepares to handle extack support inside each specific act
implementation.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:05:50 -05:00
Alexander Aring 331a9295de net: sched: act: add extack for lookup callback
This patch adds extack support for act lookup callback api. This
prepares to handle extack support inside each specific act
implementation.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:05:03 -05:00
Alexander Aring 589dad6d71 net: sched: act: add extack to init callback
This patch adds extack support for act init callback api. This
prepares to handle extack support inside each specific act
implementation.

Based on work by David Ahern <dsahern@gmail.com>

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:05:03 -05:00
Alexander Aring 84ae017a00 net: sched: act: handle generic action errors
This patch adds extack support for generic act handling. The extack
will be set deeper to each called function which is not part of netdev
core api.

Based on work by David Ahern <dsahern@gmail.com>

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:05:02 -05:00
Alexander Aring aea0d72789 net: sched: act: add extack to init
This patch adds extack to tcf_action_init and tcf_action_init_1
functions. These are necessary to make individual extack handling in
each act implementation.

Based on work by David Ahern <dsahern@gmail.com>

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:04:53 -05:00
Alexander Aring 1af8515581 net: sched: act: fix code style
This patch is used by subsequent patches. It fixes code style issues
caught by checkpatch.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:04:53 -05:00
David S. Miller ee99b2d8bf net: Revert sched action extack support series.
It was mis-applied and the changes had rejects.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:03:39 -05:00
Alexander Aring 10defbd29e net: sched: act: add extack to init
This patch adds extack to tcf_action_init and tcf_action_init_1
functions. These are necessary to make individual extack handling in
each act implementation.

Based on work by David Ahern <dsahern@gmail.com>

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:44:42 -05:00
Alexander Aring b7b347fa3c net: sched: act: fix code style
This patch is used by subsequent patches. It fixes code style issues
caught by checkpatch.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:44:41 -05:00
Davide Caratti 66dede2d6b net: sched: fix unbalance in the error path of tca_action_flush()
When tca_action_flush() calls the action walk() and gets an error,
a successful call to nla_nest_start() is not followed by a call to
nla_nest_cancel(). It's harmless, as the skb is freed in the error
path - but it's worth to fix this unbalance.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:43:17 -05:00
Jiri Pirko 339c21d7c4 net: sched: fix tc_u_common lookup
The offending commit wrongly assumes 1:1 mapping between block and q.
However, there are multiple blocks for a single q for classful qdiscs.
Since the obscure tc_u_common sharing mechanism expects it to be shared
among a qdisc, fix it by storing q pointer in case the block is not
shared.

Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Fixes: 7fa9d974f3 ("net: sched: cls_u32: use block instead of q in tc_u_common")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 12:29:02 -05:00
Jiri Pirko bb047ddd14 net: sched: don't set q pointer for shared blocks
It is pointless to set block->q for block which are shared among
multiple qdiscs. So remove the assignment in that case. Do a bit of code
reshuffle to make block->index initialized at that point so we can use
tcf_block_shared() helper.

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Fixes: 4861738775 ("net: sched: introduce shared filter blocks infrastructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 12:29:02 -05:00
Kirill Tkhai 13da199c38 net: Convert subsys_initcall() registered pernet_operations from net/sched
psched_net_ops only creates and destroyes /proc entry,
and safe to be executed in parallel with any foreigh
pernet_operations.

tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht,
which is not touched by foreign pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:07 -05:00
Linus Torvalds c839682c71 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Make allocations less aggressive in x_tables, from Minchal Hocko.

 2) Fix netfilter flowtable Kconfig deps, from Pablo Neira Ayuso.

 3) Fix connection loss problems in rtlwifi, from Larry Finger.

 4) Correct DRAM dump length for some chips in ath10k driver, from Yu
    Wang.

 5) Fix ABORT handling in rxrpc, from David Howells.

 6) Add SPDX tags to Sun networking drivers, from Shannon Nelson.

 7) Some ipv6 onlink handling fixes, from David Ahern.

 8) Netem packet scheduler interval calcualtion fix from Md. Islam.

 9) Don't put crypto buffers on-stack in rxrpc, from David Howells.

10) Fix handling of error non-delivery status in netlink multicast
    delivery over multiple namespaces, from Nicolas Dichtel.

11) Missing xdp flush in tuntap driver, from Jason Wang.

12) Synchonize RDS protocol netns/module teardown with rds object
    management, from Sowini Varadhan.

13) Add nospec annotations to mpls, from Dan Williams.

14) Fix SKB truesize handling in TIPC, from Hoang Le.

15) Interrupt masking fixes in stammc from Niklas Cassel.

16) Don't allow ptr_ring objects to be sized outside of kmalloc's
    limits, from Jason Wang.

17) Don't allow SCTP chunks to be built which will have a length
    exceeding the chunk header's 16-bit length field, from Alexey
    Kodanev.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
  ibmvnic: Remove skb->protocol checks in ibmvnic_xmit
  bpf: fix rlimit in reuseport net selftest
  sctp: verify size of a new chunk in _sctp_make_chunk()
  s390/qeth: fix SETIP command handling
  s390/qeth: fix underestimated count of buffer elements
  ptr_ring: try vmalloc() when kmalloc() fails
  ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE
  net: stmmac: remove redundant enable of PMT irq
  net: stmmac: rename GMAC_INT_DEFAULT_MASK for dwmac4
  net: stmmac: discard disabled flags in interrupt status register
  ibmvnic: Reset long term map ID counter
  tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
  selftests/bpf: add selftest that use test_libbpf_open
  selftests/bpf: add test program for loading BPF ELF files
  tools/libbpf: improve the pr_debug statements to contain section numbers
  bpf: Sync kernel ABI header with tooling header for bpf_common.h
  net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
  net: thunder: change q_len's type to handle max ring size
  tipc: fix skb truesize/datasize ratio control
  net/sched: cls_u32: fix cls_u32 on filter replace
  ...
2018-02-09 15:34:18 -08:00
Linus Torvalds 9d21874da8 Merge branch 'idr-2018-02-06' of git://git.infradead.org/users/willy/linux-dax
Pull idr updates from Matthew Wilcox:

 - test-suite improvements

 - replace the extended API by improving the normal API

 - performance improvement for IDRs which are 1-based rather than
   0-based

 - add documentation

* 'idr-2018-02-06' of git://git.infradead.org/users/willy/linux-dax:
  idr: Add documentation
  idr: Make 1-based IDRs more efficient
  idr: Warn if old iterators see large IDs
  idr: Rename idr_for_each_entry_ext
  idr: Remove idr_alloc_ext
  cls_u32: Convert to idr_alloc_u32
  cls_u32: Reinstate cyclic allocation
  cls_flower: Convert to idr_alloc_u32
  cls_bpf: Convert to use idr_alloc_u32
  cls_basic: Convert to use idr_alloc_u32
  cls_api: Convert to idr_alloc_u32
  net sched actions: Convert to use idr_alloc_u32
  idr: Add idr_alloc_u32 helper
  idr: Delete idr_find_ext function
  idr: Delete idr_replace_ext function
  idr: Delete idr_remove_ext function
  IDR test suite: Check handling negative end correctly
  idr test suite: Fix ida_test_random()
  radix tree test suite: Remove ARRAY_SIZE
2018-02-08 14:39:29 -08:00
Ivan Vecera eb53f7af6f net/sched: cls_u32: fix cls_u32 on filter replace
The following sequence is currently broken:

 # tc qdisc add dev foo ingress
 # tc filter replace dev foo protocol all ingress \
   u32 match u8 0 0 action mirred egress mirror dev bar1
 # tc filter replace dev foo protocol all ingress \
   handle 800::800 pref 49152 \
   u32 match u8 0 0 action mirred egress mirror dev bar2
 Error: cls_u32: Key node flags do not match passed flags.
 We have an error talking to the kernel, -1

The error comes from u32_change() when comparing new and
existing flags. The existing ones always contains one of
TCA_CLS_FLAGS_{,NOT}_IN_HW flag depending on offloading state.
These flags cannot be passed from userspace so the condition
(n->flags != flags) in u32_change() always fails.

Fix the condition so the flags TCA_CLS_FLAGS_NOT_IN_HW and
TCA_CLS_FLAGS_IN_HW are not taken into account.

Fixes: 24d3dc6d27 ("net/sched: cls_u32: Reflect HW offload status")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08 15:27:58 -05:00
Md. Islam 043e337f55 sch_netem: Bug fixing in calculating Netem interval
In Kernel 4.15.0+, Netem does not work properly.

Netem setup:

tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms

Result:

PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=22.8 ms
64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=10.9 ms
64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=11.4 ms
64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=4303 ms
64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.2 ms
64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.3 ms
64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=4304 ms
64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=4303 ms

Patch:

(rnd % (2 * sigma)) - sigma was overflowing s32. After applying the
patch, I found following output which is desirable.

PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=21.1 ms
64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=8.46 ms
64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=9.00 ms
64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=11.8 ms
64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=8.36 ms
64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=8.11 ms
64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=10.0 ms
64 bytes from 172.16.101.2: icmp_seq=9 ttl=64 time=11.3 ms
64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.5 ms
64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.2 ms

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07 21:59:12 -05:00
Matthew Wilcox 7a4575778f idr: Rename idr_for_each_entry_ext
Most places in the kernel that we need to distinguish functions by the
type of their arguments, we use '_ul' as a suffix for the unsigned long
variant, not '_ext'.  Also add kernel-doc.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:41:28 -05:00
Matthew Wilcox f730cb93db cls_u32: Convert to idr_alloc_u32
No real benefit to this classifier, but since we're allocating a u32
anyway, we should use this function.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:41:27 -05:00
Matthew Wilcox ffdc2d9e1a cls_u32: Reinstate cyclic allocation
Commit e7614370d6 ("net_sched: use idr to allocate u32 filter handles)
converted htid allocation to use the IDR.  The ID allocated by this
scheme changes; it used to be cyclic, but now always allocates the
lowest available.  The IDR supports cyclic allocation, so just use
the right function.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:41:27 -05:00
Matthew Wilcox 85bd0438a2 cls_flower: Convert to idr_alloc_u32
Use the new helper which saves a temporary variable and a few lines
of code.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:41:26 -05:00
Matthew Wilcox 0b4ce8da79 cls_bpf: Convert to use idr_alloc_u32
Use the new helper.  This has a modest reduction in both lines of code
and compiled code size.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:41:26 -05:00
Matthew Wilcox 05af0ebb08 cls_basic: Convert to use idr_alloc_u32
Use the new helper which saves a temporary variable and a few lines of
code.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:41:26 -05:00
Matthew Wilcox 9ce75499ac cls_api: Convert to idr_alloc_u32
Use the new helper.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:40:33 -05:00
Matthew Wilcox 339913a8be net sched actions: Convert to use idr_alloc_u32
Use the new helper.  Also untangle the error path, and in so doing
noticed that estimator generator failure would lead to us leaking an
ID.  Fix that bug.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:40:33 -05:00
Matthew Wilcox 322d884ba7 idr: Delete idr_find_ext function
Simply changing idr_remove's 'id' argument to 'unsigned long' works
for all callers.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:40:32 -05:00
Matthew Wilcox 234a4624ef idr: Delete idr_replace_ext function
Changing idr_replace's 'id' argument to 'unsigned long' works for all
callers.  Callers which passed a negative ID now get -ENOENT instead of
-EINVAL.  No callers relied on this error value.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:40:31 -05:00
Matthew Wilcox 9c16094140 idr: Delete idr_remove_ext function
Simply changing idr_remove's 'id' argument to 'unsigned long' suffices
for all callers.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-02-06 16:40:31 -05:00
Paolo Abeni d7cdee5ea8 cls_u32: fix use after free in u32_destroy_key()
Li Shuang reported an Oops with cls_u32 due to an use-after-free
in u32_destroy_key(). The use-after-free can be triggered with:

dev=lo
tc qdisc add dev $dev root handle 1: htb default 10
tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256
tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\
 10.0.0.0/8 hashkey mask 0x0000ff00 at 16 link 1:
tc qdisc del dev $dev root

Which causes the following kasan splat:

 ==================================================================
 BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
 Read of size 4 at addr ffff881b83dae618 by task kworker/u48:5/571

 CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87
 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
 Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32]
 Call Trace:
  dump_stack+0xd6/0x182
  ? dma_virt_map_sg+0x22e/0x22e
  print_address_description+0x73/0x290
  kasan_report+0x277/0x360
  ? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
  u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
  u32_delete_key_freepf_work+0x1c/0x30 [cls_u32]
  process_one_work+0xae0/0x1c80
  ? sched_clock+0x5/0x10
  ? pwq_dec_nr_in_flight+0x3c0/0x3c0
  ? _raw_spin_unlock_irq+0x29/0x40
  ? trace_hardirqs_on_caller+0x381/0x570
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1e5/0x760
  ? finish_task_switch+0x208/0x760
  ? preempt_notifier_dec+0x20/0x20
  ? __schedule+0x839/0x1ee0
  ? check_noncircular+0x20/0x20
  ? firmware_map_remove+0x73/0x73
  ? find_held_lock+0x39/0x1c0
  ? worker_thread+0x434/0x1820
  ? lock_contended+0xee0/0xee0
  ? lock_release+0x1100/0x1100
  ? init_rescuer.part.16+0x150/0x150
  ? retint_kernel+0x10/0x10
  worker_thread+0x216/0x1820
  ? process_one_work+0x1c80/0x1c80
  ? lock_acquire+0x1a5/0x540
  ? lock_downgrade+0x6b0/0x6b0
  ? sched_clock+0x5/0x10
  ? lock_release+0x1100/0x1100
  ? compat_start_thread+0x80/0x80
  ? do_raw_spin_trylock+0x190/0x190
  ? _raw_spin_unlock_irq+0x29/0x40
  ? trace_hardirqs_on_caller+0x381/0x570
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1e5/0x760
  ? finish_task_switch+0x208/0x760
  ? preempt_notifier_dec+0x20/0x20
  ? __schedule+0x839/0x1ee0
  ? kmem_cache_alloc_trace+0x143/0x320
  ? firmware_map_remove+0x73/0x73
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1c0
  ? schedule+0xf3/0x3b0
  ? lock_downgrade+0x6b0/0x6b0
  ? __schedule+0x1ee0/0x1ee0
  ? do_wait_intr_irq+0x340/0x340
  ? do_raw_spin_trylock+0x190/0x190
  ? _raw_spin_unlock_irqrestore+0x32/0x60
  ? process_one_work+0x1c80/0x1c80
  ? process_one_work+0x1c80/0x1c80
  kthread+0x312/0x3d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x3a/0x50

 Allocated by task 1688:
  kasan_kmalloc+0xa0/0xd0
  __kmalloc+0x162/0x380
  u32_change+0x1220/0x3c9e [cls_u32]
  tc_ctl_tfilter+0x1ba6/0x2f80
  rtnetlink_rcv_msg+0x4f0/0x9d0
  netlink_rcv_skb+0x124/0x320
  netlink_unicast+0x430/0x600
  netlink_sendmsg+0x8fa/0xd60
  sock_sendmsg+0xb1/0xe0
  ___sys_sendmsg+0x678/0x980
  __sys_sendmsg+0xc4/0x210
  do_syscall_64+0x232/0x7f0
  return_from_SYSCALL_64+0x0/0x75

 Freed by task 112:
  kasan_slab_free+0x71/0xc0
  kfree+0x114/0x320
  rcu_process_callbacks+0xc3f/0x1600
  __do_softirq+0x2bf/0xc06

 The buggy address belongs to the object at ffff881b83dae600
  which belongs to the cache kmalloc-4096 of size 4096
 The buggy address is located 24 bytes inside of
  4096-byte region [ffff881b83dae600, ffff881b83daf600)
 The buggy address belongs to the page:
 page:ffffea006e0f6a00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
 flags: 0x17ffffc0008100(slab|head)
 raw: 0017ffffc0008100 0000000000000000 0000000000000000 0000000100070007
 raw: dead000000000100 dead000000000200 ffff880187c0e600 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
  ffff881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ==================================================================

The problem is that the htnode is freed before the linked knodes and the
latter will try to access the first at u32_destroy_key() time.
This change addresses the issue using the htnode refcnt to guarantee
the correct free order. While at it also add a RCU annotation,
to keep sparse happy.

v1 -> v2: use rtnl_derefence() instead of RCU read locks
v2 -> v3:
  - don't check refcnt in u32_destroy_hnode()
  - cleaned-up u32_destroy() implementation
  - cleaned-up code comment
v3 -> v4:
  - dropped unneeded comment

Reported-by: Li Shuang <shuali@redhat.com>
Fixes: c0d378ef12 ("net_sched: use tcf_queue_work() in u32 filter")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-06 11:25:54 -05:00
Paolo Abeni 058a6c0334 cls_u32: add missing RCU annotation.
In a couple of points of the control path, n->ht_down is currently
accessed without the required RCU annotation. The accesses are
safe, but sparse complaints. Since we already held the
rtnl lock, let use rtnl_dereference().

Fixes: a1b7c5fd7f ("net: sched: add cls_u32 offload hooks for netdevs")
Fixes: de5df63228 ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-02 19:25:03 -05:00
Daniel Axtens 2b16f04872 net: create skb_gso_validate_mac_len()
If you take a GSO skb, and split it into packets, will the MAC
length (L2 + L3 + L4 headers + payload) of those packets be small
enough to fit within a given length?

Move skb_gso_mac_seglen() to skbuff.h with other related functions
like skb_gso_network_seglen() so we can use it, and then create
skb_gso_validate_mac_len to do the full calculation.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01 09:36:03 -05:00
Ulf Magnusson 11eab14805 net/sched: kconfig: Remove blank help texts
Blank help texts are probably either a typo, a Kconfig misunderstanding,
or some kind of half-committing to adding a help text (in which case a
TODO comment would be clearer, if the help text really can't be added
right away).

Best to remove them, IMO.

Signed-off-by: Ulf Magnusson <ulfalizer@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-31 10:26:30 -05:00
Cong Wang 7007ba630e net_sched: implement ->change_tx_queue_len() for pfifo_fast
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in qdisc during init.
They don't have to implement this, it is nicer if they do so
that users don't have to re-configure qdisc after changing
tx_queue_len.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
Cong Wang 48bfd55e7e net_sched: plug in qdisc ops change_tx_queue_len
Introduce a new qdisc ops ->change_tx_queue_len() so that
each qdisc could decide how to implement this if it wants.
Previously we simply read dev->tx_queue_len, after pfifo_fast
switches to skb array, we need this API to resize the skb array
when we change dev->tx_queue_len.

To avoid handling race conditions with TX BH, we need to
deactivate all TX queues before change the value and bring them
back after we are done, this also makes implementation easier.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
David S. Miller 955bd1d216 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 23:44:15 -05:00
Jakub Kicinski 458e704d4d cls_u32: propagate extack to delete callback
Propagate extack on removal of offloaded filter.  Don't pass
extack from error paths.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:11 -05:00
Jakub Kicinski f40fe58d13 cls_u32: pass offload flags to tc_cls_common_offload_init()
Pass offload flags to the new implementation of
tc_cls_common_offload_init().  Extack will now only
be set if user requested skip_sw.  hnodes need to
hold onto the flags now to be able to reuse them
on filter removal.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:11 -05:00
Jakub Kicinski 1b0f80375c cls_flower: propagate extack to delete callback
Propagate extack on removal of offloaded filter.  Don't pass
extack from error paths.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski ea2059409c cls_flower: pass offload flags to tc_cls_common_offload_init()
Pass offload flags to the new implementation of
tc_cls_common_offload_init().  Extack will now only
be set if user requested skip_sw.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski b505b29f68 cls_matchall: propagate extack to delete callback
Propagate extack on removal of offloaded filter.  Don't pass
extack from error paths.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski 93da52b567 cls_matchall: pass offload flags to tc_cls_common_offload_init()
Pass offload flags to the new implementation of
tc_cls_common_offload_init().  Extack will now only
be set if user requested skip_sw.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski 0e908a450a cls_bpf: propagate extack to offload delete callback
Propagate extack on removal of offloaded filter.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski a6ffd6b5d6 cls_bpf: pass offload flags to tc_cls_common_offload_init()
Pass offload flags to the new implementation of
tc_cls_common_offload_init().  Extack will now only
be set if user requested skip_sw.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski f558fdea03 cls_bpf: remove gen_flags from bpf_offload
cls_bpf now guarantees that only device-bound programs are
allowed with skip_sw.  The drivers no longer pay attention to
flags on filter load, therefore the bpf_offload member can be
removed.  If flags are needed again they should probably be
added to struct tc_cls_common_offload instead.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski 34832e1c70 net: sched: prepare for reimplementation of tc_cls_common_offload_init()
Rename the tc_cls_common_offload_init() helper function to
tc_cls_common_offload_init_deprecated() and add a new implementation
which also takes flags argument.  We will only set extack if flags
indicate that offload is forced (skip_sw) otherwise driver errors
should be ignored, as they don't influence the overall filter
installation.

Note that we need the tc_skip_hw() helper for new version, therefore
it is added later in the file.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:10 -05:00
Jakub Kicinski 715df5ecab net: sched: propagate extack to cls->destroy callbacks
Propagate extack to cls->destroy callbacks when called from
non-error paths.  On error paths pass NULL to avoid overwriting
the failure message.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 16:01:09 -05:00
Wolfgang Bumiller 560a66075d net: sched: em_nbyte: don't add the data offset twice
'ptr' is shifted by the offset and then validated,
the memcmp should not add it a second time.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24 14:52:40 -05:00
Davide Caratti 9c5f69bbd7 net/sched: act_csum: don't use spinlock in the fast path
use RCU instead of spin_{,unlock}_bh() to protect concurrent read/write on
act_csum configuration, to reduce the effects of contention in the data
path when multiple readers are present.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 19:51:46 -05:00
Davide Caratti f6052cf2fc net/sched: act_csum: use per-core statistics
use per-CPU counters, like other TC actions do, instead of maintaining one
set of stats across all cores. This allows updating act_csum stats without
the need of protecting them using spin_{,un}lock_bh() invocations.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 19:51:46 -05:00
Quentin Monnet 8f0b425a71 net: sched: add extack support for offload via tc_cls_common_offload
Add extack support for hardware offload of classifiers. In order
to achieve this, a pointer to a struct netlink_ext_ack is added to the
struct tc_cls_common_offload that is passed to the callback for setting
up the classifier. Function tc_cls_common_offload_init() is updated to
support initialization of this new attribute.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:28:32 -05:00
Quentin Monnet 631f65ff22 net: sched: cls_bpf: plumb extack support in filter for hardware offload
Pass the extack pointer obtained in the `->change()` filter operation to
cls_bpf_offload() and then to cls_bpf_offload_cmd(). This makes it
possible to use this extack pointer in drivers offloading BPF programs
in a future patch.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:28:31 -05:00
Quentin Monnet 10a47e0f09 net: sched: cls_u32: propagate extack support for filter offload
Propagate the extack pointer from the `->change()` classifier operation
to the function used for filter replacement in cls_u32. This makes it
possible to use netlink extack messages in the future at replacement
time for this filter, although it is not used at this point.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:28:23 -05:00
Quentin Monnet 0279814055 net: sched: cls_matchall: propagate extack support for filter offload
Propagate the extack pointer from the `->change()` classifier operation
to the function used for filter replacement in cls_matchall. This makes
it possible to use netlink extack messages in the future at replacement
time for this filter, although it is not used at this point.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:28:23 -05:00
Quentin Monnet 41002038f9 net: sched: cls_flower: propagate extack support for filter offload
Propagate the extack pointer from the `->change()` classifier operation
to the function used for filter replacement in cls_flower. This makes it
possible to use netlink extack messages in the future at replacement
time for this filter, although it is not used at this point.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:28:22 -05:00
David Decotigny b2d3bcfa26 net: core: Expose number of link up/down transitions
Expose the number of times the link has been going UP or DOWN, and
update the "carrier_changes" counter to be the sum of these two events.
While at it, also update the sysfs-class-net documentation to cover:
carrier_changes (3.15), carrier_up_count (4.16) and carrier_down_count
(4.16)

Signed-off-by: David Decotigny <decot@googlers.com>
[Florian:
* rebase
* add documentation
* merge carrier_changes with up/down counters]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 15:42:05 -05:00
David S. Miller 8565d26bcb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The BPF verifier conflict was some minor contextual issue.

The TUN conflict was less trivial.  Cong Wang fixed a memory leak of
tfile->tx_array in 'net'.  This is an skb_array.  But meanwhile in
net-next tun changed tfile->tx_arry into tfile->tx_ring which is a
ptr_ring.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 22:59:33 -05:00
Alexander Aring 4b981dbc22 net: sched: cls_u32: add extack support
This patch adds extack support for the u32 classifier as example for
delete and init callback.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Alexander Aring 1057c55f6b net: sched: cls: add extack support for tcf_change_indev
This patch adds extack handling for the tcf_change_indev function which
is common used by TC classifier implementations.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Alexander Aring 571acf2106 net: sched: cls: add extack support for delete callback
This patch adds extack support for classifier delete callback api. This
prepares to handle extack support inside each specific classifier
implementation.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Alexander Aring 50a561900e net: sched: cls: add extack support for tcf_exts_validate
The tcf_exts_validate function calls the act api change callback. For
preparing extack support for act api, this patch adds the extack as
parameter for this function which is common used in cls implementations.

Furthermore the tcf_exts_validate will call action init callback which
prepares the TC action subsystem for extack support.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Alexander Aring 7306db38a6 net: sched: cls: add extack support for change callback
This patch adds extack support for classifier change callback api. This
prepares to handle extack support inside each specific classifier
implementation.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Alexander Aring c35a4acc29 net: sched: cls_api: handle generic cls errors
This patch adds extack support for generic cls handling. The extack
will be set deeper to each called function which is not part of netdev
core api.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Alexander Aring 8865fdd4e1 net: sched: cls: fix code style issues
This patch changes some code style issues pointed out by checkpatch
inside the TC cls subsystem.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:52:51 -05:00
Andrew Morton ef58ca38db net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues
gcc-4.4.4 has problems witn anon union initializers.  Work around this.

net/sched/sch_prio.c: In function 'prio_dump_offload':
net/sched/sch_prio.c:260: error: unknown field 'stats' specified in initializer
net/sched/sch_prio.c:260: warning: initialization makes integer from pointer without a cast
net/sched/sch_prio.c:261: error: unknown field 'stats' specified in initializer
net/sched/sch_prio.c:261: warning: initialization makes integer from pointer without a cast

Fixes: 7fdb61b44c ("net: sch: prio: Add offload ability to PRIO qdisc")
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Yuval Mintz <yuvalm@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-18 21:11:31 -05:00
Jiri Pirko d680b3524c net: sched: silence uninitialized parent variable warning in tc_dump_tfilter
When tcm->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK, parent is still passed
down but the value is never used. Compiler does not recognize it and
issues a warning. Silence it down initializing parent to 0.

Fixes: 7960d1daf2 ("net: sched: use block index as a handle instead of qdisc when block is shared")
Reported-by: David Miller <davem@davemloft.net>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-18 10:37:12 -05:00
Daniel Borkmann ad9294dbc2 bpf: fix cls_bpf on filter replace
Running the following sequence is currently broken:

  # tc qdisc add dev foo clsact
  # tc filter replace dev foo ingress prio 1 handle 1 bpf da obj bar.o
  # tc filter replace dev foo ingress prio 1 handle 1 bpf da obj bar.o
  RTNETLINK answers: Invalid argument

The normal expectation on kernel side is that the second command
succeeds replacing the existing program. However, what happens is
in cls_bpf_change(), we bail out with err in the second run in
cls_bpf_offload(). The EINVAL comes directly in cls_bpf_offload()
when comparing prog vs oldprog's gen_flags. In case of above
replace the new prog's gen_flags are 0, but the old ones are 8,
which means TCA_CLS_FLAGS_NOT_IN_HW is set (e.g. drivers not having
cls_bpf offload).

Fix 102740bd94 ("cls_bpf: fix offload assumptions after callback
conversion") in the following way: gen_flags from user space passed
down via netlink cannot include status flags like TCA_CLS_FLAGS_IN_HW
or TCA_CLS_FLAGS_NOT_IN_HW as opposed to oldprog that we previously
loaded. Therefore, it doesn't make any sense to include them in the
gen_flags comparison with the new prog before we even attempt to
offload. Thus, lets fix this before 4.15 goes out.

Fixes: 102740bd94 ("cls_bpf: fix offload assumptions after callback conversion")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 17:14:06 -05:00
Jiri Pirko 51ab2994c3 net: sched: allow ingress and clsact qdiscs to share filter blocks
Benefit from the previously introduced shared filter blocks
infrastructure and allow ingress and clsact qdisc instances to share
filter blocks. The block index is coming from userspace as qdisc option.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko d47a6b0e7c net: sched: introduce ingress/egress block index attributes for qdisc
Introduce two new attributes to be used for qdisc creation and dumping.
One for ingress block, one for egress block. Introduce a set of ops that
qdisc which supports block sharing would implement.

Passing block indexes in qdisc change is not supported yet and it is
checked and forbidded.

In future, these attributes are to be reused for specifying block
indexes for classes as well. As of this moment however, it is not
supported so a check is in place to forbid it.

Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko 7960d1daf2 net: sched: use block index as a handle instead of qdisc when block is shared
As the tcm_ifindex with value TCM_IFINDEX_MAGIC_BLOCK is invalid ifindex,
use it to indicate that we work with block, instead of qdisc.
So if tcm_ifindex is set to TCM_IFINDEX_MAGIC_BLOCK, tcm_parent is used
to carry block_index.

If the block is set to be shared between at least 2 qdiscs, it is
forbidden to use the qdisc handle to add/delete filters. In that case,
userspace has to pass block_index.

Also, for dump of the filters, in case the block is shared in between at
least 2 qdiscs, the each filter is dumped with tcm_ifindex value
TCM_IFINDEX_MAGIC_BLOCK and tcm_parent set to block_index. That gives
the user clear indication, that the filter belongs to a shared block
and not only to one qdisc under which it is dumped.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko caa7260156 net: sched: keep track of offloaded filters and check tc offload feature
During block bind, we need to check tc offload feature. If it is
disabled yet still the block contains offloaded filters, forbid the
bind. Also forbid to register callback for a block that already
contains offloaded filters, as the play back is not supported now.
For keeping track of offloaded filters there is a new counter
introduced, alongside with couple of helpers called from cls_* code.
These helpers set and clear TCA_CLS_FLAGS_IN_HW flag.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko edf6711c98 net: sched: remove classid and q fields from tcf_proto
Both are no longer used, so remove them.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko f36fe1c498 net: sched: introduce block mechanism to handle netif_keep_dst calls
Couple of classifiers call netif_keep_dst directly on q->dev. That is
not possible to do directly for shared blocke where multiple qdiscs are
owning the block. So introduce a infrastructure to keep track of the
block owners in list and use this list to implement block variant of
netif_keep_dst.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko 9d3aaff3d8 net: sched: avoid usage of tp->q in tcf_classify
Use block index in the messages instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko 4861738775 net: sched: introduce shared filter blocks infrastructure
Allow qdiscs to share filter blocks among them. Each qdisc type has to
use block get/put extended modifications that enable sharing.
Shared blocks are tracked within each net namespace and identified
by u32 index. This index is passed from user during the qdisc creation.
If user passes index that is not used by any other qdisc, new block
is created. If user passes index that is already used, the existing
block will be re-used.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko a9b19443ed net: sched: introduce support for multiple filter chain pointers registration
So far, there was possible only to register a single filter chain
pointer to block->chain[0]. However, when the blocks will get shareable,
we need to allow multiple filter chain pointers registration.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jakub Kicinski 416ef9b15c net: sched: red: don't reset the backlog on every stat dump
Commit 0dfb33a0d7 ("sch_red: report backlog information") copied
child's backlog into RED's backlog.  Back then RED did not maintain
its own backlog counts.  This has changed after commit 2ccccf5fb4
("net_sched: update hierarchical backlog too") and commit d7f4f332f0
("sch_red: update backlog as well").  Copying is no longer necessary.

Tested:

$ tc -s qdisc show dev veth0
qdisc red 1: root refcnt 2 limit 400000b min 30000b max 30000b ecn
 Sent 20942 bytes 221 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 1260b 14p requeues 14
  marked 0 early 0 pdrop 0 other 0
qdisc tbf 2: parent 1: rate 1Kbit burst 15000b lat 3585.0s
 Sent 20942 bytes 221 pkt (dropped 0, overlimits 138 requeues 0)
 backlog 1260b 14p requeues 14

Recently RED offload was added.  We need to make sure drivers don't
depend on resetting the stats.  This means backlog should be treated
like any other statistic:

  total_stat = new_hw_stat - prev_hw_stat;

Adjust mlxsw.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Nogah Frankel <nogahf@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:29:32 -05:00
David S. Miller c02b3741eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes all over.

The mini-qdisc bits were a little bit tricky, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 00:10:42 -05:00
Daniel Borkmann 81d947e2b8 net, sched: fix panic when updating miniq {b,q}stats
While working on fixing another bug, I ran into the following panic
on arm64 by simply attaching clsact qdisc, adding a filter and running
traffic on ingress to it:

  [...]
  [  178.188591] Unable to handle kernel read from unreadable memory at virtual address 810fb501f000
  [  178.197314] Mem abort info:
  [  178.200121]   ESR = 0x96000004
  [  178.203168]   Exception class = DABT (current EL), IL = 32 bits
  [  178.209095]   SET = 0, FnV = 0
  [  178.212157]   EA = 0, S1PTW = 0
  [  178.215288] Data abort info:
  [  178.218175]   ISV = 0, ISS = 0x00000004
  [  178.222019]   CM = 0, WnR = 0
  [  178.224997] user pgtable: 4k pages, 48-bit VAs, pgd = 0000000023cb3f33
  [  178.231531] [0000810fb501f000] *pgd=0000000000000000
  [  178.236508] Internal error: Oops: 96000004 [#1] SMP
  [...]
  [  178.311855] CPU: 73 PID: 2497 Comm: ping Tainted: G        W        4.15.0-rc7+ #5
  [  178.319413] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
  [  178.326887] pstate: 60400005 (nZCv daif +PAN -UAO)
  [  178.331685] pc : __netif_receive_skb_core+0x49c/0xac8
  [  178.336728] lr : __netif_receive_skb+0x28/0x78
  [  178.341161] sp : ffff00002344b750
  [  178.344465] x29: ffff00002344b750 x28: ffff810fbdfd0580
  [  178.349769] x27: 0000000000000000 x26: ffff000009378000
  [...]
  [  178.418715] x1 : 0000000000000054 x0 : 0000000000000000
  [  178.424020] Process ping (pid: 2497, stack limit = 0x000000009f0a3ff4)
  [  178.430537] Call trace:
  [  178.432976]  __netif_receive_skb_core+0x49c/0xac8
  [  178.437670]  __netif_receive_skb+0x28/0x78
  [  178.441757]  process_backlog+0x9c/0x160
  [  178.445584]  net_rx_action+0x2f8/0x3f0
  [...]

Reason is that sch_ingress and sch_clsact are doing mini_qdisc_pair_init()
which sets up miniq pointers to cpu_{b,q}stats from the underlying qdisc.
Problem is that this cannot work since they are actually set up right after
the qdisc ->init() callback in qdisc_create(), so first packet going into
sch_handle_ingress() tries to call mini_qdisc_bstats_cpu_update() and we
therefore panic.

In order to fix this, allocation of {b,q}stats needs to happen before we
call into ->init(). In net-next, there's already such option through commit
d59f5ffa59 ("net: sched: a dflt qdisc may be used with per cpu stats").
However, the bug needs to be fixed in net still for 4.15. Thus, include
these bits to reduce any merge churn and reuse the static_flags field to
set TCQ_F_CPUSTATS, and remove the allocation from qdisc_create() since
there is no other user left. Prashant Bhole ran into the same issue but
for net-next, thus adding him below as well as co-author. Same issue was
also reported by Sandipan Das when using bcc.

Fixes: 46209401f8 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath")
Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2018-January/001190.html
Reported-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Co-authored-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Co-authored-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16 15:02:36 -05:00
Alexey Dobriyan 96890d6252 net: delete /proc THIS_MODULE references
/proc has been ignoring struct file_operations::owner field for 10 years.
Specifically, it started with commit 786d7e1612
("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
inode->i_fop is initialized with proxy struct file_operations for
regular files:

	-               if (de->proc_fops)
	-                       inode->i_fop = de->proc_fops;
	+               if (de->proc_fops) {
	+                       if (S_ISREG(inode->i_mode))
	+                               inode->i_fop = &proc_reg_file_ops;
	+                       else
	+                               inode->i_fop = de->proc_fops;
	+               }

VFS stopped pinning module at this point.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16 15:01:33 -05:00
Nogah Frankel 7fdb61b44c net: sch: prio: Add offload ability to PRIO qdisc
Add the ability to offload PRIO qdisc by using ndo_setup_tc.
There are three commands for PRIO offloading:
* TC_PRIO_REPLACE: handles set and tune
* TC_PRIO_DESTROY: handles qdisc destroy
* TC_PRIO_STATS: updates the qdiscs counters (given as reference)

Like RED qdisc, the indication of whether PRIO is being offloaded is being
set and updated as part of the dump function. It is so because the driver
could decide to offload or not based on the qdisc parent, which could
change without notifying the qdisc.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-14 12:21:11 -05:00
Nogah Frankel f8253df553 net: sch: red: Change offloaded xstats to be incremental
Change the value of the xstats requested from the driver for offloaded RED
to be incremental, like the normal stats.
It increases consistency - if a qdisc stops being offloaded its xstats
don't change.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 16:07:40 -05:00
David S. Miller a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Wei Yongjun 9540d97761 net: sched: fix skb leak in dev_requeue_skb()
When dev_requeue_skb() is called with bulked skb list, only the
first skb of the list will be requeued to qdisc layer, and leak
the others without free them.

TCP is broken due to skb leak since no free skb will be considered
as still in the host queue and never be retransmitted. This happend
when dev_requeue_skb() called from qdisc_restart().
  qdisc_restart
  |-- dequeue_skb
  |-- sch_direct_xmit()
      |-- dev_requeue_skb() <-- skb may bluked

Fix dev_requeue_skb() to requeue the full bluked list. Also change
to use __skb_queue_tail() in __dev_requeue_skb() to avoid skb out
of order.

Fixes: a53851e2c3 ("net: sched: explicit locking in gso_cpu fallback")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 13:48:29 -05:00
Roi Dayan 3bb23421a5 net/sched: Fix update of lastuse in act modules implementing stats_update
We need to update lastuse to to the most updated value between what
is already set and the new value.
If HW matching fails, i.e. because of an issue, the stats are not updated
but it could be that software did match and updated lastuse.

Fixes: 5712bf9c5c ("net/sched: act_mirred: Use passed lastuse argument")
Fixes: 9fea47d93b ("net/sched: act_gact: Update statistics when offloaded to hardware")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 13:27:52 -05:00
Nogah Frankel 44edf2f897 net: sched: Move offload check till after dump call
Move the check of the offload state to after the qdisc dump action was
called, so the qdisc could update it if it was changed.

Fixes: 7a4fa29106 ("net: sched: Add TCA_HW_OFFLOAD")
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 13:17:45 -05:00
Nogah Frankel 8234af2db3 net_sch: red: Fix the new offload indication
Update the offload flag, TCQ_F_OFFLOADED, in each dump call (and ignore
the offloading function return value in relation to this flag).
This is done because a qdisc is being initialized, and therefore offloaded
before being grafted. Since the ability of the driver to offload the qdisc
depends on its location, a qdisc can be offloaded and un-offloaded by graft
calls, that doesn't effect the qdisc itself.

Fixes: 428a68af3a ("net: sched: Move to new offload indication in RED"
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 13:17:45 -05:00
David S. Miller 6bb8824732 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/ipv6/ip6_gre.c is a case of parallel adds.

include/trace/events/tcp.h is a little bit more tricky.  The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29 15:42:26 -05:00
Jiri Pirko 8ec6957403 net: sched: don't set extack message in case the qdisc will be created
If the qdisc is not found here, it is going to be created. Therefore,
this is not an error path. Remove the extack message set and don't
confuse user with error message in case the qdisc was created
successfully.

Fixes: 0921559811 ("net: sched: sch_api: handle generic qdisc errors")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:18:31 -05:00
David S. Miller 9f30e5c5c2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-12-22

1) Separate ESP handling from segmentation for GRO packets.
   This unifies the IPsec GSO and non GSO codepath.

2) Add asynchronous callbacks for xfrm on layer 2. This
   adds the necessary infrastructure to core networking.

3) Allow to use the layer2 IPsec GSO codepath for software
   crypto, all infrastructure is there now.

4) Also allow IPsec GSO with software crypto for local sockets.

5) Don't require synchronous crypto fallback on IPsec offloading,
   it is not needed anymore.

6) Check for xdo_dev_state_free and only call it if implemented.
   From Shannon Nelson.

7) Check for the required add and delete functions when a driver
   registers xdo_dev_ops. From Shannon Nelson.

8) Define xfrmdev_ops only with offload config.
   From Shannon Nelson.

9) Update the xfrm stats documentation.
   From Shannon Nelson.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 11:15:14 -05:00
Jiri Pirko 4853f128c1 net: sched: fix possible null pointer deref in tcf_block_put
We need to check block for being null in both tcf_block_put and
tcf_block_put_ext.

Fixes: 343723dd51 ("net: sched: fix clsact init error path")
Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 13:02:05 -05:00
Cong Wang b2fb01f426 net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()
The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for
flying RCU callback installed by a previous mini_qdisc_pair_swap(),
however we miss it on the tp_head==NULL path, which leads to that
the RCU callback still uses miniq_old->rcu after it is freed together
with qdisc in qdisc_graft(). So just add it on that path too.

Fixes: 46209401f8 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath ")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 12:28:40 -05:00
David S. Miller fba961ab29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes.  Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.

Include a necessary change by Jakub Kicinski, with log message:

====================
cls_bpf no longer takes care of offload tracking.  Make sure
netdevsim performs necessary checks.  This fixes a warning
caused by TC trying to remove a filter it has not added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22 11:16:31 -05:00
Alexander Aring a7c31693e1 net: sch: sch_drr: add extack support
This patch adds extack support for the drr qdisc implementation by
adding NL_SET_ERR_MSG in validation of user input.
Also it serves to illustrate a use case of how the infrastructure ops
api changes are to be used by individual qdiscs.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring 710fb39689 net: sch: sch_cbs: add extack support
This patch adds extack support for the cbs qdisc implementation by
adding NL_SET_ERR_MSG in validation of user input.
Also it serves to illustrate a use case of how the infrastructure ops
api changes are to be used by individual qdiscs.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring 62a6de62dc net: sch: sch_cbq: add extack support
This patch adds extack support for the cbq qdisc implementation by
adding NL_SET_ERR_MSG in validation of user input.
Also it serves to illustrate a use case of how the infrastructure ops
api changes are to be used by individual qdiscs.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring a38a98821c net: sch: api: add extack support in qdisc_create_dflt
This patch adds extack support for the function qdisc_create_dflt which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_create_dflt failed. The function qdisc_create_dflt
will also call an init callback which can fail by any per-qdisc specific
handling.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring d0bd684ddd net: sch: api: add extack support in qdisc_alloc
This patch adds extack support for the function qdisc_alloc which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_alloc failed.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring 8d1a77f974 net: sch: api: add extack support in tcf_block_get
This patch adds extack support for the function tcf_block_get which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why tcf_block_get failed.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring e9bc3fa28b net: sch: api: add extack support in qdisc_get_rtab
This patch adds extack support for the function qdisc_get_rtab which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_get_rtab failed.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring 653d6fd68d net: sched: sch: add extack for graft callback
This patch adds extack support for graft callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring cbaacc4e8a net: sched: sch: add extack for block callback
This patch adds extack support for block callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring 793d81d6a1 net: sched: sch: add extack to change class
This patch adds extack support for class change callback api. This prepares
to handle extack support inside each specific class implementation.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring 2030721cc0 net: sched: sch: add extack for change qdisc ops
This patch adds extack support for change callback for qdisc ops
structtur to prepare per-qdisc specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring e63d7dfd2d net: sched: sch: add extack for init callback
This patch adds extack support for init callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring 0921559811 net: sched: sch_api: handle generic qdisc errors
This patch adds extack support for generic qdisc handling. The extack
will be set deeper to each called function which is not part of netdev
core api.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:49 -05:00
Alexander Aring ac8ef4ab73 net: sched: fix coding style issues
This patch fix checkpatch issues for upcomming patches according to the
sched api file. It changes mostly how to check on null pointer.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:49 -05:00
Jakub Kicinski 102740bd94 cls_bpf: fix offload assumptions after callback conversion
cls_bpf used to take care of tracking what offload state a filter
is in, i.e. it would track if offload request succeeded or not.
This information would then be used to issue correct requests to
the driver, e.g. requests for statistics only on offloaded filters,
removing only filters which were offloaded, using add instead of
replace if previous filter was not added etc.

This tracking of offload state no longer functions with the new
callback infrastructure.  There could be multiple entities trying
to offload the same filter.

Throw out all the tracking and corresponding commands and simply
pass to the drivers both old and new bpf program.  Drivers will
have to deal with offload state tracking by themselves.

Fixes: 3f7889c4c7 ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 13:08:18 -05:00
Steffen Klassert f53c723902 net: Add asynchronous callbacks for xfrm on layer 2.
This patch implements asynchronous crypto callbacks
and a backlog handler that can be used when IPsec
is done at layer 2 in the TX path. It also extends
the skb validate functions so that we can update
the driver transmit return codes based on async
crypto operation or to indicate that we queued the
packet in a backlog queue.

Joint work with: Aviv Heller <avivh@mellanox.com>

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-20 10:41:36 +01:00
Cong Wang 1df94c3c5d net_sched: properly check for empty skb array on error path
First, the check of &q->ring.queue against NULL is wrong, it
is always false. We should check the value rather than the address.

Secondly, we need the same check in pfifo_fast_reset() too,
as both ->reset() and ->destroy() are called in qdisc_destroy().

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 14:13:12 -05:00
David S. Miller c30abd5e40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-16 22:11:55 -05:00
Jiri Pirko b59e6979a8 net: sched: fix static key imbalance in case of ingress/clsact_init error
Move static key increments to the beginning of the init function
so they pair 1:1 with decrements in ingress/clsact_destroy,
which is called in case ingress/clsact_init fails.

Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 15:43:12 -05:00
Jiri Pirko 343723dd51 net: sched: fix clsact init error path
Since in qdisc_create, the destroy op is called when init fails, we
don't do cleanup in init and leave it up to destroy.
This fixes use-after-free when trying to put already freed block.

Fixes: 6e40cf2d4d ("net: sched: use extended variants of block_get/put in ingress and clsact qdiscs")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 15:43:12 -05:00
Yuval Mintz 428a68af3a net: sched: Move to new offload indication in RED
Let RED utilize the new internal flag, TCQ_F_OFFLOADED,
to mark a given qdisc as offloaded instead of using a dedicated
indication.

Also, change internal logic into looking at said flag when possible.

Fixes: 602f3baf22 ("net_sch: red: Add offload ability to RED qdisc")
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:35:36 -05:00
Yuval Mintz 7a4fa29106 net: sched: Add TCA_HW_OFFLOAD
Qdiscs can be offloaded to HW, but current implementation isn't uniform.
Instead, qdiscs either pass information about offload status via their
TCA_OPTIONS or omit it altogether.

Introduce a new attribute - TCA_HW_OFFLOAD that would form a uniform
uAPI for the offloading status of qdiscs.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:35:36 -05:00
Cong Wang 039af9c66b net_sched: switch to exit_batch for action pernet ops
Since we now hold RTNL lock in tc_action_net_exit(), it is good to
batch them to speedup tc action dismantle.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 13:58:41 -05:00
Pravin Shedge 83593010d3 net: remove duplicate includes
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 13:18:46 -05:00
David S. Miller 51e18a453f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-09 22:09:55 -05:00
Jiri Pirko df45bf84e4 net: sched: fix use-after-free in tcf_block_put_ext
Since the block is freed with last chain being put, once we reach the
end of iteration of list_for_each_entry_safe, the block may be
already freed. I'm hitting this only by creating and deleting clsact:

[  202.171952] ==================================================================
[  202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390
[  202.187590] Read of size 8 at addr ffff880225539a80 by task tc/796
[  202.194508]
[  202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5
[  202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[  202.213613] Call Trace:
[  202.216369]  dump_stack+0xda/0x169
[  202.220192]  ? dma_virt_map_sg+0x147/0x147
[  202.224790]  ? show_regs_print_info+0x54/0x54
[  202.229691]  ? tcf_chain_destroy+0x1dc/0x250
[  202.234494]  print_address_description+0x83/0x3d0
[  202.239781]  ? tcf_block_put_ext+0x240/0x390
[  202.244575]  kasan_report+0x1ba/0x460
[  202.248707]  ? tcf_block_put_ext+0x240/0x390
[  202.253518]  tcf_block_put_ext+0x240/0x390
[  202.258117]  ? tcf_chain_flush+0x290/0x290
[  202.262708]  ? qdisc_hash_del+0x82/0x1a0
[  202.267111]  ? qdisc_hash_add+0x50/0x50
[  202.271411]  ? __lock_is_held+0x5f/0x1a0
[  202.275843]  clsact_destroy+0x3d/0x80 [sch_ingress]
[  202.281323]  qdisc_destroy+0xcb/0x240
[  202.285445]  qdisc_graft+0x216/0x7b0
[  202.289497]  tc_get_qdisc+0x260/0x560

Fix this by holding the block also by chain 0 and put chain 0
explicitly, out of the list_for_each_entry_safe loop at the very
end of tcf_block_put_ext.

Fixes: efbf789739 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:09:08 -05:00
John Fastabend c5ad119fb6 net: sched: pfifo_fast use skb_array
This converts the pfifo_fast qdisc to use the skb_array data structure
and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
the lockless bit that can be a child of a qdisc requiring locking. So
we add logic to clear the lock bit on initialization in these cases when
the qdisc graft operation occurs.

This also removes the logic used to pick the next band to dequeue from
and instead just checks a per priority array for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend ce679e8df7 net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend b01ac095c7 net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
The sch_mq qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Also exports __gnet_stats_copy_queue() to use as a helper function.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend 7e66016f2c net: sched: helpers to sum qlen and qlen for per cpu logic
Add qdisc qlen helper routines for lockless qdiscs to use.

The qdisc qlen is no longer used in the hotpath but it is reported
via stats query on the qdisc so it still needs to be tracked. This
adds the per cpu operations needed along with a helper to return
the summation of per cpu stats.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend fd8e8d1a77 net: sched: check for frozen queue before skb_bad_txq check
I can not think of any reason to pull the bad txq skb off the qdisc if
the txq we plan to send this on is still frozen. So check for frozen
queue first and abort before dequeuing either skb_bad_txq skb or
normal qdisc dequeue() skb.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend 70e57d5e3f net: sched: use skb list for skb_bad_tx
Similar to how gso is handled use skb list for skb_bad_tx this is
required with lockless qdiscs because we may have multiple cores
attempting to push skbs into skb_bad_tx concurrently

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend 7bbde83b18 net: sched: drop qdisc_reset from dev_graft_qdisc
In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy'
operation is called on the old qdisc. The destroy operation will wait
a rcu grace period and call qdisc_rcu_free(). At which point
gso_cpu_skb is free'd along with all stats so no need to zero stats
and gso_cpu_skb from the graft operation itself.

Further after dropping the qdisc locks we can not continue to call
qdisc_reset before waiting an rcu grace period so that the qdisc is
detached from all cpus. By removing the qdisc_reset() here we get
the correct property of waiting an rcu grace period and letting the
qdisc_destroy operation clean up the qdisc correctly.

Note, a refcnt greater than 1 would cause the destroy operation to
be aborted however if this ever happened the reference to the qdisc
would be lost and we would have a memory leak.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend a53851e2c3 net: sched: explicit locking in gso_cpu fallback
This work is preparing the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless in the case we
overrun the netdev, for whatever reason, the netdev returns a busy
error code and the skb is parked on the gso_skb pointer. With many
cores all hitting this case at once its possible to have multiple
sk_buffs here so we turn gso_skb into a queue.

This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend d59f5ffa59 net: sched: a dflt qdisc may be used with per cpu stats
Enable dflt qdisc support for per cpu stats before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.

This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend 29b86cdac0 net: sched: remove remaining uses for qdisc_qlen in xmit path
sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
of the routine only check if it is zero or not. Simplify the logic so
that we don't need to return an actual queue length value.

This introduces a case now where sch_direct_xmit would have returned
a qlen of zero but now it returns true. However in this case all
call sites of sch_direct_xmit will implement a dequeue() and get
a null skb and abort. This trades tracking qlen in the hotpath for
an extra dequeue operation. Overall this seems to be good for
performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend 6b3ba9146f net: sched: allow qdiscs to handle locking
This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improving
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend 6c148184b5 net: sched: cleanup qdisc_run and __qdisc_run semantics
Currently __qdisc_run calls qdisc_run_end() but does not call
qdisc_run_begin(). This makes it hard to track pairs of
qdisc_run_{begin,end} across function calls.

To simplify reading these code paths this patch moves begin/end calls
into qdisc_run().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
Chris Dion 32d3e51a82 net_sched: use macvlan real dev trans_start in dev_trans_start()
Macvlan devices are similar to vlans and do not update their
own trans_start. In order for arp monitoring to work for a bond device
when the slaves are macvlans, obtain its real device.

Signed-off-by: Chris Dion <christopher.dion@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:34:10 -05:00
Cong Wang 4bee00eb25 act_mirred: get rid of mirred_list_lock spinlock
TC actions are no longer freed in RCU callbacks and we should
always have RTNL lock, so this spinlock is no longer needed.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 14:50:13 -05:00
Cong Wang 9f8a739e72 act_mirred: get rid of tcfm_ifindex from struct tcf_mirred
tcfm_dev always points to the correct netdev and we already
hold a refcnt, so no need to use tcfm_ifindex to lookup again.

If we would support moving target netdev across netns, using
pointer would be better than ifindex.

This also fixes dumping obsolete ifindex, now after the
target device is gone we just dump 0 as ifindex.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 14:50:13 -05:00
Cong Wang 9a63b255df net_sched: remove unused parameter from act cleanup ops
No one actually uses it.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 18:07:58 -05:00
Alexander Aring 54160ef6ec net: sched: sch_api: rearrange init handling
This patch fixes the following checkpatch error:

ERROR: do not use assignment in if condition

by rearranging the if condition to execute init callback only if init
callback exists. The whole setup afterwards is called in any case,
doesn't matter if init callback is set or not. This patch has the same
behaviour as before, just without assign err variable in if condition.
It also makes the code easier to read.

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 15:04:27 -05:00
Alexander Aring 0ac4bd68ab net: sched: sch_api: fix code style issues
This patch fix checkpatch issues for upcomming patches according to the
sched api file. It changes checking on null pointer, remove unnecessary
brackets, add variable names for parameters and adjust 80 char width.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 15:04:27 -05:00
Cong Wang efbf789739 net_sched: get rid of rcu_barrier() in tcf_block_put_ext()
Both Eric and Paolo noticed the rcu_barrier() we use in
tcf_block_put_ext() could be a performance bottleneck when
we have a lot of tc classes.

Paolo provided the following to demonstrate the issue:

tc qdisc add dev lo root htb
for I in `seq 1 1000`; do
        tc class add dev lo parent 1: classid 1:$I htb rate 100kbit
        tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb
        for J in `seq 1 10`; do
                tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J
        done
done
time tc qdisc del dev root

real    0m54.764s
user    0m0.023s
sys     0m0.000s

The rcu_barrier() there is to ensure we free the block after all chains
are gone, that is, to queue tcf_block_put_final() at the tail of workqueue.
We can achieve this ordering requirement by refcnt'ing tcf block instead,
that is, the tcf block is freed only when the last chain in this block is
gone. This also simplifies the code.

Paolo reported after this patch we get:

real    0m0.017s
user    0m0.000s
sys     0m0.017s

Tested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 14:53:17 -05:00
Nogah Frankel 8afa10cbe2 net_sched: red: Avoid illegal values
Check the qmin & qmax values doesn't overflow for the given Wlog value.
Check that qmin <= qmax.

Fixes: a783474591 ("[PKT_SCHED]: Generic RED layer")
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 14:37:13 -05:00
Simon Horman 62b32379fd flow_dissector: dissect tunnel info outside __skb_flow_dissect()
Move dissection of tunnel info to outside of the main flow dissection
function, __skb_flow_dissect(). The sole user of this feature, the flower
classifier, is updated to call tunnel info dissection directly, using
skb_flow_dissect_tunnel_info().

This results in a slightly less complex implementation of
__skb_flow_dissect(), in particular removing logic from that call path
which is not used by the majority of users. The expense of this is borne by
the flower classifier which now has to make an extra call for tunnel info
dissection.

This patch should not result in any behavioural change.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 12:09:18 -05:00
Cong Wang 90a6ec8535 act_sample: get rid of tcf_sample_cleanup_rcu()
Similar to commit d7fb60b9ca ("net_sched: get rid of tcfa_rcu"),
TC actions don't need to respect RCU grace period, because it
is either just detached from tc filter (standalone case) or
it is removed together with tc filter (bound case) in which case
RCU grace period is already respected at filter layer.

Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-30 10:19:17 -05:00
Jiri Pirko d51aae68b1 net: sched: cbq: create block for q->link.block
q->link.block is not initialized, that leads to EINVAL when one tries to
add filter there. So initialize it properly.

This can be reproduced by:
$ tc qdisc add dev eth0 root handle 1: cbq avpkt 1000 rate 1000Mbit bandwidth 1000Mbit
$ tc filter add dev eth0 parent 1: protocol ip prio 100 u32 match ip protocol 0 0x00 flowid 1:1

Reported-by: Jaroslav Aster <jaster@redhat.com>
Reported-by: Ivan Vecera <ivecera@redhat.com>
Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-28 16:04:26 -05:00
Paolo Abeni f85729d07c sch_sfq: fix null pointer dereference at timer expiration
While converting sch_sfq to use timer_setup(), the commit cdeabbb881
("net: sched: Convert timers to use timer_setup()") forgot to
initialize the 'sch' field. As a result, the timer callback tries to
dereference a NULL pointer, and the kernel does oops.

Fix it initializing such field at qdisc creation time.

Fixes: cdeabbb881 ("net: sched: Convert timers to use timer_setup()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-28 15:54:05 -05:00
Jakub Kicinski 25415cec50 cls_bpf: don't decrement net's refcount when offload fails
When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point.  This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().

__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out.  The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.

Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.

Fixes: aae2c35ec8 ("cls_bpf: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-28 15:49:44 -05:00
Roman Kapl a60b3f515d net: sched: crash on blocks with goto chain action
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the user after free by holding all chains (except 0, that one is
already held). foreach_safe is not enough in this case.

To reproduce, run the following in a netns and then delete the ns:
    ip link add dtest type dummy
    tc qdisc add dev dtest ingress
    tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action goto chain 2

Fixes: 822e86d997 ("net_sched: remove tcf_block_put_deferred()")
Signed-off-by: Roman Kapl <code@rkapl.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-25 23:57:20 +09:00
David S. Miller e4be7baba8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2017-11-23

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several BPF offloading fixes, from Jakub. Among others:

    - Limit offload to cls_bpf and XDP program types only.
    - Move device validation into the driver and don't make
      any assumptions about the device in the classifier due
      to shared blocks semantics.
    - Don't pass offloaded XDP program into the driver when
      it should be run in native XDP instead. Offloaded ones
      are not JITed for the host in such cases.
    - Don't destroy device offload state when moved to
      another namespace.
    - Revert dumping offload info into user space for now,
      since ifindex alone is not sufficient. This will be
      redone properly for bpf-next tree.

2) Fix test_verifier to avoid using bpf_probe_write_user()
   helper in test cases, since it's dumping a warning into
   kernel log which may confuse users when only running tests.
   Switch to use bpf_trace_printk() instead, from Yonghong.

3) Several fixes for correcting ARG_CONST_SIZE_OR_ZERO semantics
   before it becomes uabi, from Gianluca. More specifically:

    - Add a type ARG_PTR_TO_MEM_OR_NULL that is used only
      by bpf_csum_diff(), where the argument is either a
      valid pointer or NULL. The subsequent ARG_CONST_SIZE_OR_ZERO
      then enforces a valid pointer in case of non-0 size
      or a valid pointer or NULL in case of size 0. Given
      that, the semantics for ARG_PTR_TO_MEM in combination
      with ARG_CONST_SIZE_OR_ZERO are now such that in case
      of size 0, the pointer must always be valid and cannot
      be NULL. This fix in semantics allows for bpf_probe_read()
      to drop the recently added size == 0 check in the helper
      that would become part of uabi otherwise once released.
      At the same time we can then fix bpf_probe_read_str() and
      bpf_perf_event_output() to use ARG_CONST_SIZE_OR_ZERO
      instead of ARG_CONST_SIZE in order to fix recently
      reported issues by Arnaldo et al, where LLVM optimizes
      two boundary checks into a single one for unknown
      variables where the verifier looses track of the variable
      bounds and thus rejects valid programs otherwise.

4) A fix for the verifier for the case when it detects
   comparison of two constants where the branch is guaranteed
   to not be taken at runtime. Verifier will rightfully prune
   the exploration of such paths, but we still pass the program
   to JITs, where they would complain about using reserved
   fields, etc. Track such dead instructions and sanitize
   them with mov r0,r0. Rejection is not possible since LLVM
   may generate them for valid C code and doesn't do as much
   data flow analysis as verifier. For bpf-next we might
   implement removal of such dead code and adjust branches
   instead. Fix from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 02:33:01 +09:00
Willem de Bruijn 0c19f846d5 net: accept UFO datagrams from tuntap and packet
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
  Booted a v4.13 guest kernel with QEMU. On a host kernel before this
  patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
  enabled, same as on a v4.13 host kernel.

  A UFO packet sent from the guest appears on the tap device:
    host:
      nc -l -p -u 8000 &
      tcpdump -n -i tap0

    guest:
      dd if=/dev/zero of=payload.txt bs=1 count=2000
      nc -u 192.16.1.1 8000 < payload.txt

  Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
  packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
  v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 01:37:35 +09:00
Roman Kapl d7aa04a5e8 net: sched: fix crash when deleting secondary chains
If you flush (delete) a filter chain other than chain 0 (such as when
deleting the device), the kernel may run into a use-after-free. The
chain refcount must not be decremented unless we are sure we are done
with the chain.

To reproduce the bug, run:
    ip link add dtest type dummy
    tc qdisc add dev dtest ingress
    tc filter add dev dtest chain 1  parent ffff: flower
    ip link del dtest

Introduced in: commit f93e1cdcf4 ("net/sched: fix filter flushing"),
but unless you have KAsan or luck, you won't notice it until
commit 0dadc117ac ("cls_flower: use tcf_exts_get_net() before call_rcu()")

Fixes: f93e1cdcf4 ("net/sched: fix filter flushing")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Roman Kapl <code@rkapl.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 01:25:37 +09:00
Jakub Kicinski 288b3de55a bpf: offload: move offload device validation out to the drivers
With TC shared block changes we can't depend on correct netdev
pointer being available in cls_bpf.  Move the device validation
to the driver.  Core will only make sure that offloaded programs
are always attached in the driver (or in HW by the driver).  We
trust that drivers which implement offload callbacks will perform
necessary checks.

Moving the checks to the driver is generally a useful thing,
in practice the check should be against a switchdev instance,
not a netdev, given that most ASICs will probably allow using
the same program on many ports.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-11-21 00:37:35 +01:00
Stephen Hemminger 9b0ed89172 netem: remove unnecessary 64 bit modulus
Fix compilation on 32 bit platforms (where doing modulus operation
with 64 bit requires extra glibc functions) by truncation.
The jitter for table distribution is limited to a 32 bit value
because random numbers are scaled as 32 bit value.

Also fix some whitespace.

Fixes: 99803171ef ("netem: add uapi to express delay and jitter in nanoseconds")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-15 14:14:16 +09:00
Stephen Hemminger bce552fd6f netem: use 64 bit divide by rate
Since times are now expressed in nanosecond, need to now do
true 64 bit divide. Old code would truncate rate at 32 bits.
Rename function to better express current usage.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-15 14:14:16 +09:00
Andrew Morton ee9d3429c0 net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issue
gcc-4.4.4 (at lest) has issues with initializers and anonymous unions:

net/sched/sch_red.c: In function 'red_dump_offload':
net/sched/sch_red.c:282: error: unknown field 'stats' specified in initializer
net/sched/sch_red.c:282: warning: initialization makes integer from pointer without a cast
net/sched/sch_red.c:283: error: unknown field 'stats' specified in initializer
net/sched/sch_red.c:283: warning: initialization makes integer from pointer without a cast
net/sched/sch_red.c: In function 'red_dump_stats':
net/sched/sch_red.c:352: error: unknown field 'xstats' specified in initializer
net/sched/sch_red.c:352: warning: initialization makes integer from pointer without a cast

Work around this.

Fixes: 602f3baf22 ("net_sch: red: Add offload ability to RED qdisc")
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:33:07 +09:00
Dave Taht 836af83b54 netem: support delivering packets in delayed time slots
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
         slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
         limit 10000
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
         slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Dave Taht 99803171ef netem: add uapi to express delay and jitter in nanoseconds
netem userspace has long relied on a horrible /proc/net/psched hack
to translate the current notion of "ticks" to nanoseconds.

Expressing latency and jitter instead, in well defined nanoseconds,
increases the dynamic range of emulated delays and jitter in netem.

It will also ease a transition where reducing a tick to nsec
equivalence would constrain the max delay in prior versions of
netem to only 4.3 seconds.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Dave Taht 112f9cb656 netem: convert to qdisc_watchdog_schedule_ns
Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the std "ticks" userspace api automatically,
while allowing for finer grained scheduling to take place.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Manish Kurup 4c5b9d9642 act_vlan: VLAN action rewrite to use RCU lock/unlock and update
Using a spinlock in the VLAN action causes performance issues when the VLAN
action is used on multiple cores. Rewrote the VLAN action to use RCU read
locking for reads and updates instead.
All functions now use an RCU dereferenced pointer to access the VLAN action
context. Modified helper functions used by other modules, to use the RCU as
opposed to directly accessing the structure.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Manish Kurup <manish.kurup@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 15:32:20 +09:00
Manish Kurup e0496cbbf8 act_vlan: Change stats update to use per-core stats
The VLAN action maintains one set of stats across all cores, and uses a
spinlock to synchronize updates to it from the same. Changed this to use a
per-CPU stats context instead.
This change will result in better performance.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Manish Kurup <manish.kurup@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 15:32:20 +09:00
David S. Miller 4dc6758d78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Simple cases of overlapping changes in the packet scheduler.

Must easier to resolve this time.

Which probably means that I screwed it up somehow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 10:00:18 +09:00
Cong Wang 35c55fc156 cls_u32: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:10 +09:00
Cong Wang f2b751053e cls_tcindex: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:10 +09:00
Cong Wang 96585063a2 cls_rsvp: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:10 +09:00
Cong Wang 3fd51de5e3 cls_route: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:10 +09:00
Cong Wang 57767e7853 cls_matchall: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:10 +09:00
Cong Wang d5f984f5af cls_fw: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang 0dadc117ac cls_flower: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang 22f7cec93f cls_flow: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang ed14816814 cls_cgroup: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang aae2c35ec8 cls_bpf: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang 0b2a59894b cls_basic: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.

Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang e4b95c41df net_sched: introduce tcf_exts_get_net() and tcf_exts_put_net()
Instead of holding netns refcnt in tc actions, we can minimize
the holding time by saving it in struct tcf_exts instead. This
means we can just hold netns refcnt right before call_rcu() and
release it after tcf_exts_destroy() is done.

However, because on netns cleanup path we call tcf_proto_destroy()
too, obviously we can not hold netns for a zero refcnt, in this
case we have to do cleanup synchronously. It is fine for RCU too,
the caller cleanup_net() already waits for a grace period.

For other cases, refcnt is non-zero and we can safely grab it as
normal and release it after we are done.

This patch provides two new API for each filter to use:
tcf_exts_get_net() and tcf_exts_put_net(). And all filters now can
use the following pattern:

void __destroy_filter() {
  tcf_exts_destroy();
  tcf_exts_put_net();  // <== release netns refcnt
  kfree();
}
void some_work() {
  rtnl_lock();
  __destroy_filter();
  rtnl_unlock();
}
void some_rcu_callback() {
  tcf_queue_work(some_work);
}

if (tcf_exts_get_net())  // <== hold netns refcnt
  call_rcu(some_rcu_callback);
else
  __destroy_filter();

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang c7e460ce55 Revert "net_sched: hold netns refcnt for each action"
This reverts commit ceffcc5e25.
If we hold that refcnt, the netns can never be destroyed until
all actions are destroyed by user, this breaks our netns design
which we expect all actions are destroyed when we destroy the
whole netns.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Nogah Frankel 8521db4c7e net_sch: cbs: Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS
Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS to match the new convention..

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Nogah Frankel 575ed7d39e net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO
Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new
convention.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Nogah Frankel 602f3baf22 net_sch: red: Add offload ability to RED qdisc
Add the ability to offload RED qdisc by using ndo_setup_tc.
There are four commands for RED offloading:
* TC_RED_SET: handles set and change.
* TC_RED_DESTROY: handle qdisc destroy.
* TC_RED_STATS: update the qdiscs counters (given as reference)
* TC_RED_XSTAT: returns red xstats.

Whether RED is being offloaded is being determined every time dump action
is being called because parent change of this qdisc could change its
offload state but doesn't require any RED function to be called.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:37 +09:00
Jakub Kicinski 6c8dfe21c4 cls_bpf: allow attaching programs loaded for specific device
If TC program is loaded with skip_sw flag, we should allow
the device-specific programs to be accepted.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Colin Ian King 0f04d05751 net: sched: cls_u32: use bitwise & rather than logical && on n->flags
Currently n->flags is being operated on by a logical && operator rather
than a bitwise & operator. This looks incorrect as these should be bit
flag operations. Fix this.

Detected by CoverityScan, CID#1460398 ("Logical vs. bitwise operator")

Fixes: 245dc5121a ("net: sched: cls_u32: call block callbacks for offload")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 22:43:54 +09:00
David S. Miller 2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Linus Torvalds 7ba3ebff9c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Hopefully this is the last batch of networking fixes for 4.14

  Fingers crossed...

   1) Fix stmmac to use the proper sized OF property read, from Bhadram
      Varka.

   2) Fix use after free in net scheduler tc action code, from Cong
      Wang.

   3) Fix SKB control block mangling in tcp_make_synack().

   4) Use proper locking in fib_dump_info(), from Florian Westphal.

   5) Fix IPG encodings in systemport driver, from Florian Fainelli.

   6) Fix division by zero in NV TCP congestion control module, from
      Konstantin Khlebnikov.

   7) Fix use after free in nf_reject_ipv4, from Tejaswi Tanikella"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: systemport: Correct IPG length settings
  tcp: do not mangle skb->cb[] in tcp_make_synack()
  fib: fib_dump_info can no longer use __in_dev_get_rtnl
  stmmac: use of_property_read_u32 instead of read_u8
  net_sched: hold netns refcnt for each action
  net_sched: acquire RTNL in tc_action_net_exit()
  net: vrf: correct FRA_L3MDEV encode type
  tcp_nv: fix division by zero in tcpnv_acked()
  netfilter: nf_reject_ipv4: Fix use-after-free in send_reset
  netfilter: nft_set_hash: disable fast_ops for 2-len keys
2017-11-03 09:09:21 -07:00
Jiri Pirko 46209401f8 net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath
In sch_handle_egress and sch_handle_ingress tp->q is used only in order
to update stats. So stats and filter list are the only things that are
needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc
struct to hold those items. Also, introduce a helper to swap the
mini_Qdisc structures in case filter list head changes.

This removes need for tp->q usage without added overhead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 21:57:24 +09:00
Jiri Pirko c7eb7d7230 net: sched: introduce chain_head_change callback
Add a callback that is to be called whenever head of the chain changes.
Also provide a callback for the default case when the caller gets a
block using non-extended getter.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 21:57:23 +09:00
Cong Wang 8f918d3ff4 net_sched: check NULL in tcf_block_put()
Callers of tcf_block_put() could pass NULL so
we can't use block->q before checking if block is
NULL or not.

tcf_block_put_ext() callers are fine, it is always
non-NULL.

Fixes: 8c4083b30e ("net: sched: add block bind/unbind notif. and extended block_get/put")
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 21:31:15 +09:00
Colin Ian King 53b3847be5 net: sched: cls_bpf: use bitwise & rather than logical && on gen_flags
Currently gen_flags is being operated on by a logical && operator rather
than a bitwise & operator. This looks incorrect as these should be bit
flag operations. Fix this.

Detected by CoverityScan, CID#1460305 ("Logical vs. bitwise operator")

Fixes: 3f7889c4c7 ("net: sched: cls_bpf: call block callbacks for offload)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 15:52:38 +09:00
Jiri Pirko 4bb1b116b7 net: sched: move block offload unbind after all chains are flushed
Currently, the offload unbind is done before the chains are flushed.
That causes driver to unregister block callback before it can get all
the callback calls done during flush, leaving the offloaded tps inside
the HW. So fix the order to prevent this situation and restore the
original behaviour.

Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 15:46:15 +09:00
Cong Wang ceffcc5e25 net_sched: hold netns refcnt for each action
TC actions have been destroyed asynchronously for a long time,
previously in a RCU callback and now in a workqueue. If we
don't hold a refcnt for its netns, we could use the per netns
data structure, struct tcf_idrinfo, after it has been freed by
netns workqueue.

Hold refcnt to ensure netns destroy happens after all actions
are gone.

Fixes: ddf97ccdd7 ("net_sched: add network namespace support for tc actions")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Tested-by: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 10:30:38 +09:00
Cong Wang a159d3c4b8 net_sched: acquire RTNL in tc_action_net_exit()
I forgot to acquire RTNL in tc_action_net_exit()
which leads that action ops->cleanup() is not always
called with RTNL. This usually is not a big deal because
this function is called after all netns refcnt are gone,
but given RTNL protects more than just actions, add it
for safety and consistency.

Also add an assertion to catch other potential bugs.

Fixes: ddf97ccdd7 ("net_sched: add network namespace support for tc actions")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Tested-by: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 10:30:38 +09:00
Linus Torvalds ead751507d License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
 makes it harder for compliance tools to determine the correct license.
 
 By default all files without license information are under the default
 license of the kernel, which is GPL version 2.
 
 Update the files which contain no license information with the 'GPL-2.0'
 SPDX license identifier.  The SPDX identifier is a legally binding
 shorthand, which can be used instead of the full boiler plate text.
 
 This patch is based on work done by Thomas Gleixner and Kate Stewart and
 Philippe Ombredanne.
 
 How this work was done:
 
 Patches were generated and checked against linux-4.14-rc6 for a subset of
 the use cases:
  - file had no licensing information it it.
  - file was a */uapi/* one with no licensing information in it,
  - file was a */uapi/* one with existing licensing information,
 
 Further patches will be generated in subsequent months to fix up cases
 where non-standard license headers were used, and references to license
 had to be inferred by heuristics based on keywords.
 
 The analysis to determine which SPDX License Identifier to be applied to
 a file was done in a spreadsheet of side by side results from of the
 output of two independent scanners (ScanCode & Windriver) producing SPDX
 tag:value files created by Philippe Ombredanne.  Philippe prepared the
 base worksheet, and did an initial spot review of a few 1000 files.
 
 The 4.13 kernel was the starting point of the analysis with 60,537 files
 assessed.  Kate Stewart did a file by file comparison of the scanner
 results in the spreadsheet to determine which SPDX license identifier(s)
 to be applied to the file. She confirmed any determination that was not
 immediately clear with lawyers working with the Linux Foundation.
 
 Criteria used to select files for SPDX license identifier tagging was:
  - Files considered eligible had to be source code files.
  - Make and config files were included as candidates if they contained >5
    lines of source
  - File already had some variant of a license header in it (even if <5
    lines).
 
 All documentation files were explicitly excluded.
 
 The following heuristics were used to determine which SPDX license
 identifiers to apply.
 
  - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.
 
    For non */uapi/* files that summary was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0                                              11139
 
    and resulted in the first patch in this series.
 
    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                        930
 
    and resulted in the second patch in this series.
 
  - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point).  Results summary:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                       270
    GPL-2.0+ WITH Linux-syscall-note                      169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
    LGPL-2.1+ WITH Linux-syscall-note                      15
    GPL-1.0+ WITH Linux-syscall-note                       14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
    LGPL-2.0+ WITH Linux-syscall-note                       4
    LGPL-2.1 WITH Linux-syscall-note                        3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
 
    and that resulted in the third patch in this series.
 
  - when the two scanners agreed on the detected license(s), that became
    the concluded license(s).
 
  - when there was disagreement between the two scanners (one detected a
    license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.
 
  - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply (and
    which scanner probably needed to revisit its heuristics).
 
  - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.
 
  - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.
 
 In total, over 70 hours of logged manual review was done on the
 spreadsheet to determine the SPDX license identifiers to apply to the
 source files by Kate, Philippe, Thomas and, in some cases, confirmation
 by lawyers working with the Linux Foundation.
 
 Kate also obtained a third independent scan of the 4.13 code base from
 FOSSology, and compared selected files where the other two scanners
 disagreed against that SPDX file, to see if there was new insights.  The
 Windriver scanner is based on an older version of FOSSology in part, so
 they are related.
 
 Thomas did random spot checks in about 500 files from the spreadsheets
 for the uapi headers and agreed with SPDX license identifier in the
 files he inspected. For the non-uapi files Thomas did random spot checks
 in about 15000 files.
 
 In initial set of patches against 4.14-rc6, 3 files were found to have
 copy/paste license identifier errors, and have been fixed to reflect the
 correct identifier.
 
 Additionally Philippe spent 10 hours this week doing a detailed manual
 inspection and review of the 12,461 patched files from the initial patch
 version early this week with:
  - a full scancode scan run, collecting the matched texts, detected
    license ids and scores
  - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct
  - reviewing anything where there was no detection but the patch license
    was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
    SPDX license was correct
 
 This produced a worksheet with 20 files needing minor correction.  This
 worksheet was then exported into 3 different .csv files for the
 different types of files to be modified.
 
 These .csv files were then reviewed by Greg.  Thomas wrote a script to
 parse the csv files and add the proper SPDX tag to the file, in the
 format that the file expected.  This script was further refined by Greg
 based on the output to detect more types of files automatically and to
 distinguish between header and source .c files (which need different
 comment types.)  Finally Greg ran the script using the .csv files to
 generate the patches.
 
 Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
 Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
 Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
 6dVh26uchcEQLN/XqUDt
 =x306
 -----END PGP SIGNATURE-----

Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information it it.

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information,

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from of
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few 1000 files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging was:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
     was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there was new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types.) Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Jiri Pirko 7612fb0387 net: sched: remove tc_can_offload check from egdev call
Since the only user, mlx5 driver does the check in
mlx5e_setup_tc_block_cb, no need to check here.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 16:10:39 +09:00
Jiri Pirko 44ae12a768 net: sched: move the can_offload check from binding phase to rule insertion phase
This restores the original behaviour before the block callbacks were
introduced. Allow the drivers to do binding of block always, no matter
if the NETIF_F_HW_TC feature is on or off. Move the check to the block
callback which is called for rule insertion.

Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 16:10:39 +09:00
David S. Miller ed29668d1a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Smooth Cong Wang's bug fix into 'net-next'.  Basically put
the bulk of the tcf_block_put() logic from 'net' into
tcf_block_put_ext(), but after the offload unbind.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 15:23:39 +09:00
Yotam Gigi f1fd20c361 MAINTAINERS: Update Yotam's E-mail
For the time being I will be available in my private mail. Update both the
MAINTAINERS file and the individual modules MODULE_AUTHOR directive with
the new address.

Signed-off-by: Yotam Gigi <yotam.gi@gmail.com>
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 12:19:03 +09:00
Amritha Nambiar 384c181e37 net: sched: Identify hardware traffic classes using classid
This patch offloads the classid to hardware and uses the classid
reserved in the range :ffe0 - :ffef to identify hardware traffic
classes reported via dev->num_tc.

tcf_result structure contains the class ID of the class to which
the packet belongs and is offloaded to hardware via flower filter.
A new helper function is introduced to represent HW traffic
classes 0 through 15 using the reserved classid values :ffe0 - :ffef.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-31 10:45:45 -07:00
Cong Wang 822e86d997 net_sched: remove tcf_block_put_deferred()
In commit 7aa0045dad ("net_sched: introduce a workqueue for RCU callbacks of tc filter")
I defer tcf_chain_flush() to a workqueue, this causes a use-after-free
because qdisc is already destroyed after we queue this work.

The tcf_block_put_deferred() is no longer necessary after we get RTNL
for each tc filter destroy work, no others could jump in at this point.
Same for tcf_chain_hold(), we are fully serialized now.

This also reduces one indirection therefore makes the code more
readable. Note this brings back a rcu_barrier(), however comparing
to the code prior to commit 7aa0045dad we still reduced one
rcu_barrier(). For net-next, we can consider to refcnt tcf block to
avoid it.

Fixes: 7aa0045dad ("net_sched: introduce a workqueue for RCU callbacks of tc filter")
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-31 11:06:01 +09:00
David S. Miller e1ea2f9856 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts here.

NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.

Parallel additions to net/pkt_cls.h and net/sch_generic.h

A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.

The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30 21:09:24 +09:00
Cong Wang 46e235c15c net_sched: fix call_rcu() race on act_sample module removal
Similar to commit c78e1746d3
("net: sched: fix call_rcu() race on classifier module unloads"),
we need to wait for flying RCU callback tcf_sample_cleanup_rcu().

Cc: Yotam Gigi <yotamg@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang 2d132eba1d net_sched: add rtnl assertion to tcf_exts_destroy()
After previous patches, it is now safe to claim that
tcf_exts_destroy() is always called with RTNL lock.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang 27ce4f05e2 net_sched: use tcf_queue_work() in tcindex filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang d4f84a41dc net_sched: use tcf_queue_work() in rsvp filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang c2f3f31d40 net_sched: use tcf_queue_work() in route filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang c0d378ef12 net_sched: use tcf_queue_work() in u32 filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang df2735ee8e net_sched: use tcf_queue_work() in matchall filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang e071dff2a6 net_sched: use tcf_queue_work() in fw filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang 0552c8afa0 net_sched: use tcf_queue_work() in flower filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang 94cdb47566 net_sched: use tcf_queue_work() in flow filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:31 +09:00
Cong Wang b1b5b04fdb net_sched: use tcf_queue_work() in cgroup filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:30 +09:00
Cong Wang e910af676b net_sched: use tcf_queue_work() in bpf filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:30 +09:00
Cong Wang c96a48385d net_sched: use tcf_queue_work() in basic filter
Defer the tcf_exts_destroy() in RCU callback to
tc filter workqueue and get RTNL lock.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:30 +09:00
Cong Wang 7aa0045dad net_sched: introduce a workqueue for RCU callbacks of tc filter
This patch introduces a dedicated workqueue for tc filters
so that each tc filter's RCU callback could defer their
action destroy work to this workqueue. The helper
tcf_queue_work() is introduced for them to use.

Because we hold RTNL lock when calling tcf_block_put(), we
can not simply flush works inside it, therefore we have to
defer it again to this workqueue and make sure all flying RCU
callbacks have already queued their work before this one, in
other words, to ensure this is the last one to execute to
prevent any use-after-free.

On the other hand, this makes tcf_block_put() ugly and
harder to understand. Since David and Eric strongly dislike
adding synchronize_rcu(), this is probably the only
solution that could make everyone happy.

Please also see the code comments below.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 22:49:30 +09:00
Cong Wang 50317fce2c net_sched: avoid matching qdisc with zero handle
Davide found the following script triggers a NULL pointer
dereference:

ip l a name eth0 type dummy
tc q a dev eth0 parent :1 handle 1: htb

This is because for a freshly created netdevice noop_qdisc
is attached and when passing 'parent :1', kernel actually
tries to match the major handle which is 0 and noop_qdisc
has handle 0 so is matched by mistake. Commit 69012ae425
tries to fix a similar bug but still misses this case.

Handle 0 is not a valid one, should be just skipped. In
fact, kernel uses it as TC_H_UNSPEC.

Fixes: 69012ae425 ("net: sched: fix handling of singleton qdiscs with qdisc_hash")
Fixes: 59cc1f61f0 ("net: sched:convert qdisc linked list to hashtable")
Reported-by: Davide Caratti <dcaratti@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 17:55:03 +09:00
Vinicius Costa Gomes 3d0bd028ff net/sched: Add support for HW offloading for CBS
This adds support for offloading the CBS algorithm to the controller,
if supported. Drivers wanting to support CBS offload must implement
the .ndo_setup_tc callback and handle the TC_SETUP_CBS (introduced
here) type.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-27 09:49:24 -07:00
Vinicius Costa Gomes 585d763af0 net/sched: Introduce Credit Based Shaper (CBS) qdisc
This queueing discipline implements the shaper algorithm defined by
the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.

It's primary usage is to apply some bandwidth reservation to user
defined traffic classes, which are mapped to different queues via the
mqprio qdisc.

Only a simple software implementation is added for now.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Tested-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-27 09:48:02 -07:00
Jesus Sanchez-Palencia 0f7787b413 net/sched: Add select_queue() class_ops for mqprio
When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch
the netdev_queue pointer that the current child qdisc is associated
with before creating the new qdisc.

Currently, when using mqprio as root qdisc, the kernel will end up
getting the queue #0 pointer from the mqprio (root qdisc), which leaves
any new child qdisc with a possibly wrong netdev_queue pointer.

Implementing the Qdisc_class_ops select_queue() on mqprio fixes this
issue and avoid an inconsistent state when child qdiscs are replaced.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Tested-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-27 09:41:50 -07:00
Jesus Sanchez-Palencia ce8a75f60b net/sched: Change behavior of mq select_queue()
Currently, the class_ops select_queue() implementation on sch_mq
returns a pointer to netdev_queue #0 when it receives and invalid
qdisc id. That can be misleading since all of mq's inner qdiscs are
attached to a valid netdev_queue.

Here we fix that by returning NULL when a qdisc id is invalid. This is
aligned with how select_queue() is implemented for sch_mqprio in the
next patch on this series, keeping a consistent behavior between these
two qdiscs.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Tested-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-27 09:41:45 -07:00
Jesus Sanchez-Palencia 26aa0459fa net/sched: Check for null dev_queue on create flow
In qdisc_alloc() the dev_queue pointer was used without any checks
being performed. If qdisc_create() gets a null dev_queue pointer, it
just passes it along to qdisc_alloc(), leading to a crash. That
happens if a root qdisc implements select_queue() and returns a null
dev_queue pointer for an "invalid handle", for example, or if the
dev_queue associated with the parent qdisc is null.

This patch is in preparation for the next in this series, where
select_queue() is being added to mqprio and as it may return a null
dev_queue.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Tested-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-27 09:41:38 -07:00
Or Gerlitz 9d452cebd7 net/sched: Fix actions list corruption when adding offloaded tc flows
Prior to commit b3f55bdda8, the networking core doesn't wire an in-place
actions list the when the low level driver is called to offload the flow,
but all low level drivers do that (call tcf_exts_to_list()) in their
offloading "add" logic.

Now, the in-place list is set in the core which goes over the list in a loop,
but also by the hw driver when their offloading code is invoked indirectly:

	cls_xxx add flow -> tc_setup_cb_call -> tc_exts_setup_cb_egdev_call -> hw driver

which messes up the core list instance upon driver return. Fix that by avoiding
in-place list on the net core code that deals with adding flows.

Fixes: b3f55bdda8 ('net: sched: introduce per-egress action device callbacks')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-24 19:00:54 +09:00
Gustavo A. R. Silva f3ae608edb net: sched: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:07:08 +01:00
Jiri Pirko 8d26d5636d net: sched: avoid ndo_setup_tc calls for TC_SETUP_CLS*
All drivers are converted to use block callbacks for TC_SETUP_CLS*.
So it is now safe to remove the calls to ndo_setup_tc from cls_*

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:08 +01:00
Jiri Pirko 3f7889c4c7 net: sched: cls_bpf: call block callbacks for offload
Use the newly introduced callbacks infrastructure and call block
callbacks alongside with the existing per-netdev ndo_setup_tc.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:07 +01:00
Jiri Pirko 245dc5121a net: sched: cls_u32: call block callbacks for offload
Use the newly introduced callbacks infrastructure and call block
callbacks alongside with the existing per-netdev ndo_setup_tc.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:07 +01:00
Jiri Pirko 7746041192 net: sched: cls_u32: swap u32_remove_hw_knode and u32_remove_hw_hnode
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:07 +01:00
Jiri Pirko 2447a96f88 net: sched: cls_matchall: call block callbacks for offload
Use the newly introduced callbacks infrastructure and call block
callbacks alongside with the existing per-netdev ndo_setup_tc.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:07 +01:00
Jiri Pirko 208c0f4b52 net: sched: use tc_setup_cb_call to call per-block callbacks
Extend the tc_setup_cb_call entrypoint function originally used only for
action egress devices callbacks to call per-block callbacks as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:07 +01:00
Jiri Pirko acb674428c net: sched: introduce per-block callbacks
Introduce infrastructure that allows drivers to register callbacks that
are called whenever tc would offload inserted rule for a specific block.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:06 +01:00
Jiri Pirko 6e40cf2d4d net: sched: use extended variants of block_get/put in ingress and clsact qdiscs
Use previously introduced extended variants of block get and put
functions. This allows to specify a binder types specific to clsact
ingress/egress which is useful for drivers to distinguish who actually
got the block.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 03:04:06 +01:00