License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifndef __LINUX_RTNETLINK_H
|
|
|
|
#define __LINUX_RTNETLINK_H
|
|
|
|
|
|
|
|
|
2006-03-21 14:23:58 +08:00
|
|
|
#include <linux/mutex.h>
|
2010-11-15 14:01:59 +08:00
|
|
|
#include <linux/netdevice.h>
|
2014-05-13 06:11:20 +08:00
|
|
|
#include <linux/wait.h>
|
2018-09-25 00:22:49 +08:00
|
|
|
#include <linux/refcount.h>
|
2012-10-13 17:46:48 +08:00
|
|
|
#include <uapi/linux/rtnetlink.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2007-11-20 14:26:51 +08:00
|
|
|
extern int rtnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, u32 group, int echo);
|
|
|
|
extern int rtnl_unicast(struct sk_buff *skb, struct net *net, u32 pid);
|
2009-02-25 15:18:28 +08:00
|
|
|
extern void rtnl_notify(struct sk_buff *skb, struct net *net, u32 pid,
|
|
|
|
u32 group, struct nlmsghdr *nlh, gfp_t flags);
|
2007-11-20 14:26:51 +08:00
|
|
|
extern void rtnl_set_sk_err(struct net *net, u32 group, int error);
|
2005-04-17 06:20:36 +08:00
|
|
|
extern int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics);
|
2006-11-28 01:27:07 +08:00
|
|
|
extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst,
|
2012-07-10 20:06:14 +08:00
|
|
|
u32 id, long expires, u32 error);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-10-24 07:02:42 +08:00
|
|
|
void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change, gfp_t flags);
|
2017-10-03 19:53:23 +08:00
|
|
|
void rtmsg_ifinfo_newnet(int type, struct net_device *dev, unsigned int change,
|
2018-01-25 22:01:39 +08:00
|
|
|
gfp_t flags, int *new_nsid, int new_ifindex);
|
2014-12-04 05:46:24 +08:00
|
|
|
struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
|
2017-05-27 22:14:34 +08:00
|
|
|
unsigned change, u32 event,
|
2018-01-25 22:01:39 +08:00
|
|
|
gfp_t flags, int *new_nsid,
|
|
|
|
int new_ifindex);
|
2014-12-04 05:46:24 +08:00
|
|
|
void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev,
|
|
|
|
gfp_t flags);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-03-21 14:23:58 +08:00
|
|
|
/* RTNL is used as a global lock for all changes to network configuration */
|
2005-04-17 06:20:36 +08:00
|
|
|
extern void rtnl_lock(void);
|
|
|
|
extern void rtnl_unlock(void);
|
2006-03-21 14:23:58 +08:00
|
|
|
extern int rtnl_trylock(void);
|
2008-04-24 13:10:48 +08:00
|
|
|
extern int rtnl_is_locked(void);
|
2018-03-15 03:17:20 +08:00
|
|
|
extern int rtnl_lock_killable(void);
|
2018-09-25 00:22:49 +08:00
|
|
|
extern bool refcount_dec_and_rtnl_lock(refcount_t *r);
|
2014-05-13 06:11:20 +08:00
|
|
|
|
|
|
|
extern wait_queue_head_t netdev_unregistering_wq;
|
2018-03-27 23:02:23 +08:00
|
|
|
extern struct rw_semaphore pernet_ops_rwsem;
|
net: Introduce net_rwsem to protect net_namespace_list
rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
he/she has no a possibility to do that without exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
and this can't be sleepable. Also, sometimes it may be need
really prevent net_namespace_list growth, so for_each_net_rcu()
is not fit there.
This patch introduces new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows not-exclusive iterations over net
namespaces list. It allows to stop using rtnl_lock()
in several places (what is made in next patches) and makes
less the time, we keep rtnl_mutex. Here we just add new lock,
while the explanation of we can remove rtnl_lock() there are
in next patches.
Fine grained locks generally are better, then one big lock,
so let's do that with net_namespace_list, while the situation
allows that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 00:20:32 +08:00
|
|
|
extern struct rw_semaphore net_rwsem;
|
2014-05-13 06:11:20 +08:00
|
|
|
|
2010-02-23 09:04:49 +08:00
|
|
|
#ifdef CONFIG_PROVE_LOCKING
|
2015-10-08 21:29:02 +08:00
|
|
|
extern bool lockdep_rtnl_is_held(void);
|
2013-11-26 14:33:52 +08:00
|
|
|
#else
|
2015-10-08 21:29:02 +08:00
|
|
|
static inline bool lockdep_rtnl_is_held(void)
|
2013-11-26 14:33:52 +08:00
|
|
|
{
|
2015-10-08 21:29:02 +08:00
|
|
|
return true;
|
2013-11-26 14:33:52 +08:00
|
|
|
}
|
2010-02-23 09:04:49 +08:00
|
|
|
#endif /* #ifdef CONFIG_PROVE_LOCKING */
|
2006-03-21 14:23:58 +08:00
|
|
|
|
2010-09-09 05:15:32 +08:00
|
|
|
/**
|
|
|
|
* rcu_dereference_rtnl - rcu_dereference with debug checking
|
|
|
|
* @p: The pointer to read, prior to dereferencing
|
|
|
|
*
|
|
|
|
* Do an rcu_dereference(p), but check caller either holds rcu_read_lock()
|
2010-10-05 15:29:48 +08:00
|
|
|
* or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference()
|
2010-09-09 05:15:32 +08:00
|
|
|
*/
|
|
|
|
#define rcu_dereference_rtnl(p) \
|
2011-07-08 20:39:41 +08:00
|
|
|
rcu_dereference_check(p, lockdep_rtnl_is_held())
|
2010-09-09 05:15:32 +08:00
|
|
|
|
2014-09-13 11:08:20 +08:00
|
|
|
/**
|
|
|
|
* rcu_dereference_bh_rtnl - rcu_dereference_bh with debug checking
|
|
|
|
* @p: The pointer to read, prior to dereference
|
|
|
|
*
|
|
|
|
* Do an rcu_dereference_bh(p), but check caller either holds rcu_read_lock_bh()
|
|
|
|
* or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference_bh()
|
|
|
|
*/
|
|
|
|
#define rcu_dereference_bh_rtnl(p) \
|
|
|
|
rcu_dereference_bh_check(p, lockdep_rtnl_is_held())
|
|
|
|
|
2010-09-15 19:07:15 +08:00
|
|
|
/**
|
2010-10-05 15:29:48 +08:00
|
|
|
* rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL
|
2010-09-15 19:07:15 +08:00
|
|
|
* @p: The pointer to read, prior to dereferencing
|
|
|
|
*
|
2010-10-05 15:29:48 +08:00
|
|
|
* Return the value of the specified RCU-protected pointer, but omit
|
2017-10-10 01:37:22 +08:00
|
|
|
* the READ_ONCE(), because caller holds RTNL.
|
2010-09-15 19:07:15 +08:00
|
|
|
*/
|
|
|
|
#define rtnl_dereference(p) \
|
2010-10-05 15:29:48 +08:00
|
|
|
rcu_dereference_protected(p, lockdep_rtnl_is_held())
|
2010-09-15 19:07:15 +08:00
|
|
|
|
2010-10-02 14:11:55 +08:00
|
|
|
static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
|
|
|
|
{
|
|
|
|
return rtnl_dereference(dev->ingress_queue);
|
|
|
|
}
|
|
|
|
|
2018-09-25 00:22:51 +08:00
|
|
|
static inline struct netdev_queue *dev_ingress_queue_rcu(struct net_device *dev)
|
|
|
|
{
|
|
|
|
return rcu_dereference(dev->ingress_queue);
|
|
|
|
}
|
|
|
|
|
net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-11 05:07:54 +08:00
|
|
|
struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
|
|
|
|
|
2015-05-14 00:19:37 +08:00
|
|
|
#ifdef CONFIG_NET_INGRESS
|
net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-11 05:07:54 +08:00
|
|
|
void net_inc_ingress_queue(void);
|
|
|
|
void net_dec_ingress_queue(void);
|
|
|
|
#endif
|
2010-10-02 14:11:55 +08:00
|
|
|
|
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 05:29:47 +08:00
|
|
|
#ifdef CONFIG_NET_EGRESS
|
|
|
|
void net_inc_egress_queue(void);
|
|
|
|
void net_dec_egress_queue(void);
|
net: sched: use queue_mapping to pick tx queue
This patch fixes issue:
* If we install tc filters with act_skbedit in clsact hook.
It doesn't work, because netdev_core_pick_tx() overwrites
queue_mapping.
$ tc filter ... action skbedit queue_mapping 1
And this patch is useful:
* We can use FQ + EDT to implement efficient policies. Tx queues
are picked by xps, ndo_select_queue of netdev driver, or skb hash
in netdev_core_pick_tx(). In fact, the netdev driver, and skb
hash are _not_ under control. xps uses the CPUs map to select Tx
queues, but we can't figure out which task_struct of pod/containter
running on this cpu in most case. We can use clsact filters to classify
one pod/container traffic to one Tx queue. Why ?
In containter networking environment, there are two kinds of pod/
containter/net-namespace. One kind (e.g. P1, P2), the high throughput
is key in these applications. But avoid running out of network resource,
the outbound traffic of these pods is limited, using or sharing one
dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods
(e.g. Pn), the low latency of data access is key. And the traffic is not
limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc.
This choice provides two benefits. First, contention on the HTB/FQ Qdisc
lock is significantly reduced since fewer CPUs contend for the same queue.
More importantly, Qdisc contention can be eliminated completely if each
CPU has its own FIFO Qdisc for the second kind of pods.
There must be a mechanism in place to support classifying traffic based on
pods/container to different Tx queues. Note that clsact is outside of Qdisc
while Qdisc can run a classifier to select a sub-queue under the lock.
In general recording the decision in the skb seems a little heavy handed.
This patch introduces a per-CPU variable, suggested by Eric.
The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
- Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
is set in qdisc->enqueue() though tx queue has been selected in
netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
firstly in __dev_queue_xmit(), is useful:
- Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
For example, eth0, macvlan in pod, which root Qdisc install skbedit
queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
because there is no filters in clsact or tx Qdisc of this netdev.
Same action taked in eth0, ixgbe in Host.
- Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
in __dev_queue_xmit when processing next packets.
For performance reasons, use the static key. If user does not config the NET_EGRESS,
the patch will not be compiled.
+----+ +----+ +----+
| P1 | | P2 | | Pn |
+----+ +----+ +----+
| | |
+-----------+-----------+
|
| clsact/skbedit
| MQ
v
+-----------+-----------+
| q0 | q1 | qn
v v v
HTB/FQ HTB/FQ ... FIFO
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-16 00:40:45 +08:00
|
|
|
void netdev_xmit_skip_txqueue(bool skip);
|
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 05:29:47 +08:00
|
|
|
#endif
|
|
|
|
|
2016-06-14 11:21:50 +08:00
|
|
|
void rtnetlink_init(void);
|
|
|
|
void __rtnl_unlock(void);
|
|
|
|
void rtnl_kfree_skbs(struct sk_buff *head, struct sk_buff *tail);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-12-21 17:40:04 +08:00
|
|
|
#define ASSERT_RTNL() \
|
|
|
|
WARN_ONCE(!rtnl_is_locked(), \
|
|
|
|
"RTNL: assertion failed at %s (%d)\n", __FILE__, __LINE__)
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-04-15 14:43:56 +08:00
|
|
|
extern int ndo_dflt_fdb_dump(struct sk_buff *skb,
|
|
|
|
struct netlink_callback *cb,
|
|
|
|
struct net_device *dev,
|
2014-07-10 19:01:58 +08:00
|
|
|
struct net_device *filter_dev,
|
2016-08-31 12:56:45 +08:00
|
|
|
int *idx);
|
2013-03-06 23:39:42 +08:00
|
|
|
extern int ndo_dflt_fdb_add(struct ndmsg *ndm,
|
|
|
|
struct nlattr *tb[],
|
|
|
|
struct net_device *dev,
|
|
|
|
const unsigned char *addr,
|
2014-11-28 21:34:15 +08:00
|
|
|
u16 vid,
|
|
|
|
u16 flags);
|
2013-03-06 23:39:42 +08:00
|
|
|
extern int ndo_dflt_fdb_del(struct ndmsg *ndm,
|
|
|
|
struct nlattr *tb[],
|
|
|
|
struct net_device *dev,
|
2014-11-28 21:34:15 +08:00
|
|
|
const unsigned char *addr,
|
|
|
|
u16 vid);
|
2012-10-24 16:13:09 +08:00
|
|
|
|
|
|
|
extern int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
|
2014-11-28 21:34:25 +08:00
|
|
|
struct net_device *dev, u16 mode,
|
2015-06-22 15:27:17 +08:00
|
|
|
u32 flags, u32 mask, int nlflags,
|
|
|
|
u32 filter_mask,
|
|
|
|
int (*vlan_fill)(struct sk_buff *skb,
|
|
|
|
struct net_device *dev,
|
|
|
|
u32 filter_mask));
|
2022-03-03 00:31:23 +08:00
|
|
|
|
|
|
|
extern void rtnl_offload_xstats_notify(struct net_device *dev);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* __LINUX_RTNETLINK_H */
|