// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io
 */

/* Devmap's primary use is as a backend map for the XDP BPF helper call
 * bpf_redirect_map(). Because XDP is mostly concerned with performance, we
 * have spent some effort ensuring the datapath with redirect maps does not use
 * any locking. This is a quick note on the details.
 *
 * We have three possible paths into the devmap control plane: bpf syscalls,
 * bpf programs, and driver side xmit/flush operations. A bpf syscall will
 * invoke an update, delete, or lookup operation. To ensure updates and
 * deletes appear atomic from the datapath side, xchg() is used to modify the
 * netdev_map array. Then, because the datapath does a lookup into the
 * netdev_map array (read-only) from an RCU critical section, we use call_rcu()
 * to wait for an RCU grace period before freeing the old data structures. This
 * ensures the datapath always has a valid copy. However, the datapath does a
 * "flush" operation that pushes any pending packets in the driver outside the
 * RCU critical section. Each bpf_dtab_netdev tracks these pending operations
 * using a per-cpu flush list. The bpf_dtab_netdev object will not be destroyed
 * until this list is empty, indicating that all outstanding flush operations
 * have completed.
 *
 * BPF syscalls may race with BPF program calls on any of the update, delete
 * or lookup operations. As noted above, the xchg() operation also keeps the
 * netdev_map consistent in this case. From the devmap side, BPF programs
 * calling into these operations are the same as multiple user space threads
 * making system calls.
 *
 * Finally, any of the above may race with a netdev_unregister notifier. The
 * unregister notifier must search the map structure for entries that hold a
 * reference to the net device being removed and clear them. This is a two step
 * process: (a) dereference the bpf_dtab_netdev object in netdev_map and (b)
 * check whether the ifindex is the same as that of the net_device being
 * removed. When removing the dev, a cmpxchg() is used to ensure the correct
 * dev is removed; in the case of a concurrent update or delete operation it is
 * possible that the initially referenced dev is no longer in the map. While
 * the notifier hook walks the map, we know that new dev references cannot be
 * added by the user because core infrastructure ensures dev_get_by_index()
 * calls will fail at this point.
 *
 * The devmap_hash type is a map type which interprets keys as ifindexes and
 * indexes these using a hashmap. This allows maps that use ifindex as the key
 * to be densely packed instead of having holes in the lookup array for unused
 * ifindexes. The setup and packet enqueue/send code is shared between the two
 * types of devmap; only the lookup and insertion differ.
 */
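
/* Usage sketch (illustrative only, not part of this file): a minimal XDP
 * program redirecting packets through a DEVMAP via bpf_redirect_map(). The
 * map name "tx_ports", its size and the key value are assumptions made for
 * the example; user space is assumed to populate the map entries.
 *
 *	struct {
 *		__uint(type, BPF_MAP_TYPE_DEVMAP);
 *		__uint(key_size, sizeof(__u32));
 *		__uint(value_size, sizeof(__u32));
 *		__uint(max_entries, 64);
 *	} tx_ports SEC(".maps");
 *
 *	SEC("xdp")
 *	int xdp_redirect_devmap(struct xdp_md *ctx)
 *	{
 *		__u32 key = 0;	// slot chosen by policy; 0 is arbitrary here
 *
 *		return bpf_redirect_map(&tx_ports, key, 0);
 *	}
 */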

#include <linux/bpf.h>
#include <net/xdp.h>
#include <linux/filter.h>
#include <trace/events/xdp.h>

#define DEV_CREATE_FLAG_MASK \
        (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)

struct xdp_dev_bulk_queue {
        struct xdp_frame *q[DEV_MAP_BULK_SIZE];
        struct list_head flush_node;
        struct net_device *dev;
        struct net_device *dev_rx;
        unsigned int count;
};

/* DEVMAP values */
struct bpf_devmap_val {
        u32 ifindex;   /* device index */
        union {
                int   fd;  /* prog fd on map write */
                u32   id;  /* prog id on map read */
        } bpf_prog;
};
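
/* Userspace usage sketch (illustrative; assumes libbpf and an already created
 * devmap whose fd is map_fd): writing an entry with a per-device XDP program
 * attached uses the 8-byte value layout, while reads report the program id.
 *
 *	struct bpf_devmap_val val = {
 *		.ifindex = if_nametoindex("eth0"),	// assumed device name
 *		.bpf_prog.fd = prog_fd,			// assumed loaded XDP prog fd
 *	};
 *	__u32 key = 0;
 *
 *	if (bpf_map_update_elem(map_fd, &key, &val, BPF_ANY))
 *		// handle error
 *		;
 *
 * Reading the same entry back via bpf_map_lookup_elem() reports bpf_prog.id
 * rather than the fd.
 */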

struct bpf_dtab_netdev {
        struct net_device *dev; /* must be first member, due to tracepoint */
        struct hlist_node index_hlist;
        struct bpf_dtab *dtab;
        struct bpf_prog *xdp_prog;
        struct rcu_head rcu;
        unsigned int idx;
        struct bpf_devmap_val val;
};

struct bpf_dtab {
        struct bpf_map map;
        struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
        struct list_head list;

        /* these are only used for DEVMAP_HASH type maps */
        struct hlist_head *dev_index_head;
        spinlock_t index_lock;
        unsigned int items;
        u32 n_buckets;
};

static DEFINE_PER_CPU(struct list_head, dev_flush_list);
static DEFINE_SPINLOCK(dev_map_lock);
static LIST_HEAD(dev_map_list);

static struct hlist_head *dev_map_create_hash(unsigned int entries)
{
        int i;
        struct hlist_head *hash;

        hash = kmalloc_array(entries, sizeof(*hash), GFP_KERNEL);
        if (hash != NULL)
                for (i = 0; i < entries; i++)
                        INIT_HLIST_HEAD(&hash[i]);

        return hash;
}

static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab,
                                                    int idx)
{
        return &dtab->dev_index_head[idx & (dtab->n_buckets - 1)];
}

static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
{
        u32 valsize = attr->value_size;
        u64 cost = 0;
        int err;

        /* check sanity of attributes. 2 value sizes supported:
         * 4 bytes: ifindex
         * 8 bytes: ifindex + prog fd
         */
        if (attr->max_entries == 0 || attr->key_size != 4 ||
            (valsize != offsetofend(struct bpf_devmap_val, ifindex) &&
             valsize != offsetofend(struct bpf_devmap_val, bpf_prog.fd)) ||
            attr->map_flags & ~DEV_CREATE_FLAG_MASK)
                return -EINVAL;

        /* Lookup returns a pointer straight to dev->ifindex, so make sure the
         * verifier prevents writes from the BPF side
         */
        attr->map_flags |= BPF_F_RDONLY_PROG;

        bpf_map_init_from_attr(&dtab->map, attr);

        if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
                dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries);

                if (!dtab->n_buckets) /* Overflow check */
                        return -EINVAL;
                cost += (u64) sizeof(struct hlist_head) * dtab->n_buckets;
        } else {
                cost += (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
        }

        /* if map size is larger than memlock limit, reject it */
        err = bpf_map_charge_init(&dtab->map.memory, cost);
        if (err)
                return -EINVAL;

        if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
                dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets);
                if (!dtab->dev_index_head)
                        goto free_charge;

                spin_lock_init(&dtab->index_lock);
        } else {
                dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
                                                      sizeof(struct bpf_dtab_netdev *),
                                                      dtab->map.numa_node);
                if (!dtab->netdev_map)
                        goto free_charge;
        }

        return 0;

free_charge:
        bpf_map_charge_finish(&dtab->map.memory);
        return -ENOMEM;
}

static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
{
        struct bpf_dtab *dtab;
        int err;

        if (!capable(CAP_NET_ADMIN))
                return ERR_PTR(-EPERM);

        dtab = kzalloc(sizeof(*dtab), GFP_USER);
        if (!dtab)
                return ERR_PTR(-ENOMEM);

        err = dev_map_init_map(dtab, attr);
        if (err) {
                kfree(dtab);
                return ERR_PTR(err);
        }

        spin_lock(&dev_map_lock);
        list_add_tail_rcu(&dtab->list, &dev_map_list);
        spin_unlock(&dev_map_lock);

        return &dtab->map;
}
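
/* Userspace creation sketch (illustrative; assumes a recent libbpf that
 * provides bpf_map_create()): the attribute checks in dev_map_init_map()
 * translate to a 4-byte key and a 4- or 8-byte value at map creation time,
 * roughly:
 *
 *	int map_fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, "tx_ports",
 *				    sizeof(__u32),
 *				    sizeof(struct bpf_devmap_val),
 *				    64, NULL);
 *
 * A 4-byte value (plain ifindex) is equally valid; any other value size is
 * rejected with -EINVAL.
 */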

static void dev_map_free(struct bpf_map *map)
{
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        int i;

        /* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
         * so the programs (there can be more than one that used this map) were
         * disconnected from events. The following synchronize_rcu() guarantees
         * that both rcu read critical sections complete and waits for
         * preempt-disable regions (NAPI being the relevant context here) so we
         * are certain there will be no further reads against the netdev_map and
         * all flush operations are complete. Flush operations can only be done
         * from NAPI context for this reason.
         */

        spin_lock(&dev_map_lock);
        list_del_rcu(&dtab->list);
        spin_unlock(&dev_map_lock);

        bpf_clear_redirect_map(map);
        synchronize_rcu();

        /* Make sure prior __dev_map_entry_free() calls have completed. */
        rcu_barrier();

        if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
                for (i = 0; i < dtab->n_buckets; i++) {
                        struct bpf_dtab_netdev *dev;
                        struct hlist_head *head;
                        struct hlist_node *next;

                        head = dev_map_index_hash(dtab, i);

                        hlist_for_each_entry_safe(dev, next, head, index_hlist) {
                                hlist_del_rcu(&dev->index_hlist);
                                if (dev->xdp_prog)
                                        bpf_prog_put(dev->xdp_prog);
                                dev_put(dev->dev);
                                kfree(dev);
                        }
                }

                kfree(dtab->dev_index_head);
        } else {
                for (i = 0; i < dtab->map.max_entries; i++) {
                        struct bpf_dtab_netdev *dev;

                        dev = dtab->netdev_map[i];
                        if (!dev)
                                continue;

                        if (dev->xdp_prog)
                                bpf_prog_put(dev->xdp_prog);
                        dev_put(dev->dev);
                        kfree(dev);
                }

                bpf_map_area_free(dtab->netdev_map);
        }

        kfree(dtab);
}

static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
{
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        u32 index = key ? *(u32 *)key : U32_MAX;
        u32 *next = next_key;

        if (index >= dtab->map.max_entries) {
                *next = 0;
                return 0;
        }

        if (index == dtab->map.max_entries - 1)
                return -ENOENT;
        *next = index + 1;
        return 0;
}

struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key)
{
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        struct hlist_head *head = dev_map_index_hash(dtab, key);
        struct bpf_dtab_netdev *dev;

        hlist_for_each_entry_rcu(dev, head, index_hlist,
                                 lockdep_is_held(&dtab->index_lock))
                if (dev->idx == key)
                        return dev;

        return NULL;
}

static int dev_map_hash_get_next_key(struct bpf_map *map, void *key,
                                     void *next_key)
{
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        u32 idx, *next = next_key;
        struct bpf_dtab_netdev *dev, *next_dev;
        struct hlist_head *head;
        int i = 0;

        if (!key)
                goto find_first;

        idx = *(u32 *)key;

        dev = __dev_map_hash_lookup_elem(map, idx);
        if (!dev)
                goto find_first;

        next_dev = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&dev->index_hlist)),
                                    struct bpf_dtab_netdev, index_hlist);

        if (next_dev) {
                *next = next_dev->idx;
                return 0;
        }

        i = idx & (dtab->n_buckets - 1);
        i++;

find_first:
        for (; i < dtab->n_buckets; i++) {
                head = dev_map_index_hash(dtab, i);

                next_dev = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
                                            struct bpf_dtab_netdev,
                                            index_hlist);
                if (next_dev) {
                        *next = next_dev->idx;
                        return 0;
                }
        }

        return -ENOENT;
}
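
/* Userspace iteration sketch (illustrative; assumes libbpf): both devmap types
 * support the usual get_next_key walk, so dumping the keys is roughly
 *
 *	__u32 key, next_key;
 *	void *prev = NULL;
 *
 *	while (!bpf_map_get_next_key(map_fd, prev, &next_key)) {
 *		// for DEVMAP_HASH, next_key is an existing user-chosen key
 *		// (typically an ifindex), not a dense array index
 *		key = next_key;
 *		prev = &key;
 *	}
 */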

bool dev_map_can_have_prog(struct bpf_map *map)
{
        if ((map->map_type == BPF_MAP_TYPE_DEVMAP ||
             map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) &&
            map->value_size != offsetofend(struct bpf_devmap_val, ifindex))
                return true;

        return false;
}

static int bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
{
        struct net_device *dev = bq->dev;
        int sent = 0, drops = 0, err = 0;
        int i;

        if (unlikely(!bq->count))
                return 0;

        for (i = 0; i < bq->count; i++) {
                struct xdp_frame *xdpf = bq->q[i];

                prefetch(xdpf);
        }

        sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
        if (sent < 0) {
                err = sent;
                sent = 0;
                goto error;
        }
        drops = bq->count - sent;
out:
        bq->count = 0;

        trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
        bq->dev_rx = NULL;
        __list_del_clearprev(&bq->flush_node);
        return 0;
error:
        /* If ndo_xdp_xmit fails with an errno, no frames have been
         * xmit'ed and it's our responsibility to free them all.
         */
        for (i = 0; i < bq->count; i++) {
                struct xdp_frame *xdpf = bq->q[i];

                xdp_return_frame_rx_napi(xdpf);
                drops++;
        }
        goto out;
}

/* __dev_flush is called from xdp_do_flush() which _must_ be signaled from the
 * driver before returning from its napi->poll() routine. The poll() routine is
 * called either from busy_poll context or net_rx_action signaled from
 * NET_RX_SOFTIRQ. Either way the poll routine must complete before the net
 * device can be torn down. On devmap tear down we ensure the flush list is
 * empty before completing to ensure all flush operations have completed.
 * When drivers update the bpf program they may need to ensure any flush ops
 * are also complete. Using synchronize_rcu or call_rcu will suffice for this
 * because both wait for napi context to exit.
 */
void __dev_flush(void)
{
        struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
        struct xdp_dev_bulk_queue *bq, *tmp;

        list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
                bq_xmit_all(bq, XDP_XMIT_FLUSH);
}
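
/* Driver-side usage sketch (illustrative; the function and helper names are
 * made up): a NAPI poll routine that redirected frames into a devmap is
 * expected to kick the per-cpu flush list before returning, roughly:
 *
 *	static int example_napi_poll(struct napi_struct *napi, int budget)
 *	{
 *		int work = example_rx_loop(napi, budget);  // may call xdp_do_redirect()
 *
 *		xdp_do_flush();		// drains dev_flush_list via __dev_flush()
 *		if (work < budget)
 *			napi_complete_done(napi, work);
 *		return work;
 *	}
 */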

/* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or
 * update happens in parallel here a dev_put won't happen until after reading
 * the ifindex.
 */
struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
{
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        struct bpf_dtab_netdev *obj;

        if (key >= map->max_entries)
                return NULL;

        obj = READ_ONCE(dtab->netdev_map[key]);
        return obj;
}
|
|
|
|
|
2018-05-24 22:45:51 +08:00
|
|
|
/* Runs under RCU-read-side, plus in softirq under NAPI protection.
|
|
|
|
* Thus, safe percpu variable access.
|
|
|
|
*/
|
2020-01-16 23:14:44 +08:00
|
|
|
static int bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
|
2018-05-24 22:45:57 +08:00
|
|
|
struct net_device *dev_rx)
|
2018-05-24 22:45:51 +08:00
|
|
|
{
|
xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().
Since this leaves less reason to keep the non-map redirect code in a
separate function, we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.
With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).
Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 23:14:45 +08:00
|
|
|
struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
|
2020-01-16 23:14:44 +08:00
|
|
|
struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
|
2018-05-24 22:45:51 +08:00
|
|
|
|
|
|
|
if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
|
2019-12-19 14:09:59 +08:00
|
|
|
bq_xmit_all(bq, 0);
|
2018-05-24 22:45:51 +08:00
|
|
|
|
2018-05-24 22:45:57 +08:00
|
|
|
/* Ingress dev_rx will be the same for all xdp_frames in the
|
|
|
|
* bulk_queue, because bq is stored per-CPU and must be flushed
|
|
|
|
* at the end of the net_device driver's NAPI function.
|
|
|
|
*/
|
|
|
|
if (!bq->dev_rx)
|
|
|
|
bq->dev_rx = dev_rx;
|
|
|
|
|
2018-05-24 22:45:51 +08:00
|
|
|
bq->q[bq->count++] = xdpf;
|
2019-06-28 17:12:34 +08:00
|
|
|
|
|
|
|
if (!bq->flush_node.prev)
|
|
|
|
list_add(&bq->flush_node, flush_list);
|
|
|
|
|
2018-05-24 22:45:51 +08:00
|
|
|
return 0;
|
|
|
|
}
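Two details of bq_enqueue() above are easy to miss: the per-CPU array is flushed as soon as it would overflow, and a bulk queue is linked onto the per-CPU flush list at most once, detected by the fact that an unlinked list node has flush_node.prev == NULL. The following userspace sketch models the same bookkeeping with invented names (bulkq, flushq, xmit_all, enqueue, flush_all); it illustrates the pattern and is not kernel code.

/* --- illustrative userspace sketch, not part of devmap.c --- */
#include <stdbool.h>
#include <stdio.h>

#define BULK_SIZE 16			/* stands in for DEV_MAP_BULK_SIZE */

struct bulkq {
	void *frames[BULK_SIZE];
	unsigned int count;
	struct bulkq *next_flush;
	bool on_flush_list;		/* kernel: bq->flush_node.prev != NULL */
};

static struct bulkq *flushq;		/* models the per-CPU dev_flush_list */

/* Stand-in for bq_xmit_all(): hand the whole batch to the driver at once. */
static void xmit_all(struct bulkq *bq)
{
	printf("xmit %u frames\n", bq->count);
	bq->count = 0;
}

/* Stand-in for bq_enqueue(): buffer the frame, flush on overflow, and link
 * the queue onto the flush list exactly once. */
static void enqueue(struct bulkq *bq, void *frame)
{
	if (bq->count == BULK_SIZE)
		xmit_all(bq);

	bq->frames[bq->count++] = frame;

	if (!bq->on_flush_list) {
		bq->next_flush = flushq;
		flushq = bq;
		bq->on_flush_list = true;
	}
}

/* Stand-in for the per-CPU flush routine run at the end of the NAPI poll. */
static void flush_all(void)
{
	for (struct bulkq *bq = flushq; bq; bq = bq->next_flush) {
		xmit_all(bq);
		bq->on_flush_list = false;
	}
	flushq = NULL;
}

int main(void)
{
	static struct bulkq bq;
	int frame;

	for (int i = 0; i < 20; i++)
		enqueue(&bq, &frame);	/* flushes once when the array fills */
	flush_all();			/* drains the remaining four frames */
	return 0;
}
/* --- end of sketch --- */

Compiled with a C99 compiler, this prints one 16-frame transmit from the overflow path and one 4-frame transmit from the final flush.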
|
|
|
|
|
2020-01-16 23:14:45 +08:00
|
|
|
static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
|
|
|
|
struct net_device *dev_rx)
|
2018-05-24 22:45:46 +08:00
|
|
|
{
|
|
|
|
struct xdp_frame *xdpf;
|
2018-07-06 10:49:00 +08:00
|
|
|
int err;
|
2018-05-24 22:45:46 +08:00
|
|
|
|
|
|
|
if (!dev->netdev_ops->ndo_xdp_xmit)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2018-07-06 10:49:00 +08:00
|
|
|
err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
|
|
|
|
if (unlikely(err))
|
|
|
|
return err;
|
|
|
|
|
2018-05-24 22:45:46 +08:00
|
|
|
xdpf = convert_to_xdp_frame(xdp);
|
|
|
|
if (unlikely(!xdpf))
|
|
|
|
return -EOVERFLOW;
|
|
|
|
|
2020-01-16 23:14:44 +08:00
|
|
|
return bq_enqueue(dev, xdpf, dev_rx);
|
2017-07-18 00:28:56 +08:00
|
|
|
}
|
|
|
|
|
2020-05-30 06:07:13 +08:00
|
|
|
static struct xdp_buff *dev_map_run_prog(struct net_device *dev,
|
|
|
|
struct xdp_buff *xdp,
|
|
|
|
struct bpf_prog *xdp_prog)
|
|
|
|
{
|
2020-05-30 06:07:14 +08:00
|
|
|
struct xdp_txq_info txq = { .dev = dev };
|
2020-05-30 06:07:13 +08:00
|
|
|
u32 act;
|
|
|
|
|
2020-05-30 06:07:14 +08:00
|
|
|
xdp->txq = &txq;
|
|
|
|
|
2020-05-30 06:07:13 +08:00
|
|
|
act = bpf_prog_run_xdp(xdp_prog, xdp);
|
|
|
|
switch (act) {
|
|
|
|
case XDP_PASS:
|
|
|
|
return xdp;
|
|
|
|
case XDP_DROP:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
bpf_warn_invalid_xdp_action(act);
|
|
|
|
fallthrough;
|
|
|
|
case XDP_ABORTED:
|
|
|
|
trace_xdp_exception(dev, xdp_prog, act);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
xdp_return_buff(xdp);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-01-16 23:14:45 +08:00
|
|
|
int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
|
|
|
|
struct net_device *dev_rx)
|
|
|
|
{
|
|
|
|
return __xdp_enqueue(dev, xdp, dev_rx);
|
|
|
|
}
|
|
|
|
|
|
|
|
int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
|
|
|
|
struct net_device *dev_rx)
|
|
|
|
{
|
|
|
|
struct net_device *dev = dst->dev;
|
|
|
|
|
2020-05-30 06:07:13 +08:00
|
|
|
if (dst->xdp_prog) {
|
|
|
|
xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);
|
|
|
|
if (!xdp)
|
|
|
|
return 0;
|
|
|
|
}
|
2020-01-16 23:14:45 +08:00
|
|
|
return __xdp_enqueue(dev, xdp, dev_rx);
|
|
|
|
}
|
|
|
|
|
2018-06-14 10:07:42 +08:00
|
|
|
int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
|
|
|
|
struct bpf_prog *xdp_prog)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
2018-07-06 10:49:00 +08:00
|
|
|
err = xdp_ok_fwd_dev(dst->dev, skb->len);
|
2018-06-14 10:07:42 +08:00
|
|
|
if (unlikely(err))
|
|
|
|
return err;
|
|
|
|
skb->dev = dst->dev;
|
|
|
|
generic_xdp_tx(skb, xdp_prog);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-08-23 07:47:54 +08:00
|
|
|
static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
2018-05-24 22:45:46 +08:00
|
|
|
struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
|
2017-08-23 07:47:54 +08:00
|
|
|
|
2020-05-30 06:07:12 +08:00
|
|
|
return obj ? &obj->val : NULL;
|
2017-08-23 07:47:54 +08:00
|
|
|
}
|
|
|
|
|
2019-07-27 00:06:55 +08:00
|
|
|
static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct bpf_dtab_netdev *obj = __dev_map_hash_lookup_elem(map,
|
|
|
|
*(u32 *)key);
|
2020-05-30 06:07:12 +08:00
|
|
|
return obj ? &obj->val : NULL;
|
2019-07-27 00:06:55 +08:00
|
|
|
}
|
|
|
|
|
2017-07-18 00:28:56 +08:00
|
|
|
static void __dev_map_entry_free(struct rcu_head *rcu)
|
|
|
|
{
|
2017-08-23 07:47:54 +08:00
|
|
|
struct bpf_dtab_netdev *dev;
|
2017-07-18 00:28:56 +08:00
|
|
|
|
2017-08-23 07:47:54 +08:00
|
|
|
dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
|
2020-05-30 06:07:13 +08:00
|
|
|
if (dev->xdp_prog)
|
|
|
|
bpf_prog_put(dev->xdp_prog);
|
2017-08-23 07:47:54 +08:00
|
|
|
dev_put(dev->dev);
|
|
|
|
kfree(dev);
|
2017-07-18 00:28:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int dev_map_delete_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
|
|
|
|
struct bpf_dtab_netdev *old_dev;
|
|
|
|
int k = *(u32 *)key;
|
|
|
|
|
|
|
|
if (k >= map->max_entries)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2017-08-23 07:47:54 +08:00
|
|
|
/* Use call_rcu() here to ensure any rcu critical sections have
|
2020-01-27 08:14:00 +08:00
|
|
|
* completed, as well as any flush operations, because call_rcu()
|
|
|
|
* will wait for the preempt-disable region (NAPI, in this
|
|
|
|
* context) to complete. Additionally, driver teardown ensures all
|
|
|
|
* soft irqs are complete before the net device is removed once
|
|
|
|
* dev_put() drops the reference count to zero.
|
2017-07-18 00:28:56 +08:00
|
|
|
*/
|
|
|
|
old_dev = xchg(&dtab->netdev_map[k], NULL);
|
|
|
|
if (old_dev)
|
|
|
|
call_rcu(&old_dev->rcu, __dev_map_entry_free);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-07-27 00:06:55 +08:00
|
|
|
static int dev_map_hash_delete_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
|
|
|
|
struct bpf_dtab_netdev *old_dev;
|
|
|
|
int k = *(u32 *)key;
|
|
|
|
unsigned long flags;
|
|
|
|
int ret = -ENOENT;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&dtab->index_lock, flags);
|
|
|
|
|
|
|
|
old_dev = __dev_map_hash_lookup_elem(map, k);
|
|
|
|
if (old_dev) {
|
|
|
|
dtab->items--;
|
|
|
|
hlist_del_init_rcu(&old_dev->index_hlist);
|
|
|
|
call_rcu(&old_dev->rcu, __dev_map_entry_free);
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&dtab->index_lock, flags);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-07-27 00:06:53 +08:00
|
|
|
static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net,
|
|
|
|
struct bpf_dtab *dtab,
|
2020-05-30 06:07:12 +08:00
|
|
|
struct bpf_devmap_val *val,
|
2019-07-27 00:06:53 +08:00
|
|
|
unsigned int idx)
|
2017-07-18 00:28:56 +08:00
|
|
|
{
|
2020-05-30 06:07:13 +08:00
|
|
|
struct bpf_prog *prog = NULL;
|
2019-07-27 00:06:53 +08:00
|
|
|
struct bpf_dtab_netdev *dev;
|
|
|
|
|
2020-01-16 23:14:44 +08:00
|
|
|
dev = kmalloc_node(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN,
|
|
|
|
dtab->map.numa_node);
|
2019-07-27 00:06:53 +08:00
|
|
|
if (!dev)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
2020-05-30 06:07:12 +08:00
|
|
|
dev->dev = dev_get_by_index(net, val->ifindex);
|
|
|
|
if (!dev->dev)
|
|
|
|
goto err_out;
|
2019-07-27 00:06:53 +08:00
|
|
|
|
2020-05-30 06:07:13 +08:00
|
|
|
if (val->bpf_prog.fd >= 0) {
|
|
|
|
prog = bpf_prog_get_type_dev(val->bpf_prog.fd,
|
|
|
|
BPF_PROG_TYPE_XDP, false);
|
|
|
|
if (IS_ERR(prog))
|
|
|
|
goto err_put_dev;
|
|
|
|
if (prog->expected_attach_type != BPF_XDP_DEVMAP)
|
|
|
|
goto err_put_prog;
|
|
|
|
}
|
|
|
|
|
2019-07-27 00:06:53 +08:00
|
|
|
dev->idx = idx;
|
|
|
|
dev->dtab = dtab;
|
2020-05-30 06:07:13 +08:00
|
|
|
if (prog) {
|
|
|
|
dev->xdp_prog = prog;
|
|
|
|
dev->val.bpf_prog.id = prog->aux->id;
|
|
|
|
} else {
|
|
|
|
dev->xdp_prog = NULL;
|
|
|
|
dev->val.bpf_prog.id = 0;
|
|
|
|
}
|
2020-05-30 06:07:12 +08:00
|
|
|
dev->val.ifindex = val->ifindex;
|
2019-07-27 00:06:53 +08:00
|
|
|
|
|
|
|
return dev;
|
2020-05-30 06:07:13 +08:00
|
|
|
err_put_prog:
|
|
|
|
bpf_prog_put(prog);
|
|
|
|
err_put_dev:
|
|
|
|
dev_put(dev->dev);
|
2020-05-30 06:07:12 +08:00
|
|
|
err_out:
|
|
|
|
kfree(dev);
|
|
|
|
return ERR_PTR(-EINVAL);
|
2019-07-27 00:06:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int __dev_map_update_elem(struct net *net, struct bpf_map *map,
|
|
|
|
void *key, void *value, u64 map_flags)
|
|
|
|
{
|
|
|
|
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
|
2020-05-30 06:07:13 +08:00
|
|
|
struct bpf_devmap_val val = { .bpf_prog.fd = -1 };
|
2017-07-18 00:28:56 +08:00
|
|
|
struct bpf_dtab_netdev *dev, *old_dev;
|
2019-06-28 17:12:34 +08:00
|
|
|
u32 i = *(u32 *)key;
|
2017-07-18 00:28:56 +08:00
|
|
|
|
|
|
|
if (unlikely(map_flags > BPF_EXIST))
|
|
|
|
return -EINVAL;
|
|
|
|
if (unlikely(i >= dtab->map.max_entries))
|
|
|
|
return -E2BIG;
|
|
|
|
if (unlikely(map_flags == BPF_NOEXIST))
|
|
|
|
return -EEXIST;
|
|
|
|
|
2020-05-30 06:07:12 +08:00
|
|
|
/* already verified value_size <= sizeof val */
|
|
|
|
memcpy(&val, value, map->value_size);
|
|
|
|
|
|
|
|
if (!val.ifindex) {
|
2017-07-18 00:28:56 +08:00
|
|
|
dev = NULL;
|
2020-05-30 06:07:13 +08:00
|
|
|
/* cannot specify fd if ifindex is 0 */
|
|
|
|
if (val.bpf_prog.fd != -1)
|
|
|
|
return -EINVAL;
|
2017-07-18 00:28:56 +08:00
|
|
|
} else {
|
2020-05-30 06:07:12 +08:00
|
|
|
dev = __dev_map_alloc_node(net, dtab, &val, i);
|
2019-07-27 00:06:53 +08:00
|
|
|
if (IS_ERR(dev))
|
|
|
|
return PTR_ERR(dev);
|
2017-07-18 00:28:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Use call_rcu() here to ensure rcu critical sections have completed,
|
|
|
|
* remembering that the driver-side flush operation will happen before
|
|
|
|
* the net device is removed.
|
|
|
|
*/
|
|
|
|
old_dev = xchg(&dtab->netdev_map[i], dev);
|
|
|
|
if (old_dev)
|
|
|
|
call_rcu(&old_dev->rcu, __dev_map_entry_free);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-07-27 00:06:53 +08:00
|
|
|
static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
|
|
|
|
u64 map_flags)
|
|
|
|
{
|
|
|
|
return __dev_map_update_elem(current->nsproxy->net_ns,
|
|
|
|
map, key, value, map_flags);
|
|
|
|
}
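From user space these update paths are reached through the bpf(2) syscall. With libbpf, the value written into a devmap is either a bare ifindex or a struct bpf_devmap_val that additionally carries a BPF_XDP_DEVMAP program fd, which is what the bpf_prog.fd handling above consumes. A hedged sketch, assuming libbpf is installed, that the uapi header is new enough to define struct bpf_devmap_val, that map_fd refers to a devmap created with value_size == sizeof(struct bpf_devmap_val), and that prog_fd/ifindex were obtained elsewhere:

/* --- illustrative userspace sketch, not part of devmap.c --- */
#include <linux/bpf.h>
#include <bpf/bpf.h>

static int devmap_add_entry(int map_fd, __u32 key, __u32 ifindex, int prog_fd)
{
	struct bpf_devmap_val val = {
		.ifindex = ifindex,
		/* a negative fd (e.g. -1) means "no second program on this entry" */
		.bpf_prog.fd = prog_fd,
	};

	return bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
}
/* --- end of sketch --- */

If the map was instead created with a 4-byte value_size, only the ifindex may be written and no program can be attached to the entry, matching the default .bpf_prog.fd = -1 in the update paths above.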
|
|
|
|
|
2019-07-27 00:06:55 +08:00
|
|
|
static int __dev_map_hash_update_elem(struct net *net, struct bpf_map *map,
|
|
|
|
void *key, void *value, u64 map_flags)
|
|
|
|
{
|
|
|
|
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
|
2020-05-30 06:07:13 +08:00
|
|
|
struct bpf_devmap_val val = { .bpf_prog.fd = -1 };
|
2019-07-27 00:06:55 +08:00
|
|
|
struct bpf_dtab_netdev *dev, *old_dev;
|
|
|
|
u32 idx = *(u32 *)key;
|
|
|
|
unsigned long flags;
|
2019-09-08 16:20:16 +08:00
|
|
|
int err = -EEXIST;
|
2019-07-27 00:06:55 +08:00
|
|
|
|
2020-05-30 06:07:12 +08:00
|
|
|
/* already verified value_size <= sizeof val */
|
|
|
|
memcpy(&val, value, map->value_size);
|
|
|
|
|
|
|
|
if (unlikely(map_flags > BPF_EXIST || !val.ifindex))
|
2019-07-27 00:06:55 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-09-08 16:20:16 +08:00
|
|
|
spin_lock_irqsave(&dtab->index_lock, flags);
|
|
|
|
|
2019-07-27 00:06:55 +08:00
|
|
|
old_dev = __dev_map_hash_lookup_elem(map, idx);
|
|
|
|
if (old_dev && (map_flags & BPF_NOEXIST))
|
2019-09-08 16:20:16 +08:00
|
|
|
goto out_err;
|
2019-07-27 00:06:55 +08:00
|
|
|
|
2020-05-30 06:07:12 +08:00
|
|
|
dev = __dev_map_alloc_node(net, dtab, &val, idx);
|
2019-09-08 16:20:16 +08:00
|
|
|
if (IS_ERR(dev)) {
|
|
|
|
err = PTR_ERR(dev);
|
|
|
|
goto out_err;
|
|
|
|
}
|
2019-07-27 00:06:55 +08:00
|
|
|
|
|
|
|
if (old_dev) {
|
|
|
|
hlist_del_rcu(&old_dev->index_hlist);
|
|
|
|
} else {
|
|
|
|
if (dtab->items >= dtab->map.max_entries) {
|
|
|
|
spin_unlock_irqrestore(&dtab->index_lock, flags);
|
|
|
|
call_rcu(&dev->rcu, __dev_map_entry_free);
|
|
|
|
return -E2BIG;
|
|
|
|
}
|
|
|
|
dtab->items++;
|
|
|
|
}
|
|
|
|
|
|
|
|
hlist_add_head_rcu(&dev->index_hlist,
|
|
|
|
dev_map_index_hash(dtab, idx));
|
|
|
|
spin_unlock_irqrestore(&dtab->index_lock, flags);
|
|
|
|
|
|
|
|
if (old_dev)
|
|
|
|
call_rcu(&old_dev->rcu, __dev_map_entry_free);
|
|
|
|
|
|
|
|
return 0;
|
2019-09-08 16:20:16 +08:00
|
|
|
|
|
|
|
out_err:
|
|
|
|
spin_unlock_irqrestore(&dtab->index_lock, flags);
|
|
|
|
return err;
|
2019-07-27 00:06:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value,
|
|
|
|
u64 map_flags)
|
|
|
|
{
|
|
|
|
return __dev_map_hash_update_elem(current->nsproxy->net_ns,
|
|
|
|
map, key, value, map_flags);
|
|
|
|
}
|
|
|
|
|
2017-07-18 00:28:56 +08:00
|
|
|
const struct bpf_map_ops dev_map_ops = {
|
|
|
|
.map_alloc = dev_map_alloc,
|
|
|
|
.map_free = dev_map_free,
|
|
|
|
.map_get_next_key = dev_map_get_next_key,
|
|
|
|
.map_lookup_elem = dev_map_lookup_elem,
|
|
|
|
.map_update_elem = dev_map_update_elem,
|
|
|
|
.map_delete_elem = dev_map_delete_elem,
|
2018-08-12 07:59:17 +08:00
|
|
|
.map_check_btf = map_check_no_btf,
|
2017-07-18 00:28:56 +08:00
|
|
|
};
|
2017-07-18 00:30:02 +08:00
|
|
|
|
2019-07-27 00:06:55 +08:00
|
|
|
const struct bpf_map_ops dev_map_hash_ops = {
|
|
|
|
.map_alloc = dev_map_alloc,
|
|
|
|
.map_free = dev_map_free,
|
|
|
|
.map_get_next_key = dev_map_hash_get_next_key,
|
|
|
|
.map_lookup_elem = dev_map_hash_lookup_elem,
|
|
|
|
.map_update_elem = dev_map_hash_update_elem,
|
|
|
|
.map_delete_elem = dev_map_hash_delete_elem,
|
|
|
|
.map_check_btf = map_check_no_btf,
|
|
|
|
};
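The consumer of both ops tables above is an XDP program calling bpf_redirect_map() on a DEVMAP or DEVMAP_HASH. A minimal BPF-side sketch using libbpf's BTF-style map definition; the map name tx_port, its size, and the fixed key 0 are arbitrary choices for the example:

/* --- illustrative BPF program sketch, not part of devmap.c --- */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);	/* or BPF_MAP_TYPE_DEVMAP_HASH */
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);			/* bare ifindex value layout */
} tx_port SEC(".maps");

SEC("xdp")
int xdp_redirect_devmap(struct xdp_md *ctx)
{
	/* Flags 0: a missed lookup falls back to XDP_ABORTED; recent kernels
	 * also accept an XDP action in the low bits of the flags argument. */
	return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";
/* --- end of sketch --- */

bpf_redirect_map() returns XDP_REDIRECT on a successful lookup; the actual transmit then happens through the bq_enqueue()/flush path shown earlier.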
|
|
|
|
|
2019-10-19 19:19:31 +08:00
|
|
|
static void dev_map_hash_remove_netdev(struct bpf_dtab *dtab,
|
|
|
|
struct net_device *netdev)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
u32 i;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&dtab->index_lock, flags);
|
|
|
|
for (i = 0; i < dtab->n_buckets; i++) {
|
|
|
|
struct bpf_dtab_netdev *dev;
|
|
|
|
struct hlist_head *head;
|
|
|
|
struct hlist_node *next;
|
|
|
|
|
|
|
|
head = dev_map_index_hash(dtab, i);
|
|
|
|
|
|
|
|
hlist_for_each_entry_safe(dev, next, head, index_hlist) {
|
|
|
|
if (netdev != dev->dev)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
dtab->items--;
|
|
|
|
hlist_del_rcu(&dev->index_hlist);
|
|
|
|
call_rcu(&dev->rcu, __dev_map_entry_free);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&dtab->index_lock, flags);
|
|
|
|
}
|
|
|
|
|
2017-07-18 00:30:02 +08:00
|
|
|
static int dev_map_notification(struct notifier_block *notifier,
|
|
|
|
ulong event, void *ptr)
|
|
|
|
{
|
|
|
|
struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
|
|
|
|
struct bpf_dtab *dtab;
|
2020-01-16 23:14:44 +08:00
|
|
|
int i, cpu;
|
2017-07-18 00:30:02 +08:00
|
|
|
|
|
|
|
switch (event) {
|
2020-01-16 23:14:44 +08:00
|
|
|
case NETDEV_REGISTER:
|
|
|
|
if (!netdev->netdev_ops->ndo_xdp_xmit || netdev->xdp_bulkq)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* will be freed in free_netdev() */
|
|
|
|
netdev->xdp_bulkq =
|
|
|
|
__alloc_percpu_gfp(sizeof(struct xdp_dev_bulk_queue),
|
|
|
|
sizeof(void *), GFP_ATOMIC);
|
|
|
|
if (!netdev->xdp_bulkq)
|
|
|
|
return NOTIFY_BAD;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
per_cpu_ptr(netdev->xdp_bulkq, cpu)->dev = netdev;
|
|
|
|
break;
|
2017-07-18 00:30:02 +08:00
|
|
|
case NETDEV_UNREGISTER:
|
bpf: devmap fix mutex in rcu critical section
Originally we used a mutex to protect concurrent devmap update
and delete operations from racing with netdev unregister notifier
callbacks.
The notifier hook is needed because we increment the netdev ref
count when a dev is added to the devmap. This ensures the netdev
reference is valid in the datapath. However, we don't want to block
unregister events, hence the initial mutex and notifier handler.
The concern was in the notifier hook we search the map for dev
entries that hold a refcnt on the net device being torn down. But,
in order to do this we require two steps,
(i) dereference the netdev: dev = rcu_dereference(map[i])
(ii) test ifindex: dev->ifindex == removing_ifindex
and then finally we can swap in the NULL dev in the map via an
xchg operation,
xchg(map[i], NULL)
The danger here is a concurrent update could run a different
xchg op concurrently leading us to replace the new dev with a
NULL dev incorrectly.
CPU 1 CPU 2
notifier hook bpf devmap update
dev = rcu_dereference(map[i])
dev = rcu_dereference(map[i])
xchg(map[i], new_dev);
call_rcu(dev, ...)
xchg(map[i], NULL)
The above flow would create the incorrect state with the dev
reference in the update path being lost. To resolve this the
original code used a mutex around the above block. However,
updates, deletes, and lookups occur inside rcu critical sections
so we can't use a mutex in this context safely.
Fortunately, by writing slightly better code we can avoid the
mutex altogether. If CPU 1 in the above example uses a cmpxchg
and _only_ replaces the dev reference in the map when it is in
fact the expected dev the race is removed completely. The two
cases being illustrated here, first the race condition,
CPU 1 CPU 2
notifier hook bpf devmap update
dev = rcu_dereference(map[i])
dev = rcu_dereference(map[i])
xchg(map[i], new_dev);
call_rcu(dev, ...)
odev = cmpxchg(map[i], dev, NULL)
Now we can test the cmpxchg return value, detect odev != dev and
abort. Or in the good case,
CPU 1 CPU 2
notifier hook bpf devmap update
dev = rcu_dereference(map[i])
odev = cmpxchg(map[i], dev, NULL)
[...]
Now 'odev == dev' and we can do proper cleanup.
And voila, the original race we tried to solve with a mutex is
corrected and the trace noted by Sasha below is resolved due
to removal of the mutex.
Note: When walking the devmap and removing dev references as needed
we depend on the core to fail any calls to dev_get_by_index() using
the ifindex of the device being removed. This way we do not race with
the user while searching the devmap.
Additionally, the mutex was also protecting list add/del/read on
the list of maps in-use. This patch converts this to an RCU list
and spinlock implementation. This protects the list from concurrent
alloc/free operations. The notifier hook walks this list so it uses
RCU read semantics.
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
1 lock held by syz-executor1/16315:
#0: (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
#0: (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
#0: (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388
Fixes: 2ddf71e23cc2 ("net: add notifier hooks for devmap bpf map")
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-05 13:02:19 +08:00
|
|
|
/* This rcu_read_lock/unlock pair is needed because
|
|
|
|
* dev_map_list is an RCU list, and to ensure that a delete
|
|
|
|
* operation does not free a netdev_map entry while we
|
|
|
|
* are comparing it against the netdev being unregistered.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(dtab, &dev_map_list, list) {
|
2019-10-19 19:19:31 +08:00
|
|
|
if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
|
|
|
|
dev_map_hash_remove_netdev(dtab, netdev);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-07-18 00:30:02 +08:00
|
|
|
for (i = 0; i < dtab->map.max_entries; i++) {
|
2017-08-05 13:02:19 +08:00
|
|
|
struct bpf_dtab_netdev *dev, *odev;
|
2017-07-18 00:30:02 +08:00
|
|
|
|
2017-08-05 13:02:19 +08:00
|
|
|
dev = READ_ONCE(dtab->netdev_map[i]);
|
2018-10-24 19:15:17 +08:00
|
|
|
if (!dev || netdev != dev->dev)
|
2017-07-18 00:30:02 +08:00
|
|
|
continue;
|
2017-08-05 13:02:19 +08:00
|
|
|
odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
|
|
|
|
if (dev == odev)
|
2017-07-18 00:30:02 +08:00
|
|
|
call_rcu(&dev->rcu,
|
|
|
|
__dev_map_entry_free);
|
|
|
|
}
|
|
|
|
}
|
2017-08-05 13:02:19 +08:00
|
|
|
rcu_read_unlock();
|
2017-07-18 00:30:02 +08:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return NOTIFY_OK;
|
|
|
|
}
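The commit message quoted above under NETDEV_UNREGISTER explains why the notifier clears a slot with cmpxchg() rather than xchg(): the entry is removed only if the slot still holds the device the notifier looked up, so a concurrent update that already installed a new entry is left untouched. A compact userspace model of that compare-and-swap guard, using C11 atomics and invented names (entry, slot, remove_if_unchanged, replace):

/* --- illustrative userspace sketch, not part of devmap.c --- */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
	int ifindex;
};

static _Atomic(struct entry *) slot;	/* one slot of the modelled netdev_map */

/* Notifier side: clear the slot only if it still holds 'expected'.
 * kernel: odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
 *         if (dev == odev) call_rcu(&dev->rcu, __dev_map_entry_free); */
static bool remove_if_unchanged(struct entry *expected)
{
	struct entry *old = expected;

	return atomic_compare_exchange_strong(&slot, &old, NULL);
}

/* Update side: unconditional replace, like the xchg() in
 * __dev_map_update_elem(); the old entry is returned for deferred freeing. */
static struct entry *replace(struct entry *new_entry)
{
	return atomic_exchange(&slot, new_entry);
}

int main(void)
{
	static struct entry a = { .ifindex = 1 }, b = { .ifindex = 2 };

	atomic_store(&slot, &a);
	replace(&b);				/* concurrent update wins	*/
	return remove_if_unchanged(&a) ? 1 : 0;	/* CAS fails, b is preserved	*/
}
/* --- end of sketch --- */

With that guard in place, the only remaining cleanup is the call_rcu() deferral, exactly as in the update and delete paths above.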
|
|
|
|
|
|
|
|
static struct notifier_block dev_map_notifier = {
|
|
|
|
.notifier_call = dev_map_notification,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init dev_map_init(void)
|
|
|
|
{
|
2019-12-19 14:10:03 +08:00
|
|
|
int cpu;
|
|
|
|
|
2018-05-24 22:45:46 +08:00
|
|
|
/* Ensure tracepoint shadow struct _bpf_dtab_netdev is in sync */
|
|
|
|
BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
|
|
|
|
offsetof(struct _bpf_dtab_netdev, dev));
|
2017-07-18 00:30:02 +08:00
|
|
|
register_netdevice_notifier(&dev_map_notifier);
|
2019-12-19 14:10:03 +08:00
|
|
|
|
|
|
|
for_each_possible_cpu(cpu)
|
2020-01-16 23:14:45 +08:00
|
|
|
INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
|
2017-07-18 00:30:02 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
subsys_initcall(dev_map_init);
|