Merge branch 'Add support for transmitting packets using XDP in bpf_prog_run()'

Toke Høiland-Jørgensen says:

====================

This series adds support for transmitting packets using XDP in
bpf_prog_run(), by enabling a new mode "live packet" mode which will handle
the XDP program return codes and redirect the packets to the stack or other
devices.

The primary use case for this is testing the redirect map types and the
ndo_xdp_xmit driver operation without an external traffic generator. But it
turns out to also be useful for creating a programmable traffic generator
in XDP, as well as injecting frames into the stack. A sample traffic
generator, which was included in previous versions of the series, but now
moved to xdp-tools, transmits up to 9 Mpps/core on my test machine.

To transmit the frames, the new mode instantiates a page_pool structure in
bpf_prog_run() and initialises the pages to contain XDP frames with the
data passed in by userspace. These frames can then be handled as though
they came from the hardware XDP path, and the existing page_pool code takes
care of returning and recycling them. The setup is optimised for high
performance with a high number of repetitions to support stress testing and
the traffic generator use case; see patch 1 for details.

v11:
- Fix override of return code in xdp_test_run_batch()
- Add Martin's ACKs to remaining patches

v10:
- Only propagate memory allocation errors from xdp_test_run_batch()
- Get rid of BPF_F_TEST_XDP_RESERVED; batch_size can be used to probe
- Check that batch_size is unset in non-XDP test_run funcs
- Lower the number of repetitions in the selftest to 10k
- Count number of recycled pages in the selftest
- Fix a few other nits from Martin, carry forward ACKs

v9:
- XDP_DROP packets in the selftest to ensure pages are recycled
- Fix a few issues reported by the kernel test robot
- Rewrite the documentation of the batch size to make it a bit clearer
- Rebase to newest bpf-next

v8:
- Make the batch size configurable from userspace
- Don't interrupt the packet loop on errors in do_redirect (this can be
  caught from the tracepoint)
- Add documentation of the feature
- Add reserved flag userspace can use to probe for support (kernel didn't
  check flags previously)
- Rebase to newest bpf-next, disallow live mode for jumbo frames

v7:
- Extend the local_bh_disable() to cover the full test run loop, to prevent
  running concurrently with the softirq. Fixes a deadlock with veth xmit.
- Reinstate the forwarding sysctl setting in the selftest, and bump up the
  number of packets being transmitted to trigger the above bug.
- Update commit message to make it clear that user space can select the
  ingress interface.

v6:
- Fix meta vs data pointer setting and add a selftest for it
- Add local_bh_disable() around code passing packets up the stack
- Create a new netns for the selftest and use a TC program instead of the
  forwarding hack to count packets being XDP_PASS'ed from the test prog.
- Check for the correct ingress ifindex in the selftest
- Rebase and drop patches 1-5 that were already merged

v5:
- Rebase to current bpf-next

v4:
- Fix a few code style issues (Alexei)
- Also handle the other return codes: XDP_PASS builds skbs and injects them
  into the stack, and XDP_TX is turned into a redirect out the same
  interface (Alexei).
- Drop the last patch adding an xdp_trafficgen program to samples/bpf; this
  will live in xdp-tools instead (Alexei).
- Add a separate bpf_test_run_xdp_live() function to test_run.c instead of
  entangling the new mode in the existing bpf_test_run().

v3:
- Reorder patches to make sure they all build individually (Patchwork)
- Remove a couple of unused variables (Patchwork)
- Remove unlikely() annotation in slow path and add back John's ACK that I
  accidentally dropped for v2 (John)

v2:
- Split up up __xdp_do_redirect to avoid passing two pointers to it (John)
- Always reset context pointers before each test run (John)
- Use get_mac_addr() from xdp_sample_user.h instead of rolling our own (Kumar)
- Fix wrong offset for metadata pointer
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit is contained in:
Alexei Starovoitov 2022-03-09 14:19:23 -08:00
commit de55c9a196
14 changed files with 821 additions and 105 deletions

View File

@ -0,0 +1,117 @@
.. SPDX-License-Identifier: GPL-2.0
===================================
Running BPF programs from userspace
===================================
This document describes the ``BPF_PROG_RUN`` facility for running BPF programs
from userspace.
.. contents::
:local:
:depth: 2
Overview
--------
The ``BPF_PROG_RUN`` command can be used through the ``bpf()`` syscall to
execute a BPF program in the kernel and return the results to userspace. This
can be used to unit test BPF programs against user-supplied context objects, and
as way to explicitly execute programs in the kernel for their side effects. The
command was previously named ``BPF_PROG_TEST_RUN``, and both constants continue
to be defined in the UAPI header, aliased to the same value.
The ``BPF_PROG_RUN`` command can be used to execute BPF programs of the
following types:
- ``BPF_PROG_TYPE_SOCKET_FILTER``
- ``BPF_PROG_TYPE_SCHED_CLS``
- ``BPF_PROG_TYPE_SCHED_ACT``
- ``BPF_PROG_TYPE_XDP``
- ``BPF_PROG_TYPE_SK_LOOKUP``
- ``BPF_PROG_TYPE_CGROUP_SKB``
- ``BPF_PROG_TYPE_LWT_IN``
- ``BPF_PROG_TYPE_LWT_OUT``
- ``BPF_PROG_TYPE_LWT_XMIT``
- ``BPF_PROG_TYPE_LWT_SEG6LOCAL``
- ``BPF_PROG_TYPE_FLOW_DISSECTOR``
- ``BPF_PROG_TYPE_STRUCT_OPS``
- ``BPF_PROG_TYPE_RAW_TRACEPOINT``
- ``BPF_PROG_TYPE_SYSCALL``
When using the ``BPF_PROG_RUN`` command, userspace supplies an input context
object and (for program types operating on network packets) a buffer containing
the packet data that the BPF program will operate on. The kernel will then
execute the program and return the results to userspace. Note that programs will
not have any side effects while being run in this mode; in particular, packets
will not actually be redirected or dropped, the program return code will just be
returned to userspace. A separate mode for live execution of XDP programs is
provided, documented separately below.
Running XDP programs in "live frame mode"
-----------------------------------------
The ``BPF_PROG_RUN`` command has a separate mode for running live XDP programs,
which can be used to execute XDP programs in a way where packets will actually
be processed by the kernel after the execution of the XDP program as if they
arrived on a physical interface. This mode is activated by setting the
``BPF_F_TEST_XDP_LIVE_FRAMES`` flag when supplying an XDP program to
``BPF_PROG_RUN``.
The live packet mode is optimised for high performance execution of the supplied
XDP program many times (suitable for, e.g., running as a traffic generator),
which means the semantics are not quite as straight-forward as the regular test
run mode. Specifically:
- When executing an XDP program in live frame mode, the result of the execution
will not be returned to userspace; instead, the kernel will perform the
operation indicated by the program's return code (drop the packet, redirect
it, etc). For this reason, setting the ``data_out`` or ``ctx_out`` attributes
in the syscall parameters when running in this mode will be rejected. In
addition, not all failures will be reported back to userspace directly;
specifically, only fatal errors in setup or during execution (like memory
allocation errors) will halt execution and return an error. If an error occurs
in packet processing, like a failure to redirect to a given interface,
execution will continue with the next repetition; these errors can be detected
via the same trace points as for regular XDP programs.
- Userspace can supply an ifindex as part of the context object, just like in
the regular (non-live) mode. The XDP program will be executed as though the
packet arrived on this interface; i.e., the ``ingress_ifindex`` of the context
object will point to that interface. Furthermore, if the XDP program returns
``XDP_PASS``, the packet will be injected into the kernel networking stack as
though it arrived on that ifindex, and if it returns ``XDP_TX``, the packet
will be transmitted *out* of that same interface. Do note, though, that
because the program execution is not happening in driver context, an
``XDP_TX`` is actually turned into the same action as an ``XDP_REDIRECT`` to
that same interface (i.e., it will only work if the driver has support for the
``ndo_xdp_xmit`` driver op).
- When running the program with multiple repetitions, the execution will happen
in batches. The batch size defaults to 64 packets (which is same as the
maximum NAPI receive batch size), but can be specified by userspace through
the ``batch_size`` parameter, up to a maximum of 256 packets. For each batch,
the kernel executes the XDP program repeatedly, each invocation getting a
separate copy of the packet data. For each repetition, if the program drops
the packet, the data page is immediately recycled (see below). Otherwise, the
packet is buffered until the end of the batch, at which point all packets
buffered this way during the batch are transmitted at once.
- When setting up the test run, the kernel will initialise a pool of memory
pages of the same size as the batch size. Each memory page will be initialised
with the initial packet data supplied by userspace at ``BPF_PROG_RUN``
invocation. When possible, the pages will be recycled on future program
invocations, to improve performance. Pages will generally be recycled a full
batch at a time, except when a packet is dropped (by return code or because
of, say, a redirection error), in which case that page will be recycled
immediately. If a packet ends up being passed to the regular networking stack
(because the XDP program returns ``XDP_PASS``, or because it ends up being
redirected to an interface that injects it into the stack), the page will be
released and a new one will be allocated when the pool is empty.
When recycling, the page content is not rewritten; only the packet boundary
pointers (``data``, ``data_end`` and ``data_meta``) in the context object will
be reset to the original values. This means that if a program rewrites the
packet contents, it has to be prepared to see either the original content or
the modified version on subsequent invocations.

View File

@ -21,6 +21,7 @@ that goes into great technical depth about the BPF Architecture.
helpers
programs
maps
bpf_prog_run
classic_vs_extended.rst
bpf_licensing
test_debug

View File

@ -1232,6 +1232,8 @@ enum {
/* If set, run the test on the cpu specified by bpf_attr.test.cpu */
#define BPF_F_TEST_RUN_ON_CPU (1U << 0)
/* If set, XDP frames will be transmitted after processing */
#define BPF_F_TEST_XDP_LIVE_FRAMES (1U << 1)
/* type for BPF_ENABLE_STATS */
enum bpf_stats_type {
@ -1393,6 +1395,7 @@ union bpf_attr {
__aligned_u64 ctx_out;
__u32 flags;
__u32 cpu;
__u32 batch_size;
} test;
struct { /* anonymous struct used by BPF_*_GET_*_ID */

View File

@ -30,6 +30,7 @@ config BPF_SYSCALL
select TASKS_TRACE_RCU
select BINARY_PRINTF
select NET_SOCK_MSG if NET
select PAGE_POOL if NET
default n
help
Enable the bpf() system call that allows to manipulate BPF programs

View File

@ -3336,7 +3336,7 @@ static int bpf_prog_query(const union bpf_attr *attr,
}
}
#define BPF_PROG_TEST_RUN_LAST_FIELD test.cpu
#define BPF_PROG_TEST_RUN_LAST_FIELD test.batch_size
static int bpf_prog_test_run(const union bpf_attr *attr,
union bpf_attr __user *uattr)

View File

@ -15,6 +15,7 @@
#include <net/sock.h>
#include <net/tcp.h>
#include <net/net_namespace.h>
#include <net/page_pool.h>
#include <linux/error-injection.h>
#include <linux/smp.h>
#include <linux/sock_diag.h>
@ -53,10 +54,11 @@ static void bpf_test_timer_leave(struct bpf_test_timer *t)
rcu_read_unlock();
}
static bool bpf_test_timer_continue(struct bpf_test_timer *t, u32 repeat, int *err, u32 *duration)
static bool bpf_test_timer_continue(struct bpf_test_timer *t, int iterations,
u32 repeat, int *err, u32 *duration)
__must_hold(rcu)
{
t->i++;
t->i += iterations;
if (t->i >= repeat) {
/* We're done. */
t->time_spent += ktime_get_ns() - t->time_start;
@ -88,6 +90,286 @@ reset:
return false;
}
/* We put this struct at the head of each page with a context and frame
* initialised when the page is allocated, so we don't have to do this on each
* repetition of the test run.
*/
struct xdp_page_head {
struct xdp_buff orig_ctx;
struct xdp_buff ctx;
struct xdp_frame frm;
u8 data[];
};
struct xdp_test_data {
struct xdp_buff *orig_ctx;
struct xdp_rxq_info rxq;
struct net_device *dev;
struct page_pool *pp;
struct xdp_frame **frames;
struct sk_buff **skbs;
u32 batch_size;
u32 frame_cnt;
};
#define TEST_XDP_FRAME_SIZE (PAGE_SIZE - sizeof(struct xdp_page_head) \
- sizeof(struct skb_shared_info))
#define TEST_XDP_MAX_BATCH 256
static void xdp_test_run_init_page(struct page *page, void *arg)
{
struct xdp_page_head *head = phys_to_virt(page_to_phys(page));
struct xdp_buff *new_ctx, *orig_ctx;
u32 headroom = XDP_PACKET_HEADROOM;
struct xdp_test_data *xdp = arg;
size_t frm_len, meta_len;
struct xdp_frame *frm;
void *data;
orig_ctx = xdp->orig_ctx;
frm_len = orig_ctx->data_end - orig_ctx->data_meta;
meta_len = orig_ctx->data - orig_ctx->data_meta;
headroom -= meta_len;
new_ctx = &head->ctx;
frm = &head->frm;
data = &head->data;
memcpy(data + headroom, orig_ctx->data_meta, frm_len);
xdp_init_buff(new_ctx, TEST_XDP_FRAME_SIZE, &xdp->rxq);
xdp_prepare_buff(new_ctx, data, headroom, frm_len, true);
new_ctx->data = new_ctx->data_meta + meta_len;
xdp_update_frame_from_buff(new_ctx, frm);
frm->mem = new_ctx->rxq->mem;
memcpy(&head->orig_ctx, new_ctx, sizeof(head->orig_ctx));
}
static int xdp_test_run_setup(struct xdp_test_data *xdp, struct xdp_buff *orig_ctx)
{
struct xdp_mem_info mem = {};
struct page_pool *pp;
int err = -ENOMEM;
struct page_pool_params pp_params = {
.order = 0,
.flags = 0,
.pool_size = xdp->batch_size,
.nid = NUMA_NO_NODE,
.max_len = TEST_XDP_FRAME_SIZE,
.init_callback = xdp_test_run_init_page,
.init_arg = xdp,
};
xdp->frames = kvmalloc_array(xdp->batch_size, sizeof(void *), GFP_KERNEL);
if (!xdp->frames)
return -ENOMEM;
xdp->skbs = kvmalloc_array(xdp->batch_size, sizeof(void *), GFP_KERNEL);
if (!xdp->skbs)
goto err_skbs;
pp = page_pool_create(&pp_params);
if (IS_ERR(pp)) {
err = PTR_ERR(pp);
goto err_pp;
}
/* will copy 'mem.id' into pp->xdp_mem_id */
err = xdp_reg_mem_model(&mem, MEM_TYPE_PAGE_POOL, pp);
if (err)
goto err_mmodel;
xdp->pp = pp;
/* We create a 'fake' RXQ referencing the original dev, but with an
* xdp_mem_info pointing to our page_pool
*/
xdp_rxq_info_reg(&xdp->rxq, orig_ctx->rxq->dev, 0, 0);
xdp->rxq.mem.type = MEM_TYPE_PAGE_POOL;
xdp->rxq.mem.id = pp->xdp_mem_id;
xdp->dev = orig_ctx->rxq->dev;
xdp->orig_ctx = orig_ctx;
return 0;
err_mmodel:
page_pool_destroy(pp);
err_pp:
kfree(xdp->skbs);
err_skbs:
kfree(xdp->frames);
return err;
}
static void xdp_test_run_teardown(struct xdp_test_data *xdp)
{
page_pool_destroy(xdp->pp);
kfree(xdp->frames);
kfree(xdp->skbs);
}
static bool ctx_was_changed(struct xdp_page_head *head)
{
return head->orig_ctx.data != head->ctx.data ||
head->orig_ctx.data_meta != head->ctx.data_meta ||
head->orig_ctx.data_end != head->ctx.data_end;
}
static void reset_ctx(struct xdp_page_head *head)
{
if (likely(!ctx_was_changed(head)))
return;
head->ctx.data = head->orig_ctx.data;
head->ctx.data_meta = head->orig_ctx.data_meta;
head->ctx.data_end = head->orig_ctx.data_end;
xdp_update_frame_from_buff(&head->ctx, &head->frm);
}
static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
struct sk_buff **skbs,
struct net_device *dev)
{
gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
int i, n;
LIST_HEAD(list);
n = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, nframes, (void **)skbs);
if (unlikely(n == 0)) {
for (i = 0; i < nframes; i++)
xdp_return_frame(frames[i]);
return -ENOMEM;
}
for (i = 0; i < nframes; i++) {
struct xdp_frame *xdpf = frames[i];
struct sk_buff *skb = skbs[i];
skb = __xdp_build_skb_from_frame(xdpf, skb, dev);
if (!skb) {
xdp_return_frame(xdpf);
continue;
}
list_add_tail(&skb->list, &list);
}
netif_receive_skb_list(&list);
return 0;
}
static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
u32 repeat)
{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
int err = 0, act, ret, i, nframes = 0, batch_sz;
struct xdp_frame **frames = xdp->frames;
struct xdp_page_head *head;
struct xdp_frame *frm;
bool redirect = false;
struct xdp_buff *ctx;
struct page *page;
batch_sz = min_t(u32, repeat, xdp->batch_size);
local_bh_disable();
xdp_set_return_frame_no_direct();
for (i = 0; i < batch_sz; i++) {
page = page_pool_dev_alloc_pages(xdp->pp);
if (!page) {
err = -ENOMEM;
goto out;
}
head = phys_to_virt(page_to_phys(page));
reset_ctx(head);
ctx = &head->ctx;
frm = &head->frm;
xdp->frame_cnt++;
act = bpf_prog_run_xdp(prog, ctx);
/* if program changed pkt bounds we need to update the xdp_frame */
if (unlikely(ctx_was_changed(head))) {
ret = xdp_update_frame_from_buff(ctx, frm);
if (ret) {
xdp_return_buff(ctx);
continue;
}
}
switch (act) {
case XDP_TX:
/* we can't do a real XDP_TX since we're not in the
* driver, so turn it into a REDIRECT back to the same
* index
*/
ri->tgt_index = xdp->dev->ifindex;
ri->map_id = INT_MAX;
ri->map_type = BPF_MAP_TYPE_UNSPEC;
fallthrough;
case XDP_REDIRECT:
redirect = true;
ret = xdp_do_redirect_frame(xdp->dev, ctx, frm, prog);
if (ret)
xdp_return_buff(ctx);
break;
case XDP_PASS:
frames[nframes++] = frm;
break;
default:
bpf_warn_invalid_xdp_action(NULL, prog, act);
fallthrough;
case XDP_DROP:
xdp_return_buff(ctx);
break;
}
}
out:
if (redirect)
xdp_do_flush();
if (nframes) {
ret = xdp_recv_frames(frames, nframes, xdp->skbs, xdp->dev);
if (ret)
err = ret;
}
xdp_clear_return_frame_no_direct();
local_bh_enable();
return err;
}
static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
u32 repeat, u32 batch_size, u32 *time)
{
struct xdp_test_data xdp = { .batch_size = batch_size };
struct bpf_test_timer t = { .mode = NO_MIGRATE };
int ret;
if (!repeat)
repeat = 1;
ret = xdp_test_run_setup(&xdp, ctx);
if (ret)
return ret;
bpf_test_timer_enter(&t);
do {
xdp.frame_cnt = 0;
ret = xdp_test_run_batch(&xdp, prog, repeat - t.i);
if (unlikely(ret < 0))
break;
} while (bpf_test_timer_continue(&t, xdp.frame_cnt, repeat, &ret, time));
bpf_test_timer_leave(&t);
xdp_test_run_teardown(&xdp);
return ret;
}
static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
u32 *retval, u32 *time, bool xdp)
{
@ -119,7 +401,7 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
*retval = bpf_prog_run_xdp(prog, ctx);
else
*retval = bpf_prog_run(prog, ctx);
} while (bpf_test_timer_continue(&t, repeat, &ret, time));
} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
bpf_reset_run_ctx(old_ctx);
bpf_test_timer_leave(&t);
@ -446,7 +728,7 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog,
int b = 2, err = -EFAULT;
u32 retval = 0;
if (kattr->test.flags || kattr->test.cpu)
if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size)
return -EINVAL;
switch (prog->expected_attach_type) {
@ -510,7 +792,7 @@ int bpf_prog_test_run_raw_tp(struct bpf_prog *prog,
/* doesn't support data_in/out, ctx_out, duration, or repeat */
if (kattr->test.data_in || kattr->test.data_out ||
kattr->test.ctx_out || kattr->test.duration ||
kattr->test.repeat)
kattr->test.repeat || kattr->test.batch_size)
return -EINVAL;
if (ctx_size_in < prog->aux->max_ctx_offset ||
@ -741,7 +1023,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
void *data;
int ret;
if (kattr->test.flags || kattr->test.cpu)
if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size)
return -EINVAL;
data = bpf_test_init(kattr, kattr->test.data_size_in,
@ -922,7 +1204,9 @@ static void xdp_convert_buff_to_md(struct xdp_buff *xdp, struct xdp_md *xdp_md)
int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
union bpf_attr __user *uattr)
{
bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES);
u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
u32 batch_size = kattr->test.batch_size;
u32 size = kattr->test.data_size_in;
u32 headroom = XDP_PACKET_HEADROOM;
u32 retval, duration, max_data_sz;
@ -938,6 +1222,18 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
prog->expected_attach_type == BPF_XDP_CPUMAP)
return -EINVAL;
if (kattr->test.flags & ~BPF_F_TEST_XDP_LIVE_FRAMES)
return -EINVAL;
if (do_live) {
if (!batch_size)
batch_size = NAPI_POLL_WEIGHT;
else if (batch_size > TEST_XDP_MAX_BATCH)
return -E2BIG;
} else if (batch_size) {
return -EINVAL;
}
ctx = bpf_ctx_init(kattr, sizeof(struct xdp_md));
if (IS_ERR(ctx))
return PTR_ERR(ctx);
@ -946,14 +1242,20 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
/* There can't be user provided data before the meta data */
if (ctx->data_meta || ctx->data_end != size ||
ctx->data > ctx->data_end ||
unlikely(xdp_metalen_invalid(ctx->data)))
unlikely(xdp_metalen_invalid(ctx->data)) ||
(do_live && (kattr->test.data_out || kattr->test.ctx_out)))
goto free_ctx;
/* Meta data is allocated from the headroom */
headroom -= ctx->data;
}
max_data_sz = 4096 - headroom - tailroom;
size = min_t(u32, size, max_data_sz);
if (size > max_data_sz) {
/* disallow live data mode for jumbo frames */
if (do_live)
goto free_ctx;
size = max_data_sz;
}
data = bpf_test_init(kattr, size, max_data_sz, headroom, tailroom);
if (IS_ERR(data)) {
@ -1011,7 +1313,10 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
if (repeat > 1)
bpf_prog_change_xdp(NULL, prog);
ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true);
if (do_live)
ret = bpf_test_run_xdp_live(prog, &xdp, repeat, batch_size, &duration);
else
ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true);
/* We convert the xdp_buff back to an xdp_md before checking the return
* code so the reference count of any held netdevice will be decremented
* even if the test run failed.
@ -1073,7 +1378,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
return -EINVAL;
if (kattr->test.flags || kattr->test.cpu)
if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size)
return -EINVAL;
if (size < ETH_HLEN)
@ -1108,7 +1413,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
do {
retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
size, flags);
} while (bpf_test_timer_continue(&t, repeat, &ret, &duration));
} while (bpf_test_timer_continue(&t, 1, repeat, &ret, &duration));
bpf_test_timer_leave(&t);
if (ret < 0)
@ -1140,7 +1445,7 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
if (prog->type != BPF_PROG_TYPE_SK_LOOKUP)
return -EINVAL;
if (kattr->test.flags || kattr->test.cpu)
if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size)
return -EINVAL;
if (kattr->test.data_in || kattr->test.data_size_in || kattr->test.data_out ||
@ -1203,7 +1508,7 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
do {
ctx.selected_sk = NULL;
retval = BPF_PROG_SK_LOOKUP_RUN_ARRAY(progs, ctx, bpf_prog_run);
} while (bpf_test_timer_continue(&t, repeat, &ret, &duration));
} while (bpf_test_timer_continue(&t, 1, repeat, &ret, &duration));
bpf_test_timer_leave(&t);
if (ret < 0)
@ -1242,7 +1547,8 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
/* doesn't support data_in/out, ctx_out, duration, or repeat or flags */
if (kattr->test.data_in || kattr->test.data_out ||
kattr->test.ctx_out || kattr->test.duration ||
kattr->test.repeat || kattr->test.flags)
kattr->test.repeat || kattr->test.flags ||
kattr->test.batch_size)
return -EINVAL;
if (ctx_size_in < prog->aux->max_ctx_offset ||

View File

@ -1232,6 +1232,8 @@ enum {
/* If set, run the test on the cpu specified by bpf_attr.test.cpu */
#define BPF_F_TEST_RUN_ON_CPU (1U << 0)
/* If set, XDP frames will be transmitted after processing */
#define BPF_F_TEST_XDP_LIVE_FRAMES (1U << 1)
/* type for BPF_ENABLE_STATS */
enum bpf_stats_type {
@ -1393,6 +1395,7 @@ union bpf_attr {
__aligned_u64 ctx_out;
__u32 flags;
__u32 cpu;
__u32 batch_size;
} test;
struct { /* anonymous struct used by BPF_*_GET_*_ID */

View File

@ -995,6 +995,7 @@ int bpf_prog_test_run_opts(int prog_fd, struct bpf_test_run_opts *opts)
memset(&attr, 0, sizeof(attr));
attr.test.prog_fd = prog_fd;
attr.test.batch_size = OPTS_GET(opts, batch_size, 0);
attr.test.cpu = OPTS_GET(opts, cpu, 0);
attr.test.flags = OPTS_GET(opts, flags, 0);
attr.test.repeat = OPTS_GET(opts, repeat, 0);

View File

@ -512,8 +512,9 @@ struct bpf_test_run_opts {
__u32 duration; /* out: average per repetition in ns */
__u32 flags;
__u32 cpu;
__u32 batch_size;
};
#define bpf_test_run_opts__last_field cpu
#define bpf_test_run_opts__last_field batch_size
LIBBPF_API int bpf_prog_test_run_opts(int prog_fd,
struct bpf_test_run_opts *opts);

View File

@ -1,18 +1,25 @@
// SPDX-License-Identifier: GPL-2.0-only
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sched.h>
#include <arpa/inet.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <linux/err.h>
#include <linux/in.h>
#include <linux/in6.h>
#include <linux/limits.h>
#include "bpf_util.h"
#include "network_helpers.h"
#include "test_progs.h"
#define clean_errno() (errno == 0 ? "None" : strerror(errno))
#define log_err(MSG, ...) ({ \
@ -356,3 +363,82 @@ char *ping_command(int family)
}
return "ping";
}
struct nstoken {
int orig_netns_fd;
};
static int setns_by_fd(int nsfd)
{
int err;
err = setns(nsfd, CLONE_NEWNET);
close(nsfd);
if (!ASSERT_OK(err, "setns"))
return err;
/* Switch /sys to the new namespace so that e.g. /sys/class/net
* reflects the devices in the new namespace.
*/
err = unshare(CLONE_NEWNS);
if (!ASSERT_OK(err, "unshare"))
return err;
/* Make our /sys mount private, so the following umount won't
* trigger the global umount in case it's shared.
*/
err = mount("none", "/sys", NULL, MS_PRIVATE, NULL);
if (!ASSERT_OK(err, "remount private /sys"))
return err;
err = umount2("/sys", MNT_DETACH);
if (!ASSERT_OK(err, "umount2 /sys"))
return err;
err = mount("sysfs", "/sys", "sysfs", 0, NULL);
if (!ASSERT_OK(err, "mount /sys"))
return err;
err = mount("bpffs", "/sys/fs/bpf", "bpf", 0, NULL);
if (!ASSERT_OK(err, "mount /sys/fs/bpf"))
return err;
return 0;
}
struct nstoken *open_netns(const char *name)
{
int nsfd;
char nspath[PATH_MAX];
int err;
struct nstoken *token;
token = malloc(sizeof(struct nstoken));
if (!ASSERT_OK_PTR(token, "malloc token"))
return NULL;
token->orig_netns_fd = open("/proc/self/ns/net", O_RDONLY);
if (!ASSERT_GE(token->orig_netns_fd, 0, "open /proc/self/ns/net"))
goto fail;
snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
if (!ASSERT_GE(nsfd, 0, "open netns fd"))
goto fail;
err = setns_by_fd(nsfd);
if (!ASSERT_OK(err, "setns_by_fd"))
goto fail;
return token;
fail:
free(token);
return NULL;
}
void close_netns(struct nstoken *token)
{
ASSERT_OK(setns_by_fd(token->orig_netns_fd), "setns_by_fd");
free(token);
}

View File

@ -55,4 +55,13 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
struct sockaddr_storage *addr, socklen_t *len);
char *ping_command(int family);
struct nstoken;
/**
* open_netns() - Switch to specified network namespace by name.
*
* Returns token with which to restore the original namespace
* using close_netns().
*/
struct nstoken *open_netns(const char *name);
void close_netns(struct nstoken *token);
#endif

View File

@ -10,8 +10,6 @@
* to drop unexpected traffic.
*/
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <linux/if.h>
#include <linux/if_tun.h>
@ -19,10 +17,8 @@
#include <linux/sysctl.h>
#include <linux/time_types.h>
#include <linux/net_tstamp.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>
@ -92,91 +88,6 @@ static int write_file(const char *path, const char *newval)
return 0;
}
struct nstoken {
int orig_netns_fd;
};
static int setns_by_fd(int nsfd)
{
int err;
err = setns(nsfd, CLONE_NEWNET);
close(nsfd);
if (!ASSERT_OK(err, "setns"))
return err;
/* Switch /sys to the new namespace so that e.g. /sys/class/net
* reflects the devices in the new namespace.
*/
err = unshare(CLONE_NEWNS);
if (!ASSERT_OK(err, "unshare"))
return err;
/* Make our /sys mount private, so the following umount won't
* trigger the global umount in case it's shared.
*/
err = mount("none", "/sys", NULL, MS_PRIVATE, NULL);
if (!ASSERT_OK(err, "remount private /sys"))
return err;
err = umount2("/sys", MNT_DETACH);
if (!ASSERT_OK(err, "umount2 /sys"))
return err;
err = mount("sysfs", "/sys", "sysfs", 0, NULL);
if (!ASSERT_OK(err, "mount /sys"))
return err;
err = mount("bpffs", "/sys/fs/bpf", "bpf", 0, NULL);
if (!ASSERT_OK(err, "mount /sys/fs/bpf"))
return err;
return 0;
}
/**
* open_netns() - Switch to specified network namespace by name.
*
* Returns token with which to restore the original namespace
* using close_netns().
*/
static struct nstoken *open_netns(const char *name)
{
int nsfd;
char nspath[PATH_MAX];
int err;
struct nstoken *token;
token = calloc(1, sizeof(struct nstoken));
if (!ASSERT_OK_PTR(token, "malloc token"))
return NULL;
token->orig_netns_fd = open("/proc/self/ns/net", O_RDONLY);
if (!ASSERT_GE(token->orig_netns_fd, 0, "open /proc/self/ns/net"))
goto fail;
snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
if (!ASSERT_GE(nsfd, 0, "open netns fd"))
goto fail;
err = setns_by_fd(nsfd);
if (!ASSERT_OK(err, "setns_by_fd"))
goto fail;
return token;
fail:
free(token);
return NULL;
}
static void close_netns(struct nstoken *token)
{
ASSERT_OK(setns_by_fd(token->orig_netns_fd), "setns_by_fd");
free(token);
}
static int netns_setup_namespaces(const char *verb)
{
const char * const *ns = namespaces;

View File

@ -0,0 +1,177 @@
// SPDX-License-Identifier: GPL-2.0
#include <test_progs.h>
#include <network_helpers.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/ipv6.h>
#include <linux/in6.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>
#include "test_xdp_do_redirect.skel.h"
#define SYS(fmt, ...) \
({ \
char cmd[1024]; \
snprintf(cmd, sizeof(cmd), fmt, ##__VA_ARGS__); \
if (!ASSERT_OK(system(cmd), cmd)) \
goto out; \
})
struct udp_packet {
struct ethhdr eth;
struct ipv6hdr iph;
struct udphdr udp;
__u8 payload[64 - sizeof(struct udphdr)
- sizeof(struct ethhdr) - sizeof(struct ipv6hdr)];
} __packed;
static struct udp_packet pkt_udp = {
.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
.eth.h_dest = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55},
.eth.h_source = {0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb},
.iph.version = 6,
.iph.nexthdr = IPPROTO_UDP,
.iph.payload_len = bpf_htons(sizeof(struct udp_packet)
- offsetof(struct udp_packet, udp)),
.iph.hop_limit = 2,
.iph.saddr.s6_addr16 = {bpf_htons(0xfc00), 0, 0, 0, 0, 0, 0, bpf_htons(1)},
.iph.daddr.s6_addr16 = {bpf_htons(0xfc00), 0, 0, 0, 0, 0, 0, bpf_htons(2)},
.udp.source = bpf_htons(1),
.udp.dest = bpf_htons(1),
.udp.len = bpf_htons(sizeof(struct udp_packet)
- offsetof(struct udp_packet, udp)),
.payload = {0x42}, /* receiver XDP program matches on this */
};
static int attach_tc_prog(struct bpf_tc_hook *hook, int fd)
{
DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .handle = 1, .priority = 1, .prog_fd = fd);
int ret;
ret = bpf_tc_hook_create(hook);
if (!ASSERT_OK(ret, "create tc hook"))
return ret;
ret = bpf_tc_attach(hook, &opts);
if (!ASSERT_OK(ret, "bpf_tc_attach")) {
bpf_tc_hook_destroy(hook);
return ret;
}
return 0;
}
#define NUM_PKTS 10000
void test_xdp_do_redirect(void)
{
int err, xdp_prog_fd, tc_prog_fd, ifindex_src, ifindex_dst;
char data[sizeof(pkt_udp) + sizeof(__u32)];
struct test_xdp_do_redirect *skel = NULL;
struct nstoken *nstoken = NULL;
struct bpf_link *link;
struct xdp_md ctx_in = { .data = sizeof(__u32),
.data_end = sizeof(data) };
DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
.data_in = &data,
.data_size_in = sizeof(data),
.ctx_in = &ctx_in,
.ctx_size_in = sizeof(ctx_in),
.flags = BPF_F_TEST_XDP_LIVE_FRAMES,
.repeat = NUM_PKTS,
.batch_size = 64,
);
DECLARE_LIBBPF_OPTS(bpf_tc_hook, tc_hook,
.attach_point = BPF_TC_INGRESS);
memcpy(&data[sizeof(__u32)], &pkt_udp, sizeof(pkt_udp));
*((__u32 *)data) = 0x42; /* metadata test value */
skel = test_xdp_do_redirect__open();
if (!ASSERT_OK_PTR(skel, "skel"))
return;
/* The XDP program we run with bpf_prog_run() will cycle through all
* three xmit (PASS/TX/REDIRECT) return codes starting from above, and
* ending up with PASS, so we should end up with two packets on the dst
* iface and NUM_PKTS-2 in the TC hook. We match the packets on the UDP
* payload.
*/
SYS("ip netns add testns");
nstoken = open_netns("testns");
if (!ASSERT_OK_PTR(nstoken, "setns"))
goto out;
SYS("ip link add veth_src type veth peer name veth_dst");
SYS("ip link set dev veth_src address 00:11:22:33:44:55");
SYS("ip link set dev veth_dst address 66:77:88:99:aa:bb");
SYS("ip link set dev veth_src up");
SYS("ip link set dev veth_dst up");
SYS("ip addr add dev veth_src fc00::1/64");
SYS("ip addr add dev veth_dst fc00::2/64");
SYS("ip neigh add fc00::2 dev veth_src lladdr 66:77:88:99:aa:bb");
/* We enable forwarding in the test namespace because that will cause
* the packets that go through the kernel stack (with XDP_PASS) to be
* forwarded back out the same interface (because of the packet dst
* combined with the interface addresses). When this happens, the
* regular forwarding path will end up going through the same
* veth_xdp_xmit() call as the XDP_REDIRECT code, which can cause a
* deadlock if it happens on the same CPU. There's a local_bh_disable()
* in the test_run code to prevent this, but an earlier version of the
* code didn't have this, so we keep the test behaviour to make sure the
* bug doesn't resurface.
*/
SYS("sysctl -qw net.ipv6.conf.all.forwarding=1");
ifindex_src = if_nametoindex("veth_src");
ifindex_dst = if_nametoindex("veth_dst");
if (!ASSERT_NEQ(ifindex_src, 0, "ifindex_src") ||
!ASSERT_NEQ(ifindex_dst, 0, "ifindex_dst"))
goto out;
memcpy(skel->rodata->expect_dst, &pkt_udp.eth.h_dest, ETH_ALEN);
skel->rodata->ifindex_out = ifindex_src; /* redirect back to the same iface */
skel->rodata->ifindex_in = ifindex_src;
ctx_in.ingress_ifindex = ifindex_src;
tc_hook.ifindex = ifindex_src;
if (!ASSERT_OK(test_xdp_do_redirect__load(skel), "load"))
goto out;
link = bpf_program__attach_xdp(skel->progs.xdp_count_pkts, ifindex_dst);
if (!ASSERT_OK_PTR(link, "prog_attach"))
goto out;
skel->links.xdp_count_pkts = link;
tc_prog_fd = bpf_program__fd(skel->progs.tc_count_pkts);
if (attach_tc_prog(&tc_hook, tc_prog_fd))
goto out;
xdp_prog_fd = bpf_program__fd(skel->progs.xdp_redirect);
err = bpf_prog_test_run_opts(xdp_prog_fd, &opts);
if (!ASSERT_OK(err, "prog_run"))
goto out_tc;
/* wait for the packets to be flushed */
kern_sync_rcu();
/* There will be one packet sent through XDP_REDIRECT and one through
* XDP_TX; these will show up on the XDP counting program, while the
* rest will be counted at the TC ingress hook (and the counting program
* resets the packet payload so they don't get counted twice even though
* they are re-xmited out the veth device
*/
ASSERT_EQ(skel->bss->pkts_seen_xdp, 2, "pkt_count_xdp");
ASSERT_EQ(skel->bss->pkts_seen_zero, 2, "pkt_count_zero");
ASSERT_EQ(skel->bss->pkts_seen_tc, NUM_PKTS - 2, "pkt_count_tc");
out_tc:
bpf_tc_hook_destroy(&tc_hook);
out:
if (nstoken)
close_netns(nstoken);
system("ip netns del testns");
test_xdp_do_redirect__destroy(skel);
}

View File

@ -0,0 +1,100 @@
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#define ETH_ALEN 6
#define HDR_SZ (sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + sizeof(struct udphdr))
const volatile int ifindex_out;
const volatile int ifindex_in;
const volatile __u8 expect_dst[ETH_ALEN];
volatile int pkts_seen_xdp = 0;
volatile int pkts_seen_zero = 0;
volatile int pkts_seen_tc = 0;
volatile int retcode = XDP_REDIRECT;
SEC("xdp")
int xdp_redirect(struct xdp_md *xdp)
{
__u32 *metadata = (void *)(long)xdp->data_meta;
void *data_end = (void *)(long)xdp->data_end;
void *data = (void *)(long)xdp->data;
__u8 *payload = data + HDR_SZ;
int ret = retcode;
if (payload + 1 > data_end)
return XDP_ABORTED;
if (xdp->ingress_ifindex != ifindex_in)
return XDP_ABORTED;
if (metadata + 1 > data)
return XDP_ABORTED;
if (*metadata != 0x42)
return XDP_ABORTED;
if (*payload == 0) {
*payload = 0x42;
pkts_seen_zero++;
}
if (bpf_xdp_adjust_meta(xdp, 4))
return XDP_ABORTED;
if (retcode > XDP_PASS)
retcode--;
if (ret == XDP_REDIRECT)
return bpf_redirect(ifindex_out, 0);
return ret;
}
static bool check_pkt(void *data, void *data_end)
{
struct ipv6hdr *iph = data + sizeof(struct ethhdr);
__u8 *payload = data + HDR_SZ;
if (payload + 1 > data_end)
return false;
if (iph->nexthdr != IPPROTO_UDP || *payload != 0x42)
return false;
/* reset the payload so the same packet doesn't get counted twice when
* it cycles back through the kernel path and out the dst veth
*/
*payload = 0;
return true;
}
SEC("xdp")
int xdp_count_pkts(struct xdp_md *xdp)
{
void *data = (void *)(long)xdp->data;
void *data_end = (void *)(long)xdp->data_end;
if (check_pkt(data, data_end))
pkts_seen_xdp++;
/* Return XDP_DROP to make sure the data page is recycled, like when it
* exits a physical NIC. Recycled pages will be counted in the
* pkts_seen_zero counter above.
*/
return XDP_DROP;
}
SEC("tc")
int tc_count_pkts(struct __sk_buff *skb)
{
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
if (check_pkt(data, data_end))
pkts_seen_tc++;
return 0;
}
char _license[] SEC("license") = "GPL";