LIBBPF_API and DECLARE_LIBBPF_OPTS are needed in many public libbpf API
headers. Extract them into libbpf_common.h to avoid unnecessary
interdependencies between btf.h, libbpf.h, and bpf.h, as well as code duplication.
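For illustration, a rough sketch of what the shared header carries (the
authoritative definitions live in tools/lib/bpf/libbpf_common.h):

#ifndef __LIBBPF_LIBBPF_COMMON_H
#define __LIBBPF_LIBBPF_COMMON_H

#include <string.h>

#ifndef LIBBPF_API
#define LIBBPF_API __attribute__((visibility("default")))
#endif

/* Declare an _opts struct with .sz filled in and everything else zeroed,
 * so libbpf can tell which fields the caller's binary knows about. */
#define DECLARE_LIBBPF_OPTS(TYPE, NAME, ...)            \
        struct TYPE NAME = ({                           \
                memset(&NAME, 0, sizeof(struct TYPE));  \
                (struct TYPE) {                         \
                        .sz = sizeof(struct TYPE),      \
                        __VA_ARGS__                     \
                };                                      \
        })

#endif /* __LIBBPF_LIBBPF_COMMON_H */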
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191214014341.3442258-6-andriin@fb.com
Add a convenience macro BPF_EMBED_OBJ, which allows embedding other files
(typically BPF .o files) into a hosting userspace program. To a C program
the embedded data is exposed as a struct bpf_embed_data, containing a
pointer to the raw data and its size in bytes.
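For illustration, a minimal usage sketch; the object file name is made up,
and the embedded blob is assumed to be exposed as <name>_embed, as in the
selftests:

#include <bpf/libbpf.h>

/* Embed the compiled BPF object into this binary at build time. */
BPF_EMBED_OBJ(myprog, "myprog.bpf.o");

int main(void)
{
        struct bpf_object *obj;

        /* Open the object straight from memory, no file I/O at run time. */
        obj = bpf_object__open_mem(myprog_embed.data, myprog_embed.size, NULL);
        if (libbpf_get_error(obj))
                return 1;
        /* ... bpf_object__load(), attach, etc. ... */
        bpf_object__close(obj);
        return 0;
}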
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191214014341.3442258-5-andriin@fb.com
A few libbpf APIs are not public but are currently exposed through libbpf.h
for use by bpftool. Move them to libbpf_internal.h, where the intent of being
non-stable and non-public is much more obvious.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191214014341.3442258-4-andriin@fb.com
Generalize BPF program attaching and allow libbpf to auto-detect the program
type (and extra parameters, where applicable) and attach supported BPF program
types based on their program sections. Currently this is supported for:
- kprobe/kretprobe;
- tracepoint;
- raw tracepoint;
- tracing programs (typed raw TP/fentry/fexit).
Support for more types can be trivially added within this framework.
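For illustration, a sketch of the intended usage; only bpf_program__attach()
itself is the new API, the surrounding loop is made up:

#include <bpf/libbpf.h>

/* Attach every program of an already-loaded object, letting libbpf pick the
 * attach method from the section name, e.g. SEC("kprobe/...") or
 * SEC("tracepoint/..."). */
static int attach_all(struct bpf_object *obj)
{
        struct bpf_program *prog;

        bpf_object__for_each_program(prog, obj) {
                struct bpf_link *link = bpf_program__attach(prog);

                if (libbpf_get_error(link))
                        return -1;
        }
        return 0;
}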
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191214014341.3442258-3-andriin@fb.com
Reorganize the bpf_object__open and bpf_object__load steps such that
bpf_object__open doesn't need root access. Previously, open needed privileges
because it performed feature probing and BTF sanitization. Those steps don't
have to happen on open, though, so move them into the load phase.
This is important because it makes it possible for tools like bpftool to just
open a BPF object file and inspect its contents: programs, maps, BTF, etc. For
such operations it is prohibitive to require root access. On the other hand,
there is a lot of custom libbpf logic in those steps, so it is best not to
force tools to reimplement all of that on their own.
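A sketch of the resulting workflow (the file name is made up; only the
open/inspect part is shown, which now works unprivileged):

#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
        struct bpf_object *obj = bpf_object__open("myprog.bpf.o");
        struct bpf_program *prog;
        struct bpf_map *map;

        if (libbpf_get_error(obj))
                return 1;

        /* No root needed up to this point. */
        bpf_object__for_each_program(prog, obj)
                printf("prog: %s\n", bpf_program__title(prog, false));
        bpf_object__for_each_map(map, obj)
                printf("map: %s\n", bpf_map__name(map));

        /* Feature probing, BTF sanitization and the actual program loading
         * now all happen in bpf_object__load(), which does need privileges. */
        bpf_object__close(obj);
        return 0;
}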
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191214014341.3442258-2-andriin@fb.com
Fedora binutils has been patched to show "other info" for a symbol at the
end of the line. This was done in order to support unmaintained scripts
that would break with the extra info. [1]
[1] b8265c46f7
This in turn has been done to fix the build of ruby, because of checksec.
[2] Thanks Michael Ellerman for the pointer.
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1479302
As the libbpf Makefile is not unmaintained, we can simply deal with either
output format by just removing the "other info" field, as it always comes
inside brackets.
Fixes: 3464afdf11 (libbpf: Fix readelf output parsing on powerpc with recent binutils)
Reported-by: Justin Forbes <jmforbes@linuxtx.org>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Link: https://lore.kernel.org/bpf/20191213101114.GA3986@calabresa
Paul Chaignon says:
====================
When working with frequently modified BPF programs, both the ID and the
tag may change. bpftool currently doesn't provide a "stable" way to match
such programs. This patchset allows bpftool to match programs and maps by
name.
When given a tag that matches several programs, bpftool currently only
considers the first match. The first patch changes that behavior to
either process all matching programs (for the show and dump commands) or
error out. The second patch implements program lookup by name, with the
same behavior as for tags in case of ambiguity. The last patch implements
map lookup by name.
Changelogs:
Changes in v2:
- Fix buffer overflow after realloc.
- Add example output to commit message.
- Properly close JSON arrays on errors.
- Fix style errors (line breaks, for loops, exit labels, type for
tagname).
- Move do_show code for argc == 2 to do_show_subset functions.
- Rebase.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements lookup by name for maps and changes the behavior of
lookups by tag to be consistent with prog subcommands. Similarly to
program subcommands, the show and dump commands will return all maps with
the given name (or tag), whereas other commands will error out if several
maps have the same name (resp. tag).
When a map has BTF info, it is dumped in JSON with available BTF info.
This patch requires that all matched maps have BTF info before switching
the output format to JSON.
Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/8de1c9f273860b3ea1680502928f4da2336b853e.1576263640.git.paul.chaignon@gmail.com
When working with frequently modified BPF programs, both the ID and the
tag may change. bpftool currently doesn't provide a "stable" way to match
such programs.
This patch implements lookup by name for programs. The show and dump
commands will return all programs with the given name, whereas other
commands will error out if several programs have the same name.
Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Link: https://lore.kernel.org/bpf/b5fc1a5dcfaeb5f16fc80295cdaa606dd2d91534.1576263640.git.paul.chaignon@gmail.com
When several BPF programs have the same tag, bpftool matches only the
first (in ID order). This patch changes that behavior such that dump and
show commands return all matched programs. Commands that require a single
program (e.g., pin and attach) will error out if given a tag that matches
several. bpftool prog dump will also error out if file or visual are
given and several programs have the given tag.
In the case of the dump command, a program header is added before each
dump only if the tag matches several programs; this patch doesn't change
the output if a single program matches. The output when several
programs match thus looks as follows.
$ ./bpftool prog dump xlated tag 6deef7357e7b4530
3: cgroup_skb tag 6deef7357e7b4530 gpl
0: (bf) r6 = r1
[...]
7: (95) exit
4: cgroup_skb tag 6deef7357e7b4530 gpl
0: (bf) r6 = r1
[...]
7: (95) exit
Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/fb1fe943202659a69cd21dd5b907c205af1e1e22.1576263640.git.paul.chaignon@gmail.com
Make sure we can pass arbitrary data in wire_len/gso_segs.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191213223028.161282-2-sdf@google.com
wire_len should not be less than real len and is capped by GSO_MAX_SIZE.
gso_segs is capped by GSO_MAX_SEGS.
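A sketch of those checks (an illustrative helper, not the exact
bpf_prog_test_run code):

#include <linux/errno.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>    /* GSO_MAX_SIZE, GSO_MAX_SEGS */
#include <linux/bpf.h>          /* struct __sk_buff */

static int check_skb_meta(const struct sk_buff *skb,
                          const struct __sk_buff *user)
{
        /* wire_len == 0 means "use skb->len" (see the v2 note below) */
        if (user->wire_len &&
            (user->wire_len < skb->len || user->wire_len > GSO_MAX_SIZE))
                return -EINVAL;
        if (user->gso_segs > GSO_MAX_SEGS)
                return -EINVAL;
        return 0;
}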
v2:
* set wire_len to skb->len when passed wire_len is 0 (Alexei Starovoitov)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191213223028.161282-1-sdf@google.com
Björn Töpel says:
====================
Overview
========
This is the 6th iteration of the series that introduces the BPF
dispatcher, which is a mechanism to avoid indirect calls.
The BPF dispatcher is a multi-way branch code generator, targeted for
BPF programs. E.g. when an XDP program is executed via the
bpf_prog_run_xdp(), it is invoked via an indirect call. With
retpolines enabled, the indirect call has a substantial performance
impact. The dispatcher is a mechanism that transforms indirect calls into
direct calls, and therefore avoids the retpoline. The dispatcher is
generated using the BPF JIT, and relies on text poking provided by
bpf_arch_text_poke().
The dispatcher hijacks a trampoline function via its __fentry__ nop. One
dispatcher instance currently supports up to 48 dispatch points. This can
be extended in the future.
In this series, only one dispatcher instance is supported, and the
only user is XDP. The dispatcher is updated when an XDP program is
attached/detached to/from a netdev. An alternative could have been to
update the dispatcher at program load time, but since there are usually
more XDP programs loaded than attached, updating on attach/detach was
picked.
The XDP dispatcher is always enabled, if available, because it helps
even when retpolines are disabled. Please refer to the "Performance"
section below.
The first patch refactors the image allocation from the BPF trampoline
code. Patch two introduces the dispatcher, and patch three adds a
dispatcher for XDP, and wires up the XDP control-/ fast-path. Patch
four adds the dispatcher to BPF_TEST_RUN. Patch five adds a simple
selftest, and the last adds alignment to jump targets.
I have rebased the series on commit 679152d3a3 ("libbpf: Fix printf
compilation warnings on ppc64le arch").
Generated code, x86-64
======================
The dispatcher currently has a maximum of 48 entries, where one entry
is a unique BPF program. Multiple users of a dispatcher instance using
the same BPF program will share that entry.
The program/slot lookup is performed by a binary search, O(log
n). Let's have a look at the generated code.
The trampoline function has the following signature:
unsigned int tramp(const void *ctx,
                   const struct bpf_insn *insnsi,
                   unsigned int (*bpf_func)(const void *,
                                            const struct bpf_insn *))
On Intel x86-64 this means that rdx will contain the bpf_func. To make
it easier to read, I've let the BPF programs have the following
range: 0xffffffffffffffff (-1) to 0xfffffffffffffff0
(-16). 0xffffffff81c00f10 is the retpoline thunk, in this case
__x86_indirect_thunk_rdx. If retpolines are disabled the thunk will be
a regular indirect call.
The minimal dispatcher will then look like this:
ffffffffc0002000: cmp rdx,0xffffffffffffffff
ffffffffc0002007: je 0xffffffffffffffff ; -1
ffffffffc000200d: jmp 0xffffffff81c00f10
A 16 entry dispatcher looks like this:
ffffffffc0020000: cmp rdx,0xfffffffffffffff7 ; -9
ffffffffc0020007: jg 0xffffffffc0020130
ffffffffc002000d: cmp rdx,0xfffffffffffffff3 ; -13
ffffffffc0020014: jg 0xffffffffc00200a0
ffffffffc002001a: cmp rdx,0xfffffffffffffff1 ; -15
ffffffffc0020021: jg 0xffffffffc0020060
ffffffffc0020023: cmp rdx,0xfffffffffffffff0 ; -16
ffffffffc002002a: jg 0xffffffffc0020040
ffffffffc002002c: cmp rdx,0xfffffffffffffff0 ; -16
ffffffffc0020033: je 0xfffffffffffffff0 ; -16
ffffffffc0020039: jmp 0xffffffff81c00f10
ffffffffc002003e: xchg ax,ax
ffffffffc0020040: cmp rdx,0xfffffffffffffff1 ; -15
ffffffffc0020047: je 0xfffffffffffffff1 ; -15
ffffffffc002004d: jmp 0xffffffff81c00f10
ffffffffc0020052: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc002005a: nop WORD PTR [rax+rax*1+0x0]
ffffffffc0020060: cmp rdx,0xfffffffffffffff2 ; -14
ffffffffc0020067: jg 0xffffffffc0020080
ffffffffc0020069: cmp rdx,0xfffffffffffffff2 ; -14
ffffffffc0020070: je 0xfffffffffffffff2 ; -14
ffffffffc0020076: jmp 0xffffffff81c00f10
ffffffffc002007b: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc0020080: cmp rdx,0xfffffffffffffff3 ; -13
ffffffffc0020087: je 0xfffffffffffffff3 ; -13
ffffffffc002008d: jmp 0xffffffff81c00f10
ffffffffc0020092: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc002009a: nop WORD PTR [rax+rax*1+0x0]
ffffffffc00200a0: cmp rdx,0xfffffffffffffff5 ; -11
ffffffffc00200a7: jg 0xffffffffc00200f0
ffffffffc00200a9: cmp rdx,0xfffffffffffffff4 ; -12
ffffffffc00200b0: jg 0xffffffffc00200d0
ffffffffc00200b2: cmp rdx,0xfffffffffffffff4 ; -12
ffffffffc00200b9: je 0xfffffffffffffff4 ; -12
ffffffffc00200bf: jmp 0xffffffff81c00f10
ffffffffc00200c4: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc00200cc: nop DWORD PTR [rax+0x0]
ffffffffc00200d0: cmp rdx,0xfffffffffffffff5 ; -11
ffffffffc00200d7: je 0xfffffffffffffff5 ; -11
ffffffffc00200dd: jmp 0xffffffff81c00f10
ffffffffc00200e2: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc00200ea: nop WORD PTR [rax+rax*1+0x0]
ffffffffc00200f0: cmp rdx,0xfffffffffffffff6 ; -10
ffffffffc00200f7: jg 0xffffffffc0020110
ffffffffc00200f9: cmp rdx,0xfffffffffffffff6 ; -10
ffffffffc0020100: je 0xfffffffffffffff6 ; -10
ffffffffc0020106: jmp 0xffffffff81c00f10
ffffffffc002010b: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc0020110: cmp rdx,0xfffffffffffffff7 ; -9
ffffffffc0020117: je 0xfffffffffffffff7 ; -9
ffffffffc002011d: jmp 0xffffffff81c00f10
ffffffffc0020122: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc002012a: nop WORD PTR [rax+rax*1+0x0]
ffffffffc0020130: cmp rdx,0xfffffffffffffffb ; -5
ffffffffc0020137: jg 0xffffffffc00201d0
ffffffffc002013d: cmp rdx,0xfffffffffffffff9 ; -7
ffffffffc0020144: jg 0xffffffffc0020190
ffffffffc0020146: cmp rdx,0xfffffffffffffff8 ; -8
ffffffffc002014d: jg 0xffffffffc0020170
ffffffffc002014f: cmp rdx,0xfffffffffffffff8 ; -8
ffffffffc0020156: je 0xfffffffffffffff8 ; -8
ffffffffc002015c: jmp 0xffffffff81c00f10
ffffffffc0020161: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc0020169: nop DWORD PTR [rax+0x0]
ffffffffc0020170: cmp rdx,0xfffffffffffffff9 ; -7
ffffffffc0020177: je 0xfffffffffffffff9 ; -7
ffffffffc002017d: jmp 0xffffffff81c00f10
ffffffffc0020182: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc002018a: nop WORD PTR [rax+rax*1+0x0]
ffffffffc0020190: cmp rdx,0xfffffffffffffffa ; -6
ffffffffc0020197: jg 0xffffffffc00201b0
ffffffffc0020199: cmp rdx,0xfffffffffffffffa ; -6
ffffffffc00201a0: je 0xfffffffffffffffa ; -6
ffffffffc00201a6: jmp 0xffffffff81c00f10
ffffffffc00201ab: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc00201b0: cmp rdx,0xfffffffffffffffb ; -5
ffffffffc00201b7: je 0xfffffffffffffffb ; -5
ffffffffc00201bd: jmp 0xffffffff81c00f10
ffffffffc00201c2: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc00201ca: nop WORD PTR [rax+rax*1+0x0]
ffffffffc00201d0: cmp rdx,0xfffffffffffffffd ; -3
ffffffffc00201d7: jg 0xffffffffc0020220
ffffffffc00201d9: cmp rdx,0xfffffffffffffffc ; -4
ffffffffc00201e0: jg 0xffffffffc0020200
ffffffffc00201e2: cmp rdx,0xfffffffffffffffc ; -4
ffffffffc00201e9: je 0xfffffffffffffffc ; -4
ffffffffc00201ef: jmp 0xffffffff81c00f10
ffffffffc00201f4: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc00201fc: nop DWORD PTR [rax+0x0]
ffffffffc0020200: cmp rdx,0xfffffffffffffffd ; -3
ffffffffc0020207: je 0xfffffffffffffffd ; -3
ffffffffc002020d: jmp 0xffffffff81c00f10
ffffffffc0020212: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc002021a: nop WORD PTR [rax+rax*1+0x0]
ffffffffc0020220: cmp rdx,0xfffffffffffffffe ; -2
ffffffffc0020227: jg 0xffffffffc0020240
ffffffffc0020229: cmp rdx,0xfffffffffffffffe ; -2
ffffffffc0020230: je 0xfffffffffffffffe ; -2
ffffffffc0020236: jmp 0xffffffff81c00f10
ffffffffc002023b: nop DWORD PTR [rax+rax*1+0x0]
ffffffffc0020240: cmp rdx,0xffffffffffffffff ; -1
ffffffffc0020247: je 0xffffffffffffffff ; -1
ffffffffc002024d: jmp 0xffffffff81c00f10
The nops are there to align jump targets to 16 B.
Performance
===========
The tests were performed using the xdp_rxq_info sample program with
the following command-line:
1. XDP_DRV:
# xdp_rxq_info --dev eth0 --action XDP_DROP
2. XDP_SKB:
# xdp_rxq_info --dev eth0 -S --action XDP_DROP
3. xdp-perf, from selftests/bpf:
# test_progs -v -t xdp_perf
Run with mitigations=auto
-------------------------
Baseline:
1. 21.7 Mpps (21736190)
2. 3.8 Mpps (3837582)
3. 15 ns
Dispatcher:
1. 30.2 Mpps (30176320)
2. 4.0 Mpps (4015579)
3. 5 ns
Dispatcher (full; walk all entries, and fallback):
1. 22.0 Mpps (21986704)
2. 3.8 Mpps (3831298)
3. 17 ns
Run with mitigations=off
------------------------
Baseline:
1. 29.9 Mpps (29875135)
2. 4.1 Mpps (4100179)
3. 4 ns
Dispatcher:
1. 30.4 Mpps (30439241)
2. 4.1 Mpps (4109350)
3. 4 ns
Dispatcher (full; walk all entries, and fallback):
1. 28.9 Mpps (28903269)
2. 4.1 Mpps (4080078)
3. 5 ns
xdp-perf runs, aligned vs non-aligned jump targets
--------------------------------------------------
In this test dispatchers of different sizes, with and without jump
target alignment, were exercised. As outlined above the function
lookup is performed via binary search. This means that depending on
the pointer value of the function, it can reside in the upper or lower
part of the search table. The performed tests were:
1. aligned, mitigations=auto, function entry < other entries
2. aligned, mitigations=auto, function entry > other entries
3. non-aligned, mitigations=auto, function entry < other entries
4. non-aligned, mitigations=auto, function entry > other entries
5. aligned, mitigations=off, function entry < other entries
6. aligned, mitigations=off, function entry > other entries
7. non-aligned, mitigations=off, function entry < other entries
8. non-aligned, mitigations=off, function entry > other entries
The micro benchmarks showed that alignment of jump target has some
positive impact.
A reply to this cover letter will contain complete data for all runs.
Multiple xdp-perf baseline with mitigations=auto
------------------------------------------------
Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):
16.69 msec task-clock # 0.984 CPUs utilized ( +- 0.08% )
2 context-switches # 0.123 K/sec ( +- 1.11% )
0 cpu-migrations # 0.000 K/sec ( +- 70.68% )
97 page-faults # 0.006 M/sec ( +- 0.05% )
49,254,635 cycles # 2.951 GHz ( +- 0.09% ) (12.28%)
42,138,558 instructions # 0.86 insn per cycle ( +- 0.02% ) (36.15%)
7,315,291 branches # 438.300 M/sec ( +- 0.01% ) (59.43%)
1,011,201 branch-misses # 13.82% of all branches ( +- 0.01% ) (83.31%)
15,440,788 L1-dcache-loads # 925.143 M/sec ( +- 0.00% ) (99.40%)
39,067 L1-dcache-load-misses # 0.25% of all L1-dcache hits ( +- 0.04% )
6,531 LLC-loads # 0.391 M/sec ( +- 0.05% )
442 LLC-load-misses # 6.76% of all LL-cache hits ( +- 0.77% )
<not supported> L1-icache-loads
57,964 L1-icache-load-misses ( +- 0.06% )
15,442,496 dTLB-loads # 925.246 M/sec ( +- 0.00% )
514 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 0.73% ) (40.57%)
130 iTLB-loads # 0.008 M/sec ( +- 2.75% ) (16.69%)
<not counted> iTLB-load-misses ( +- 8.71% ) (0.60%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
0.0169558 +- 0.0000127 seconds time elapsed ( +- 0.07% )
Multiple xdp-perf dispatcher with mitigations=auto
--------------------------------------------------
Note that this includes generating the dispatcher.
Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):
4.80 msec task-clock # 0.953 CPUs utilized ( +- 0.06% )
1 context-switches # 0.258 K/sec ( +- 1.57% )
0 cpu-migrations # 0.000 K/sec
97 page-faults # 0.020 M/sec ( +- 0.05% )
14,185,861 cycles # 2.955 GHz ( +- 0.17% ) (50.49%)
45,691,935 instructions # 3.22 insn per cycle ( +- 0.01% ) (99.19%)
8,346,008 branches # 1738.709 M/sec ( +- 0.00% )
13,046 branch-misses # 0.16% of all branches ( +- 0.10% )
15,443,735 L1-dcache-loads # 3217.365 M/sec ( +- 0.00% )
39,585 L1-dcache-load-misses # 0.26% of all L1-dcache hits ( +- 0.05% )
7,138 LLC-loads # 1.487 M/sec ( +- 0.06% )
671 LLC-load-misses # 9.40% of all LL-cache hits ( +- 0.73% )
<not supported> L1-icache-loads
56,213 L1-icache-load-misses ( +- 0.08% )
15,443,735 dTLB-loads # 3217.365 M/sec ( +- 0.00% )
<not counted> dTLB-load-misses (0.00%)
<not counted> iTLB-loads (0.00%)
<not counted> iTLB-load-misses (0.00%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
0.00503705 +- 0.00000546 seconds time elapsed ( +- 0.11% )
Revisions
=========
v4->v5: [1]
* Fixed s/xdp_ctx/ctx/ typo (Toke)
* Marked dispatcher trampoline with noinline attribute (Alexei)
v3->v4: [2]
* Moved away from doing dispatcher lookup based on the trampoline
function, to a model where the dispatcher instance is explicitly
passed to the bpf_dispatcher_change_prog() (Alexei)
v2->v3: [3]
* Removed xdp_call, and instead make the dispatcher available to all
XDP users via bpf_prog_run_xdp() and dev_xdp_install(). (Toke)
* Always enable the dispatcher, if available (Alexei)
* Reuse BPF trampoline image allocator (Alexei)
* Make sure the dispatcher is exercised in selftests (Alexei)
* Only allow one dispatcher, and wire it to XDP
v1->v2: [4]
* Fixed i386 build warning (kbuild robot)
* Made bpf_dispatcher_lookup() static (kbuild robot)
* Make sure xdp_call.h is only enabled for builtins
* Add xdp_call() to ixgbe, mlx4, and mlx5
RFC->v1: [5]
* Improved error handling (Edward and Andrii)
* Explicit cleanup (Andrii)
* Use 32B with sext cmp (Alexei)
* Align jump targets to 16B (Alexei)
* 4 to 16 entries (Toke)
* Added stats to xdp_call_run()
[1] https://lore.kernel.org/bpf/20191211123017.13212-1-bjorn.topel@gmail.com/
[2] https://lore.kernel.org/bpf/20191209135522.16576-1-bjorn.topel@gmail.com/
[3] https://lore.kernel.org/bpf/20191123071226.6501-1-bjorn.topel@gmail.com/
[4] https://lore.kernel.org/bpf/20191119160757.27714-1-bjorn.topel@gmail.com/
[5] https://lore.kernel.org/bpf/20191113204737.31623-1-bjorn.topel@gmail.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
From Intel 64 and IA-32 Architectures Optimization Reference Manual,
3.4.1.4 Code Alignment, Assembly/Compiler Coding Rule 11: All branch
targets should be 16-byte aligned.
This commits aligns branch targets according to the Intel manual.
The nops used to align branch targets make the dispatcher larger, and
therefore the number of supported dispatch points/programs is
decreased from 64 to 48.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-7-bjorn.topel@gmail.com
The xdp_perf is a dummy XDP test, only used to measure the cost of
jumping into a naive XDP program one million times.
To build and run the program:
$ cd tools/testing/selftests/bpf
$ make
$ ./test_progs -v -t xdp_perf
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-6-bjorn.topel@gmail.com
In order to properly exercise the BPF dispatcher, this commit adds BPF
dispatcher usage to BPF_TEST_RUN when executing XDP programs.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-5-bjorn.topel@gmail.com
This commit adds a BPF dispatcher for XDP. The dispatcher is updated
from the XDP control-path, dev_xdp_install(), and used when an XDP
program is run via bpf_prog_run_xdp().
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-4-bjorn.topel@gmail.com
The BPF dispatcher is a multi-way branch code generator, mainly
targeted for XDP programs. When an XDP program is executed via the
bpf_prog_run_xdp(), it is invoked via an indirect call. The indirect
call has a substantial performance impact, when retpolines are
enabled. The dispatcher transforms indirect calls into direct calls, and
therefore avoids the retpoline. The dispatcher is generated using the
BPF JIT, and relies on text poking provided by bpf_arch_text_poke().
The dispatcher hijacks a trampoline function via its __fentry__ nop.
One dispatcher instance currently supports up to 64
dispatch points. A user creates a dispatcher with its corresponding
trampoline with the DEFINE_BPF_DISPATCHER macro.
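For illustration, a minimal sketch of defining and updating a dispatcher.
DEFINE_BPF_DISPATCHER and bpf_dispatcher_change_prog() are the interfaces
named above; BPF_DISPATCHER_PTR is assumed to reference the generated
dispatcher object, and the surrounding function is made up:

#include <linux/bpf.h>
#include <linux/filter.h>

DEFINE_BPF_DISPATCHER(example)

/* Control path: replace 'old' with 'new' as a direct-call target. Passing
 * NULL for either side adds or removes an entry; the dispatcher image is
 * regenerated and patched in via bpf_arch_text_poke(). */
static void example_swap_prog(struct bpf_prog *old, struct bpf_prog *new)
{
        bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(example), old, new);
}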
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-3-bjorn.topel@gmail.com
Refactor the image allocation in the BPF trampoline code into a
separate function, so it can be shared with the BPF dispatcher in
upcoming commits.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-2-bjorn.topel@gmail.com
It's quite common on some systems to have more CPUs enlisted as "possible"
than there are (and could ever be) present/online CPUs. In such cases,
perf_buffer creation will fail due to the inability to create a perf event
on a missing CPU, with an error like this:
libbpf: failed to open perf buffer event on cpu #16: No such device
This patch fixes the logic of perf_buffer__new() to ignore CPUs that are
missing or currently offline. In rare cases where user explicitly listed
specific CPUs to connect to, behavior is unchanged: libbpf will try to open
perf event buffer on specified CPU(s) anyways.
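For reference, the affected API in a minimal form; map_fd is assumed to
refer to a BPF_MAP_TYPE_PERF_EVENT_ARRAY set up elsewhere:

#include <linux/types.h>
#include <bpf/libbpf.h>

static void handle_sample(void *ctx, int cpu, void *data, __u32 size)
{
        /* consume one event */
}

static int consume_events(int map_fd)
{
        struct perf_buffer_opts opts = { .sample_cb = handle_sample };
        /* With this fix, possible-but-offline CPUs are skipped instead of
         * failing the whole perf_buffer__new() call. */
        struct perf_buffer *pb = perf_buffer__new(map_fd, 8 /* pages */, &opts);

        if (libbpf_get_error(pb))
                return -1;
        while (perf_buffer__poll(pb, 100 /* ms */) >= 0)
                ;
        perf_buffer__free(pb);
        return 0;
}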
Fixes: fb84b82246 ("libbpf: add perf buffer API")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212013609.1691168-1-andriin@fb.com
This logic is re-used for parsing a set of online CPUs. Having it as an
isolated piece of code working on an input string makes it convenient to test
this logic as well. While refactoring, also improve the robustness of the
original implementation.
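The shape of that logic, as a stand-alone sketch (not the exact libbpf
helper): parse a list such as "0-3,5" read from
/sys/devices/system/cpu/online into a boolean mask.

#include <stdbool.h>
#include <stdlib.h>

static int parse_cpu_list(const char *s, bool *mask, int n)
{
        while (*s && *s != '\n') {
                char *end;
                long start = strtol(s, &end, 10), stop = start;

                if (end == s)
                        return -1;      /* not a number */
                if (*end == '-') {
                        s = end + 1;
                        stop = strtol(s, &end, 10);
                        if (end == s)
                                return -1;
                }
                if (start < 0 || stop < start || stop >= n)
                        return -1;
                while (start <= stop)
                        mask[start++] = true;
                if (*end == ',')
                        end++;
                s = end;
        }
        return 0;
}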
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212013548.1690564-1-andriin@fb.com
Jakub Sitnicki says:
====================
This change has been suggested by Martin Lau [0] during a review of a
related patch set that extends reuseport tests [1].
Patches 1 & 2 address a warning due to unrecognized section name from
libbpf when running reuseport tests. We don't want to carry this warning
into test_progs.
Patches 3-8 massage the reuseport tests to ease the switch to test_progs
framework. The intention here is to show the work. Happy to squash these,
if needed.
Patches 9-10 do the actual move and conversion to test_progs.
Output from a test_progs run after changes pasted below.
Thanks,
Jakub
[0] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/T/#m607d822caeb1eb5db101172821a78cc3896ff1c3
[1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/T/#m55881bae9fb6e34837d07a0c0a7ffbc138f8d06f
====================
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The tests were originally written in abort-on-error style. With the switch
to test_progs we can no longer do that. So at the risk of not cleaning up
some resource on failure, we now return to the caller on error.
That said, failure inside one test should not affect others because we run
setup/cleanup before/after every test.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-11-jakub@cloudflare.com
Do a pure move to show the actual work needed to adapt the tests in the
subsequent patch, at the cost of breaking the test_progs build for the moment.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-10-jakub@cloudflare.com
Again, prepare for switching reuseport tests to test_progs framework.
test_progs framework will print the subtest name for us if we set it.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-9-jakub@cloudflare.com
Prepare for switching reuseport tests to test_progs framework, where we
don't have the luxury to terminate the process on failure.
Modify setup helpers to signal failure via the return value with the help
of a macro similar to the one currently in use by the tests.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-8-jakub@cloudflare.com
Prepare for switching reuseport tests to test_progs framework. Loop over
the tests and perform setup/cleanup for each test separately, remembering
that with test_progs we can select tests to run.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-7-jakub@cloudflare.com
Prepare for iterating over individual tests without introducing another
nested loop in the main test function.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-6-jakub@cloudflare.com
Having string arrays to map socket family & type to a name prevents us from
unrolling the test runner loop in the subsequent patch. Introduce helpers
that do the same thing.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-5-jakub@cloudflare.com
Now that libbpf can recognize SK_REUSEPORT programs, we no longer have to
pass a prog_type hint before loading the object file.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-3-jakub@cloudflare.com
Allow loading BPF object files that contain SK_REUSEPORT programs without
having to manually set the program type before load, if the section name
is set to "sk_reuseport".
This makes the user-space code needed to load an SK_REUSEPORT BPF program
more concise.
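A minimal program that libbpf now recognizes purely from its section name
(the selection logic itself is omitted):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sk_reuseport")
int select_socket(struct sk_reuseport_md *md)
{
        /* Real selection logic would call bpf_sk_select_reuseport() against
         * a REUSEPORT_SOCKARRAY map; here every packet is simply accepted. */
        return SK_PASS;
}

char _license[] SEC("license") = "GPL";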
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191212102259.418536-2-jakub@cloudflare.com
On ppc64le, __u64 and __s64 are defined as unsigned long int and long int,
respectively. This causes the compiler to emit warnings when %lld/%llu are
used to print 64-bit numbers. Fix this by casting to size_t/ssize_t and using
the %zu and %zd format specifiers, respectively.
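The pattern, in a small illustrative form:

#include <stdio.h>
#include <linux/types.h>

/* On ppc64le __u64 may be 'unsigned long', so %llu triggers -Wformat;
 * casting to size_t and using %zu is correct on either definition. */
static void print_map_size(__u64 sz)
{
        printf("map size: %zu bytes\n", (size_t)sz);
}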
v1->v2:
- use size_t/ssize_t instead of custom typedefs (Martin).
Fixes: 1f8e2bcb2c ("libbpf: Refactor relocation handling")
Fixes: abd29c9314 ("libbpf: allow specifying map definitions using BTF")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191212171918.638010-1-andriin@fb.com
After Spectre 2 fix via 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON
config") most major distros use BPF_JIT_ALWAYS_ON configuration these days
which compiles out the BPF interpreter entirely and always enables the
JIT. Also given recent fix in e1608f3fa8 ("bpf: Avoid setting bpf insns
pages read-only when prog is jited"), we additionally avoid fragmenting
the direct map for the BPF insns pages sitting in the general data heap
since they are not used during execution. The latter is only needed when
the program is run through the interpreter.
Since both x86 and arm64 JITs have seen a lot of exposure over the years,
are generally most up to date and maintained, there is more downside in
!BPF_JIT_ALWAYS_ON configurations to have the interpreter enabled by default
rather than the JIT. Add an ARCH_WANT_DEFAULT_BPF_JIT config which archs can
use to set bpf_jit_{enable,kallsyms} to 1. Back in the day the
bpf_jit_kallsyms knob was set to 0 by default since major distros still
had /proc/kallsyms addresses exposed to unprivileged user space which is
not the case anymore. Hence both knobs are set via BPF_JIT_DEFAULT_ON which
is set to 'y' in case of BPF_JIT_ALWAYS_ON or ARCH_WANT_DEFAULT_BPF_JIT.
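The defaults can then be derived directly from the Kconfig symbol, roughly
like this (a sketch of the pattern in kernel/bpf/core.c):

#include <linux/kconfig.h>
#include <linux/cache.h>

int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
int bpf_jit_kallsyms __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);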
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/f78ad24795c2966efcc2ee19025fa3459f622185.1575903816.git.daniel@iogearbox.net
Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.
The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.
Raw example output:
# auditctl -D
# auditctl -a always,exit -F arch=x86_64 -S bpf
# ausearch --start recent -m 1334
...
----
time->Wed Nov 27 16:04:13 2019
type=PROCTITLE msg=audit(1574867053.120:84664): proctitle="./bpf"
type=SYSCALL msg=audit(1574867053.120:84664): arch=c000003e syscall=321 \
success=yes exit=3 a0=5 a1=7ffea484fbe0 a2=70 a3=0 items=0 ppid=7477 \
pid=12698 auid=1001 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001 \
egid=1001 sgid=1001 fsgid=1001 tty=pts2 ses=4 comm="bpf" \
exe="/home/jolsa/auditd/audit-testsuite/tests/bpf/bpf" \
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=UNKNOWN[1334] msg=audit(1574867053.120:84664): prog-id=76 op=LOAD
----
time->Wed Nov 27 16:04:13 2019
type=UNKNOWN[1334] msg=audit(1574867053.120:84665): prog-id=76 op=UNLOAD
...
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20191206214934.11319-1-jolsa@kernel.org
Switch the existing pattern of "offsetof(..., member) + FIELD_SIZEOF(...,
member)" to "offsetofend(..., member)", which does exactly what
we need without all the copy-paste.
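The transformation, shown on a made-up example struct:

#include <linux/types.h>
#include <linux/stddef.h>       /* offsetofend() */
#include <linux/in.h>

/* Before:
 *   offsetof(struct sockaddr_in, sin_port) +
 *   FIELD_SIZEOF(struct sockaddr_in, sin_port)
 * After, equivalent but without the copy-paste: */
static const size_t port_end = offsetofend(struct sockaddr_in, sin_port);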
Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191210191933.105321-1-sdf@google.com
A new development cycle starts; bump to v0.0.7 proactively.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191209224022.3544519-1-andriin@fb.com
T6 has a separate region known as high priority filter region
that allows classifying packets going through ULD path. So,
query firmware for HPFILTER resources and enable the high
priority offload filter support when it is available.
Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/ethernet/freescale/enetc/enetc_qos.c: In function enetc_setup_tc_cbs:
drivers/net/ethernet/freescale/enetc/enetc_qos.c:195:6: warning: variable tc_max_sized_frame set but not used [-Wunused-but-set-variable]
Fixes: c431047c4e ("enetc: add support Credit Based Shaper(CBS) for hardware offload")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device stats are currently hard coded in the PCI BAR0 layout.
Add the ability to read them from the TLV area instead.
Names for the stats are maintained by the driver, and their
meaning documented. This allows us to more easily add and
remove device stats.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a TCP socket is created, sk->sk_state is initialized twice as
TCP_CLOSE, in sock_init_data() and again in tcp_init_sock(). tcp_init_sock()
is always called after sock_init_data(), so it is not necessary to update
sk->sk_state in tcp_init_sock().
Before v2.1.8, the code of the two functions was in inet_create(). In
the v2.1.8 patch, tcp_v4/v6_init_sock() were added and the initialization
of sk->sk_state was duplicated.
Signed-off-by: Kuniyuki Iwashima <kuni1840@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a software TX timestamp and add it to the ethtool query
interface.
skb_tx_timestamp() is also needed if one would like to use PHY
timestamping.
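A sketch of where the hook goes in a driver's xmit path (driver and function
names are made up):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *ndev)
{
        /* Report a software TX timestamp (and give an attached PHY a chance
         * to do PHY timestamping) for sockets that requested it, before the
         * descriptor is handed to the hardware. */
        skb_tx_timestamp(skb);

        /* ... fill descriptors and ring the doorbell ... */
        dev_kfree_skb_any(skb); /* placeholder for the real TX handling */
        return NETDEV_TX_OK;
}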
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy says:
====================
tipc: introduce variable window congestion control
We improve throughput greatly by introducing a variant of the Reno
congestion control algorithm at the link level.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We introduce a simple variable window congestion control for links.
The algorithm is inspired by the Reno algorithm, covering 'slow start',
'congestion avoidance', and 'fast recovery' modes (see the sketch after
this description).
- We introduce hard lower and upper window limits per link, still
different and configurable per bearer type.
- We introduce a 'slow start threshold' variable, initially set to
the maximum window size.
- We let a link start at the minimum congestion window, i.e. in slow
start mode, and then let it grow rapidly (+1 per received ACK) until
it reaches the slow start threshold and enters congestion avoidance
mode.
- In congestion avoidance mode we increment the congestion window for
each window-size number of acked packets, up to a possible maximum
equal to the configured maximum window.
- For each non-duplicate NACK received, we drop back to fast recovery
mode, by setting both the slow start threshold and the
congestion window to (current_congestion_window / 2).
- If the timeout handler finds that the transmit queue has not moved
since the previous timeout, it drops the link back to slow start
and forces a probe containing the last sent sequence number to be
sent to the peer, so that it can discover the stale situation.
This change does in reality have effect only on unicast ethernet
transport, as we have seen that there is no room whatsoever for
increasing the window max size for the UDP bearer.
For now, we also choose to keep the limits for the broadcast link
unchanged and equal.
This algorithm seems to give a 50-100% throughput improvement for
messages larger than MTU.
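A compact sketch of the window rules listed above (names and structure are
illustrative, not the actual TIPC link code):

struct cc_state {
        unsigned int min_win, max_win;  /* hard per-bearer limits */
        unsigned int cwin;              /* current congestion window */
        unsigned int ssthresh;          /* slow start threshold */
        unsigned int acked;             /* ACK counter in avoidance mode */
};

static void cc_on_ack(struct cc_state *c)
{
        if (c->cwin < c->ssthresh) {
                c->cwin++;                      /* slow start: +1 per ACK */
        } else if (++c->acked >= c->cwin) {
                c->acked = 0;                   /* congestion avoidance:  */
                if (c->cwin < c->max_win)       /* +1 per window of ACKs  */
                        c->cwin++;
        }
}

static void cc_on_nack(struct cc_state *c)
{
        /* fast recovery: halve both threshold and window, respect limits */
        c->ssthresh = c->cwin / 2;
        if (c->ssthresh < c->min_win)
                c->ssthresh = c->min_win;
        c->cwin = c->ssthresh;
}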
Suggested-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we increase the link transmit window we often observe the following
scenario:
1) A STATE message bypasses a sequence of traffic packets and arrives
far ahead of those to the receiver. STATE messages contain a
'peers_nxt_snt' field to indicate which was the last packet sent
from the peer. This mechanism is intended as a last resort for the
receiver to detect missing packets, e.g., during very low traffic
when there is no packet flow to help early loss detection.
3) The receiving link compares the 'peers_nxt_snt' field to its own
'rcv_nxt', finds that there is a gap, and immediately sends a
NACK message back to the peer.
4) When this NACK arrives at the sender, all the requested
retransmissions are performed, since it is a first-time request.
Just like in the scenario described in the previous commit this leads
to many redundant retransmissions, with decreased throughput as a
consequence.
We fix this by adding two more conditions before we send a NACK in
this situation. First, the deferred queue must be empty, so we cannot
assume that the potential packet loss has already been detected by
other means. Second, we check the 'peers_snd_nxt' field only in probe/
probe_reply messages, thus turning this into a true mechanism of last
resort as it was really meant to be.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we increase the link send window we sometimes observe the
following scenario:
1) A packet #N arrives out of order far ahead of a sequence of older
packets which are still under way. The packet is added to the
deferred queue.
2) The missing packets arrive in sequence, and for each 16th of them
an ACK is sent back to the sender, as it should be.
3) When building those ACK messages, it is checked if there is a gap
between the link's 'rcv_nxt' and the first packet in the deferred
queue. This is always the case until packet number #N-1 arrives, and
a 'gap' indicator is added, effectively turning them into NACK
messages.
4) When those NACKs arrive at the sender, all the requested
retransmissions are done, since it is a first-time request.
This sometimes leads to a huge amount of redundant retransmissions,
causing a drop in max throughput. This problem gets worse when we,
in a later commit, introduce variable window congestion control,
since it drops the link back to 'fast recovery' much more often
than necessary.
We now fix this by not sending any 'gap' indicator in regular ACK
messages. We already have a mechanism for sending explicit NACKs
in place, and this is sufficient to keep up the packet flow.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>