/*
 * Linux Socket Filter Data Structures
 */
#ifndef __LINUX_FILTER_H__
#define __LINUX_FILTER_H__

#include <stdarg.h>

#include <linux/atomic.h>
#include <linux/compat.h>
#include <linux/skbuff.h>
#include <linux/linkage.h>
#include <linux/printk.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
#include <net/sch_generic.h>

#include <asm/cacheflush.h>

#include <uapi/linux/filter.h>
#include <uapi/linux/bpf.h>

struct sk_buff;
struct sock;
struct seccomp_data;
struct bpf_prog_aux;

/* ArgX, context and stack frame pointer register positions. Note,
 * Arg1, Arg2, Arg3, etc are used as argument mappings of function
 * calls in BPF_CALL instruction.
 */
#define BPF_REG_ARG1	BPF_REG_1
#define BPF_REG_ARG2	BPF_REG_2
#define BPF_REG_ARG3	BPF_REG_3
#define BPF_REG_ARG4	BPF_REG_4
#define BPF_REG_ARG5	BPF_REG_5
#define BPF_REG_CTX	BPF_REG_6
#define BPF_REG_FP	BPF_REG_10

/* Additional register mappings for converted user programs. */
#define BPF_REG_A	BPF_REG_0
#define BPF_REG_X	BPF_REG_7
#define BPF_REG_TMP	BPF_REG_8

/* BPF program can access up to 512 bytes of stack space. */
#define MAX_BPF_STACK	512

/* Helper macros for filter block array initializers. */

/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */

#define BPF_ALU64_REG(OP, DST, SRC)				\
	((struct bpf_insn) {					\
		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = 0 })

#define BPF_ALU32_REG(OP, DST, SRC)				\
	((struct bpf_insn) {					\
		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = 0 })

/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */

#define BPF_ALU64_IMM(OP, DST, IMM)				\
	((struct bpf_insn) {					\
		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = IMM })

#define BPF_ALU32_IMM(OP, DST, IMM)				\
	((struct bpf_insn) {					\
		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = IMM })
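
/* Example (illustrative only, not part of the original header): each ALU
 * helper expands to a single struct bpf_insn initializer, e.g.
 *
 *	BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1)	// R0 += R1 (64-bit)
 *	BPF_ALU32_IMM(BPF_AND, BPF_REG_2, 0xff)		// R2 &= 0xff (32-bit variant)
 */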

/* Endianness conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */

#define BPF_ENDIAN(TYPE, DST, LEN)				\
	((struct bpf_insn) {					\
		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = LEN })
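
/* Illustrative example (not in the original header): TYPE selects the target
 * byte order and LEN is the operand width in bits, e.g.
 *
 *	BPF_ENDIAN(BPF_TO_BE, BPF_REG_3, 16)	// R3 = cpu_to_be16(R3)
 */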

/* Short form of mov, dst_reg = src_reg */

#define BPF_MOV64_REG(DST, SRC)					\
	((struct bpf_insn) {					\
		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = 0 })

#define BPF_MOV32_REG(DST, SRC)					\
	((struct bpf_insn) {					\
		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = 0 })

/* Short form of mov, dst_reg = imm32 */

#define BPF_MOV64_IMM(DST, IMM)					\
	((struct bpf_insn) {					\
		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = IMM })

#define BPF_MOV32_IMM(DST, IMM)					\
	((struct bpf_insn) {					\
		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = IMM })

/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
#define BPF_LD_IMM64(DST, IMM)					\
	BPF_LD_IMM64_RAW(DST, 0, IMM)

#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
	((struct bpf_insn) {					\
		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = (__u32) (IMM) }),			\
	((struct bpf_insn) {					\
		.code  = 0, /* zero is reserved opcode */	\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = ((__u64) (IMM)) >> 32 })

/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
#define BPF_LD_MAP_FD(DST, MAP_FD)				\
	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
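
/* Note/example (illustrative only): BPF_LD_IMM64() is the only helper that
 * expands to *two* struct bpf_insn initializers, so it occupies two slots in
 * an instruction array, e.g.
 *
 *	struct bpf_insn prog[] = {
 *		BPF_LD_IMM64(BPF_REG_0, 0x1234567890abcdefULL),	// 2 insns
 *		BPF_EXIT_INSN(),
 *	};
 *
 * BPF_LD_MAP_FD() reuses the same encoding with src_reg = BPF_PSEUDO_MAP_FD,
 * asking the verifier to replace the process-local map fd in imm with an
 * in-kernel map pointer before the program runs.
 */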

/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */

#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
	((struct bpf_insn) {					\
		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = IMM })

#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
	((struct bpf_insn) {					\
		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = IMM })

/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */

#define BPF_LD_ABS(SIZE, IMM)					\
	((struct bpf_insn) {					\
		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = IMM })

/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */

#define BPF_LD_IND(SIZE, SRC, IMM)				\
	((struct bpf_insn) {					\
		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
		.dst_reg = 0,					\
		.src_reg = SRC,					\
		.off   = 0,					\
		.imm   = IMM })
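
/* Illustrative example (not in the original header): the packet access forms
 * implicitly use the skb context held in R6 and return the loaded value in
 * R0, mirroring classic BPF's LD [k] / LD [x + k], e.g.
 *
 *	BPF_LD_ABS(BPF_H, 12)			// R0 = ntohs(*(u16 *)(skb->data + 12))
 *	BPF_LD_IND(BPF_B, BPF_REG_2, 14)	// R0 = *(u8 *)(skb->data + R2 + 14)
 */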

/* Memory load, dst_reg = *(uint *) (src_reg + off16) */

#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
	((struct bpf_insn) {					\
		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = 0 })

/* Memory store, *(uint *) (dst_reg + off16) = src_reg */

#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
	((struct bpf_insn) {					\
		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = 0 })

/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */

#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
	((struct bpf_insn) {					\
		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = 0 })

/* Memory store, *(uint *) (dst_reg + off16) = imm32 */

#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
	((struct bpf_insn) {					\
		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = OFF,					\
		.imm   = IMM })
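
/* Illustrative example (not in the original header): SIZE is one of BPF_B,
 * BPF_H, BPF_W or BPF_DW (see bytes_to_bpf_size() below), e.g.
 *
 *	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0)	// R2 = *(u32 *)(R1 + 0)
 *	BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_2, -8)	// *(u64 *)(FP - 8) = R2
 *	BPF_ST_MEM(BPF_W, BPF_REG_FP, -16, 42)		// *(u32 *)(FP - 16) = 42
 */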

/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */

#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
	((struct bpf_insn) {					\
		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = 0 })

/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */

#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
	((struct bpf_insn) {					\
		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
		.dst_reg = DST,					\
		.src_reg = 0,					\
		.off   = OFF,					\
		.imm   = IMM })
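
/* Illustrative example (not in the original header): the jump offset is
 * counted in instructions relative to the *next* instruction, so the taken
 * branch skips 'off' insns and an untaken condition simply falls through:
 *
 *	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2)	// if (R0 != 0) skip next 2 insns
 */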

/* Function call */

#define BPF_EMIT_CALL(FUNC)					\
	((struct bpf_insn) {					\
		.code  = BPF_JMP | BPF_CALL,			\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = ((FUNC) - __bpf_call_base) })
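
/* Note (illustrative, not part of the original header): BPF_EMIT_CALL()
 * stores the callee address relative to __bpf_call_base in imm, so the
 * runtime can recover it as __bpf_call_base + imm. By the eBPF calling
 * convention, R1-R5 carry the arguments and R0 receives the return value.
 */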

/* Raw code statement block */

#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
	((struct bpf_insn) {					\
		.code  = CODE,					\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off   = OFF,					\
		.imm   = IMM })

/* Program exit */

#define BPF_EXIT_INSN()						\
	((struct bpf_insn) {					\
		.code  = BPF_JMP | BPF_EXIT,			\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off   = 0,					\
		.imm   = 0 })
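
/* Putting the helpers together -- an illustrative (not original) minimal
 * program that unconditionally returns 1:
 *
 *	struct bpf_insn prog[] = {
 *		BPF_MOV64_IMM(BPF_REG_0, 1),	// R0 = 1
 *		BPF_EXIT_INSN(),		// return R0
 *	};
 */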

/* Internal classic blocks for direct assignment */

#define __BPF_STMT(CODE, K)					\
	((struct sock_filter) BPF_STMT(CODE, K))

#define __BPF_JUMP(CODE, K, JT, JF)				\
	((struct sock_filter) BPF_JUMP(CODE, K, JT, JF))
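
/* Illustrative example (not in the original header): these wrap the classic
 * BPF_STMT()/BPF_JUMP() initializers from uapi/linux/filter.h with a cast so
 * they can be assigned directly, e.g. a classic filter accepting everything:
 *
 *	struct sock_filter insns[] = {
 *		__BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
 *	};
 */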

#define bytes_to_bpf_size(bytes)				\
({								\
	int bpf_size = -EINVAL;					\
								\
	if (bytes == sizeof(u8))				\
		bpf_size = BPF_B;				\
	else if (bytes == sizeof(u16))				\
		bpf_size = BPF_H;				\
	else if (bytes == sizeof(u32))				\
		bpf_size = BPF_W;				\
	else if (bytes == sizeof(u64))				\
		bpf_size = BPF_DW;				\
								\
	bpf_size;						\
})
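
/* Usage sketch (illustrative only): maps an access width in bytes to the
 * matching BPF_SIZE() encoding, or -EINVAL for unsupported widths, e.g.
 *
 *	bytes_to_bpf_size(sizeof(u32))	== BPF_W
 *	bytes_to_bpf_size(3)		== -EINVAL
 */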

#ifdef CONFIG_COMPAT
/* A struct sock_filter is architecture independent. */
struct compat_sock_fprog {
	u16		len;
	compat_uptr_t	filter;	/* struct sock_filter * */
};
#endif

struct sock_fprog_kern {
	u16			len;
	struct sock_filter	*filter;
};

struct bpf_binary_header {
	unsigned int pages;
	u8 image[];
};

struct bpf_prog {
	u16			pages;		/* Number of allocated pages */
	kmemcheck_bitfield_begin(meta);
	u16			jited:1,	/* Is our filter JIT'ed? */
				gpl_compatible:1, /* Is filter GPL compatible? */
				cb_access:1,	/* Is control block accessed? */
				dst_needed:1;	/* Do we need dst entry? */
	kmemcheck_bitfield_end(meta);
	u32			len;		/* Number of filter blocks */
	enum bpf_prog_type	type;		/* Type of BPF program */
	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
	unsigned int		(*bpf_func)(const struct sk_buff *skb,
					    const struct bpf_insn *filter);
	/* Instructions for interpreter */
	union {
		struct sock_filter	insns[0];
		struct bpf_insn		insnsi[0];
	};
};

struct sk_filter {
	atomic_t	refcnt;
	struct rcu_head	rcu;
	struct bpf_prog	*prog;
};

#define BPF_PROG_RUN(filter, ctx)  (*filter->bpf_func)(ctx, filter->insnsi)
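
/* Illustrative usage (not part of the original header): BPF_PROG_RUN() simply
 * dispatches through prog->bpf_func, which is either the interpreter or a
 * JIT-generated image, e.g.
 *
 *	u32 verdict = BPF_PROG_RUN(filter->prog, skb);
 */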

#define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN

static inline u8 *bpf_skb_cb(struct sk_buff *skb)
{
	/* eBPF programs may read/write skb->cb[] area to transfer meta
	 * data between tail calls. Since this also needs to work with
	 * tc, that scratch memory is mapped to qdisc_skb_cb's data area.
	 *
	 * In some socket filter cases, the cb unfortunately needs to be
	 * saved/restored so that protocol specific skb->cb[] data won't
	 * be lost. In any case, due to unprivileged eBPF programs
	 * attached to sockets, we need to clear the bpf_skb_cb() area
	 * to not leak previous contents to user space.
	 */
	BUILD_BUG_ON(FIELD_SIZEOF(struct __sk_buff, cb) != BPF_SKB_CB_LEN);
	BUILD_BUG_ON(FIELD_SIZEOF(struct __sk_buff, cb) !=
		     FIELD_SIZEOF(struct qdisc_skb_cb, data));

	return qdisc_skb_cb(skb)->data;
}

static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog,
				       struct sk_buff *skb)
{
	u8 *cb_data = bpf_skb_cb(skb);
	u8 cb_saved[BPF_SKB_CB_LEN];
	u32 res;

	if (unlikely(prog->cb_access)) {
		memcpy(cb_saved, cb_data, sizeof(cb_saved));
		memset(cb_data, 0, sizeof(cb_saved));
	}

	res = BPF_PROG_RUN(prog, skb);

	if (unlikely(prog->cb_access))
		memcpy(cb_data, cb_saved, sizeof(cb_saved));

	return res;
}

static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
					struct sk_buff *skb)
{
	u8 *cb_data = bpf_skb_cb(skb);

	if (unlikely(prog->cb_access))
		memset(cb_data, 0, BPF_SKB_CB_LEN);

	return BPF_PROG_RUN(prog, skb);
}
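
/* Note (illustrative summary, not part of the original header): both wrappers
 * guard the skb->cb[] scratch area described in bpf_skb_cb().
 * bpf_prog_run_save_cb() preserves and restores the existing cb[] contents
 * around the program run (for paths where protocol data in cb[] must
 * survive), while bpf_prog_run_clear_cb() only zeroes it beforehand. Both
 * skip the extra work unless prog->cb_access is set.
 */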
|
|
|
|
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're along the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
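
A minimal sketch of the 'unattached' API just described (illustrative only, not
part of this header): a kernel user builds a classic filter, lets
bpf_prog_create() verify and convert it, runs it with BPF_PROG_RUN() and
releases it with bpf_prog_destroy(). The accept-all filter and the example_*
name are made up; ARRAY_SIZE() from <linux/kernel.h> is assumed.

static int example_run_unattached(struct sk_buff *skb)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, 0xffffffff),	/* accept packet */
	};
	struct sock_fprog_kern fprog = {
		.len	= ARRAY_SIZE(insns),
		.filter	= insns,
	};
	struct bpf_prog *prog;
	u32 res;
	int err;

	err = bpf_prog_create(&prog, &fprog);	/* check + select runtime */
	if (err)
		return err;

	res = BPF_PROG_RUN(prog, skb);		/* returns the filter verdict */
	bpf_prog_destroy(prog);

	return res ? 0 : -EPERM;
}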
|
|
|
static inline unsigned int bpf_prog_size(unsigned int proglen)
|
2008-04-10 16:33:47 +08:00
|
|
|
{
|
2014-07-31 11:34:16 +08:00
|
|
|
return max(sizeof(struct bpf_prog),
|
|
|
|
offsetof(struct bpf_prog, insns[proglen]));
|
2008-04-10 16:33:47 +08:00
|
|
|
}
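
A small sketch (illustrative, not from this file) of what the sizing helper
above is for: an allocation must cover struct bpf_prog plus the instruction
array, whichever is larger. bpf_classic_proglen() and bpf_prog_alloc() appear
further down in this header; the example_* name is made up.

static struct bpf_prog *example_alloc_classic(struct sock_fprog_kern *fprog)
{
	struct bpf_prog *fp;

	/* room for struct bpf_prog plus fprog->len classic insns */
	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
	if (!fp)
		return NULL;

	fp->len = fprog->len;
	memcpy(fp->insns, fprog->filter, bpf_classic_proglen(fprog));

	return fp;
}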
|
|
|
|
|
2015-07-30 18:42:47 +08:00
|
|
|
static inline bool bpf_prog_was_classic(const struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
/* When classic BPF programs have been loaded and the arch
|
|
|
|
* does not have a classic BPF JIT (anymore), they have been
|
|
|
|
* converted via bpf_migrate_filter() to eBPF and thus always
|
|
|
|
* have an unspec program type.
|
|
|
|
*/
|
|
|
|
return prog->type == BPF_PROG_TYPE_UNSPEC;
|
|
|
|
}
|
|
|
|
|
2014-07-31 11:34:13 +08:00
|
|
|
#define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space without the need for a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both the sk_get_filter()
and sock_diag_put_filterinfo() handlers, as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29 01:58:19 +08:00
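
To illustrate why the original program is kept around (a hedged sketch, not the
actual sk_get_filter() implementation; locking is omitted, <linux/uaccess.h> is
assumed and the example_* name is made up): the saved classic insns can be
copied straight back to user space without any decoding.

static int example_get_orig_filter(struct sk_filter *filter,
				   struct sock_filter __user *ubuf,
				   unsigned int len)
{
	struct sock_fprog_kern *fprog = filter->prog->orig_prog;

	if (!fprog)
		return 0;		/* nothing stored, e.g. unattached case */
	if (!len)
		return fprog->len;	/* query mode: report length only */
	if (len < fprog->len)
		return -EINVAL;
	if (copy_to_user(ubuf, fprog->filter, bpf_classic_proglen(fprog)))
		return -EFAULT;

	return fprog->len;
}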
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
#ifdef CONFIG_DEBUG_SET_MODULE_RONX
|
|
|
|
static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
set_memory_ro((unsigned long)fp, fp->pages);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
set_memory_rw((unsigned long)fp, fp->pages);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_DEBUG_SET_MODULE_RONX */
|
|
|
|
|
2014-03-29 01:58:20 +08:00
|
|
|
int sk_filter(struct sock *sk, struct sk_buff *skb);
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled more closely
on native instruction sets and is designed to be JITed with a one-to-one
mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increases from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters, generated by libpcap and bpf_asm respectively,
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, seccomp BPF also experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29 01:58:25 +08:00
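
As a tiny, hedged illustration of the internal format described above (assuming
the BPF_MOV64_IMM()/BPF_EXIT_INSN() insn macros defined earlier in this header
are available), this two-instruction eBPF program simply returns 1, i.e.
"accept"; the example_* name is made up.

static const struct bpf_insn example_accept_prog[] = {
	BPF_MOV64_IMM(BPF_REG_0, 1),	/* r0 = 1: the return value */
	BPF_EXIT_INSN(),		/* return r0 */
};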
|
|
|
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail is that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are identified by file descriptor, user space
needs to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops, therefore
tail_call_cnt is used to limit the number of calls and is currently set to 32.
Use cases:
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for the TC use case, bpf_tail_call() allows implementing reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay a performance penalty for this feature and
the tail call itself would be slower, since a mandatory stack unwind, return,
and stack allocation would be done for every tail call.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that the prog_type of callee and caller
are the same and that the JITed/non-JITed flag is the same, since calling a JITed
program from a non-JITed one is invalid because the stack frames are different.
Similarly, calling a kprobe type program from a socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
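
From the user-space side (a hedged sketch, not kernel code; bpf_create_map()
and bpf_map_update_elem() stand for the usual bpf(2) syscall wrappers provided
by a loader library, and <linux/bpf.h> supplies BPF_MAP_TYPE_PROG_ARRAY and
BPF_ANY), populating the jump table so that bpf_tail_call(ctx, &jmp_table, idx)
has a target might look like:

#include <linux/bpf.h>

static int example_setup_tail_call(int callee_prog_fd)
{
	__u32 idx = 0;			/* slot used as the tail-call index */
	int jmp_table_fd;

	/* 4-byte keys and values; the values are bpf program fds */
	jmp_table_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY,
				      sizeof(__u32), sizeof(__u32), 16, 0);
	if (jmp_table_fd < 0)
		return jmp_table_fd;

	return bpf_map_update_elem(jmp_table_fd, &idx, &callee_prog_fd, BPF_ANY);
}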
|
|
|
int bpf_prog_select_runtime(struct bpf_prog *fp);
|
2014-07-31 11:34:16 +08:00
|
|
|
void bpf_prog_free(struct bpf_prog *fp);
|
2014-03-29 01:58:25 +08:00
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags);
|
|
|
|
struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
|
|
|
|
gfp_t gfp_extra_flags);
|
|
|
|
void __bpf_prog_free(struct bpf_prog *fp);
|
|
|
|
|
|
|
|
static inline void bpf_prog_unlock_free(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
bpf_prog_unlock_ro(fp);
|
|
|
|
__bpf_prog_free(fp);
|
|
|
|
}
|
|
|
|
|
2015-05-06 22:12:30 +08:00
|
|
|
typedef int (*bpf_aux_classic_check_t)(struct sock_filter *filter,
|
|
|
|
unsigned int flen);
|
|
|
|
|
2014-07-31 11:34:16 +08:00
|
|
|
int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
|
2015-05-06 22:12:30 +08:00
|
|
|
int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
|
2015-10-02 21:17:33 +08:00
|
|
|
bpf_aux_classic_check_t trans, bool save_orig);
|
2014-07-31 11:34:16 +08:00
|
|
|
void bpf_prog_destroy(struct bpf_prog *fp);
|
2014-03-29 01:58:19 +08:00
|
|
|
|
2014-03-29 01:58:20 +08:00
|
|
|
int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
|
2016-03-31 08:13:18 +08:00
|
|
|
int __sk_attach_filter(struct sock_fprog *fprog, struct sock *sk,
|
|
|
|
bool locked);
|
2014-12-02 07:06:35 +08:00
|
|
|
int sk_attach_bpf(u32 ufd, struct sock *sk);
|
2016-01-05 06:41:47 +08:00
|
|
|
int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk);
|
|
|
|
int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk);
|
2014-03-29 01:58:20 +08:00
|
|
|
int sk_detach_filter(struct sock *sk);
|
2016-03-31 08:13:18 +08:00
|
|
|
int __sk_detach_filter(struct sock *sk, bool locked);
|
|
|
|
|
2014-03-29 01:58:20 +08:00
|
|
|
int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
|
|
|
|
unsigned int len);
|
|
|
|
|
2014-07-31 11:34:12 +08:00
|
|
|
bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
|
2014-03-29 01:58:20 +08:00
|
|
|
void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
|
2011-04-20 17:27:32 +08:00
|
|
|
|
net: filter: x86: internal BPF JIT
Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.
Performance:
1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
No performance difference is observed for filters that were JIT-able before
Example assembler code for BPF filter "tcpdump port 22"
original BPF -> old JIT: original BPF -> internal BPF -> new JIT:
0: push %rbp 0: push %rbp
1: mov %rsp,%rbp 1: mov %rsp,%rbp
4: sub $0x60,%rsp 4: sub $0x228,%rsp
8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue
12: mov %r13,-0x220(%rbp)
19: mov %r14,-0x218(%rbp)
20: mov %r15,-0x210(%rbp)
27: xor %eax,%eax // clear A
c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X
e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d
12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d
16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10
3b: mov %rdi,%rbx
1d: mov $0xc,%esi 3e: mov $0xc,%esi
22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75
27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax
2c: jne 0x0000000000000069 4f: jne 0x000000000000009a
2e: mov $0x14,%esi 51: mov $0x14,%esi
33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91
38: cmp $0x84,%eax 5b: cmp $0x84,%rax
3d: je 0x0000000000000049 62: je 0x0000000000000074
3f: cmp $0x6,%eax 64: cmp $0x6,%rax
42: je 0x0000000000000049 68: je 0x0000000000000074
44: cmp $0x11,%eax 6a: cmp $0x11,%rax
47: jne 0x00000000000000c6 6e: jne 0x0000000000000117
49: mov $0x36,%esi 74: mov $0x36,%esi
4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75
53: cmp $0x16,%eax 7e: cmp $0x16,%rax
56: je 0x00000000000000bf 82: je 0x0000000000000110
58: mov $0x38,%esi 88: mov $0x38,%esi
5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75
62: cmp $0x16,%eax 92: cmp $0x16,%rax
65: je 0x00000000000000bf 96: je 0x0000000000000110
67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117
69: cmp $0x800,%eax 9a: cmp $0x800,%rax
6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117
70: mov $0x17,%esi a3: mov $0x17,%esi
75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91
7a: cmp $0x84,%eax ad: cmp $0x84,%rax
7f: je 0x000000000000008b b4: je 0x00000000000000c2
81: cmp $0x6,%eax b6: cmp $0x6,%rax
84: je 0x000000000000008b ba: je 0x00000000000000c2
86: cmp $0x11,%eax bc: cmp $0x11,%rax
89: jne 0x00000000000000c6 c0: jne 0x0000000000000117
8b: mov $0x14,%esi c2: mov $0x14,%esi
90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75
95: test $0x1fff,%ax cc: test $0x1fff,%rax
99: jne 0x00000000000000c6 d3: jne 0x0000000000000117
d5: mov %rax,%r14
9b: mov $0xe,%esi d8: mov $0xe,%esi
a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH
e2: and $0xf,%eax
e5: shl $0x2,%eax
e8: mov %rax,%r13
eb: mov %r14,%rax
ee: mov %r13,%rsi
a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi
a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d
ad: cmp $0x16,%eax f9: cmp $0x16,%rax
b0: je 0x00000000000000bf fd: je 0x0000000000000110
ff: mov %r13,%rsi
b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi
b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d
ba: cmp $0x16,%eax 10a: cmp $0x16,%rax
bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117
bf: mov $0xffff,%eax 110: mov $0xffff,%eax
c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c
c6: xor %eax,%eax 117: mov $0x0,%eax
c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue
cc: leaveq 123: mov -0x220(%rbp),%r13
cd: retq 12a: mov -0x218(%rbp),%r14
131: mov -0x210(%rbp),%r15
138: leaveq
139: retq
On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.
The difference in generated assembler is due to the following:
Old BPF implements the LDX_MSH instruction via the sk_load_byte_msh() helper
function inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.
New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.
Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.
2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
extensions. New JIT supports all BPF extensions.
Performance of such filters improves 2-4 times depending on the filter.
The longer the filter, the higher the performance gain.
Synthetic benchmarks with many ancillary loads see a 20x speedup,
which seems to be the maximum gain from the JIT.
Notes:
. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
and can be used to see generated assembler
. there are two jit_compile() functions and code flow for classic filters is:
sk_attach_filter() - load classic BPF
bpf_jit_compile() - try to JIT from classic BPF
sk_convert_filter() - convert classic to internal
bpf_int_jit_compile() - JIT from internal BPF
seccomp and tracing filters will just call bpf_int_jit_compile()
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-14 10:50:46 +08:00
|
|
|
u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
|
2014-07-31 11:34:16 +08:00
|
|
|
void bpf_int_jit_compile(struct bpf_prog *fp);
|
2015-07-21 11:34:18 +08:00
|
|
|
bool bpf_helper_changes_skb_data(void *func);
|
2014-05-14 10:50:46 +08:00
|
|
|
|
2014-09-10 21:01:02 +08:00
|
|
|
#ifdef CONFIG_BPF_JIT
|
|
|
|
typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
|
|
|
|
|
|
|
|
struct bpf_binary_header *
|
|
|
|
bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
|
|
|
|
unsigned int alignment,
|
|
|
|
bpf_jit_fill_hole_t bpf_fill_ill_insns);
|
|
|
|
void bpf_jit_binary_free(struct bpf_binary_header *hdr);
|
|
|
|
|
|
|
|
void bpf_jit_compile(struct bpf_prog *fp);
|
|
|
|
void bpf_jit_free(struct bpf_prog *fp);
|
|
|
|
|
|
|
|
static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
|
|
|
|
u32 pass, void *image)
|
|
|
|
{
|
2015-07-30 18:42:49 +08:00
|
|
|
pr_err("flen=%u proglen=%u pass=%u image=%pK from=%s pid=%d\n", flen,
|
|
|
|
proglen, pass, image, current->comm, task_pid_nr(current));
|
|
|
|
|
2014-09-10 21:01:02 +08:00
|
|
|
if (image)
|
|
|
|
print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
|
|
|
|
16, 1, image, proglen, false);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline void bpf_jit_compile(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void bpf_jit_free(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
bpf_prog_unlock_free(fp);
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BPF_JIT */
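
A condensed, hedged sketch of how an arch JIT typically ties the CONFIG_BPF_JIT
pieces above together (illustrative only: the emit step and fill_ill_insns are
arch-specific, the example_* name is made up, and the bpf_jit_enable sysctl
knob is assumed to be visible here):

#ifdef CONFIG_BPF_JIT
static void example_jit_finish(struct bpf_prog *fp, unsigned int proglen,
			       u32 pass, bpf_jit_fill_hole_t fill_ill_insns)
{
	struct bpf_binary_header *header;
	u8 *image;

	header = bpf_jit_binary_alloc(proglen, &image, 4, fill_ill_insns);
	if (!header)
		return;			/* no image: fall back to interpreter */

	/* ... the arch back end would emit instructions into 'image' here ... */

	if (bpf_jit_enable > 1)		/* sysctl set to 2: dump the image */
		bpf_jit_dump(fp->len, proglen, pass, image);

	fp->bpf_func = (void *)image;
	fp->jited = 1;
}
#endif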
|
|
|
|
|
net: filter: get rid of BPF_S_* enum
This patch finally allows us to get rid of the BPF_S_* enum.
Currently, the code performs unnecessary encode and decode
workarounds in seccomp and filter migration itself when a filter
is being attached, in order to overcome the BPF_S_* encoding which
is no longer used by the new interpreter or the JIT compilers.
Keeping it around would mean that in the future we would also need
to extend and maintain this enum and related encoders/decoders.
We can get rid of all that and save us these operations during
filter attaching. Naturally, also JIT compilers need to be updated
by this.
Before JIT conversion is being done, each compiler checks if A
is being loaded at startup to obtain information if it needs to
emit instructions to clear A first. Since BPF extensions are a
subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
for extensions can be removed at that point. To ease and minimize
code changes in the classic JITs, we have introduced bpf_anc_helper().
Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
arm (JIT, int), i386 (int), ppc64 (JIT, int); for sparc we
unfortunately didn't have access, but changes are analogous to
the rest.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Chema Gonzalez <chemag@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-29 16:22:50 +08:00
|
|
|
#define BPF_ANC BIT(15)
|
|
|
|
|
2016-01-05 23:23:07 +08:00
|
|
|
static inline bool bpf_needs_clear_a(const struct sock_filter *first)
|
|
|
|
{
|
|
|
|
switch (first->code) {
|
|
|
|
case BPF_RET | BPF_K:
|
|
|
|
case BPF_LD | BPF_W | BPF_LEN:
|
|
|
|
return false;
|
|
|
|
|
|
|
|
case BPF_LD | BPF_W | BPF_ABS:
|
|
|
|
case BPF_LD | BPF_H | BPF_ABS:
|
|
|
|
case BPF_LD | BPF_B | BPF_ABS:
|
|
|
|
if (first->k == SKF_AD_OFF + SKF_AD_ALU_XOR_X)
|
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
|
|
|
|
default:
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
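
For example, a classic BPF JIT's prologue generation might consult the helper
above before emitting a clear of the A register (hedged sketch; the example_*
functions are made-up stand-ins for arch-specific code emission):

/* Hypothetical emit hook, for illustration only. */
static void example_emit_clear_a(void)
{
	/* an arch JIT would emit e.g. 'xor A, A' here */
}

static void example_classic_jit_prologue(const struct bpf_prog *fp)
{
	if (bpf_needs_clear_a(&fp->insns[0]))
		example_emit_clear_a();
}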
|
|
|
|
|
2014-05-29 16:22:50 +08:00
|
|
|
static inline u16 bpf_anc_helper(const struct sock_filter *ftest)
|
|
|
|
{
|
|
|
|
BUG_ON(ftest->code & BPF_ANC);
|
|
|
|
|
|
|
|
switch (ftest->code) {
|
|
|
|
case BPF_LD | BPF_W | BPF_ABS:
|
|
|
|
case BPF_LD | BPF_H | BPF_ABS:
|
|
|
|
case BPF_LD | BPF_B | BPF_ABS:
|
|
|
|
#define BPF_ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
|
|
|
|
return BPF_ANC | SKF_AD_##CODE
|
|
|
|
switch (ftest->k) {
|
|
|
|
BPF_ANCILLARY(PROTOCOL);
|
|
|
|
BPF_ANCILLARY(PKTTYPE);
|
|
|
|
BPF_ANCILLARY(IFINDEX);
|
|
|
|
BPF_ANCILLARY(NLATTR);
|
|
|
|
BPF_ANCILLARY(NLATTR_NEST);
|
|
|
|
BPF_ANCILLARY(MARK);
|
|
|
|
BPF_ANCILLARY(QUEUE);
|
|
|
|
BPF_ANCILLARY(HATYPE);
|
|
|
|
BPF_ANCILLARY(RXHASH);
|
|
|
|
BPF_ANCILLARY(CPU);
|
|
|
|
BPF_ANCILLARY(ALU_XOR_X);
|
|
|
|
BPF_ANCILLARY(VLAN_TAG);
|
|
|
|
BPF_ANCILLARY(VLAN_TAG_PRESENT);
|
|
|
|
BPF_ANCILLARY(PAY_OFFSET);
|
|
|
|
BPF_ANCILLARY(RANDOM);
|
2015-03-24 21:48:41 +08:00
|
|
|
BPF_ANCILLARY(VLAN_TPID);
|
2014-05-29 16:22:50 +08:00
|
|
|
}
|
|
|
|
/* Fallthrough. */
|
|
|
|
default:
|
|
|
|
return ftest->code;
|
|
|
|
}
|
|
|
|
}
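
Classic JITs typically re-dispatch on the helper's return value, which is
either BPF_ANC | SKF_AD_* for an ancillary load or the unchanged opcode
(hedged sketch; the emit comments and the example_* name stand in for
arch-specific code generation):

static int example_classic_jit_insn(const struct sock_filter *insn)
{
	u16 code = bpf_anc_helper(insn);

	switch (code) {
	case BPF_ANC | SKF_AD_PROTOCOL:
		/* emit a load of skb->protocol into A */
		return 0;
	case BPF_LD | BPF_W | BPF_ABS:
		/* emit an absolute 4-byte packet load */
		return 0;
	default:
		/* remaining opcodes handled elsewhere */
		return -EINVAL;
	}
}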
|
|
|
|
|
2014-07-03 22:56:54 +08:00
|
|
|
void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb,
|
|
|
|
int k, unsigned int size);
|
|
|
|
|
|
|
|
static inline void *bpf_load_pointer(const struct sk_buff *skb, int k,
|
|
|
|
unsigned int size, void *buffer)
|
|
|
|
{
|
|
|
|
if (k >= 0)
|
|
|
|
return skb_header_pointer(skb, k, size, buffer);
|
|
|
|
|
|
|
|
return bpf_internal_load_pointer_neg_helper(skb, k, size);
|
|
|
|
}
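
For instance, a 2-byte load at offset k (positive, or one of the negative
SKF_* offsets) would go through the helper above with a small stack buffer for
the non-linear case (hedged sketch; <asm/unaligned.h> is assumed and the
example_* name is made up):

static inline int example_load_half(const struct sk_buff *skb, int k, u16 *val)
{
	u16 tmp, *ptr;

	ptr = bpf_load_pointer(skb, k, sizeof(tmp), &tmp);
	if (!ptr)
		return -EFAULT;		/* offset out of range */

	*val = get_unaligned_be16(ptr);	/* packet data is big endian */
	return 0;
}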
|
|
|
|
|
2014-01-18 00:09:45 +08:00
|
|
|
static inline int bpf_tell_extensions(void)
|
|
|
|
{
|
net: filter: let bpf_tell_extensions return SKF_AD_MAX
Michal Sekletar added in commit ea02f9411d9f ("net: introduce
SO_BPF_EXTENSIONS") a facility where user space can enquire
the BPF ancillary instruction set, which is imho a step in
the right direction for letting user-space high-level-to-BPF
optimizers make an informed decision about possibly using these
extensions.
The original rationale was to return through a getsockopt(2)
a bitfield of which instructions are supported and which
are not, as of right now, we just return 0 to indicate a
base support for SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET.
Limitations of this approach are that this API, which we need
to maintain for a long time, can only support a maximum of 32
extensions, and needs to be additionally maintained/updated
each time a new extension comes in.
I thought about this a bit more and what we can do here to
overcome this is to just return SKF_AD_MAX. Since we never
remove any extension since we cannot break user space and
always linearly increase SKF_AD_MAX on each newly added
extension, user space can make a decision on what extensions
are supported in the whole set of extensions and which aren't,
by just checking which of them from the whole set have an
offset < SKF_AD_MAX of the underlying kernel.
Since SKF_AD_MAX must be updated each time we add new ones,
we don't need to introduce an additional enum and we get
maintenance for free.
SO_BPF_EXTENSIONS becomes ubiquitous for most kernels, then
an application can simply make use of this and easily be run
on newer or older underlying kernels without needing to be
recompiled, of course. Since that is for 3.14, it's not too
late to do this change.
Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Michal Sekletar <msekleta@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 07:19:37 +08:00
|
|
|
return SKF_AD_MAX;
|
2014-01-18 00:09:45 +08:00
|
|
|
}
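
From user space, the value reported via SO_BPF_EXTENSIONS can then be compared
against individual SKF_AD_* offsets, as the commit message above suggests
(hedged user-space sketch, not kernel code; SO_BPF_EXTENSIONS is assumed to be
visible through the system's socket headers and the example_* name is made up):

/* user-space example */
#include <sys/socket.h>
#include <linux/filter.h>

static int example_has_extension(int sock_fd, int skf_ad_off)
{
	int max = 0;
	socklen_t len = sizeof(max);

	if (getsockopt(sock_fd, SOL_SOCKET, SO_BPF_EXTENSIONS, &max, &len))
		return 0;		/* old kernel: no way to tell */

	return skf_ad_off < max;	/* e.g. pass SKF_AD_VLAN_TAG */
}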
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* __LINUX_FILTER_H__ */
|