OpenCloudOS-Kernel

History

Alexei Starovoitov bd4cf0ed33 net: filter: rework/optimize internal BPF interpreter's instruction set

This patch replaces/reworks the kernel-internal BPF interpreter with an optimized BPF instruction set format that is modelled closer to mimic native instruction sets and is designed to be JITed with one to one mapping. Thus, the new interpreter is noticeably faster than the current implementation of sk_run_filter(); mainly for two reasons:

1. Fall-through jumps:

   BPF jump instructions are forced to go either 'true' or 'false' branch which causes branch-miss penalty. The new BPF jump instructions have only one branch and fall-through otherwise, which fits the CPU branch predictor logic better. `perf stat` shows drastic difference for branch-misses between the old and new code.

2. Jump-threaded implementation of interpreter vs switch statement:

   Instead of single table-jump at the top of 'switch' statement, gcc will now generate multiple table-jump instructions, which helps CPU branch predictor logic.

Note that the verification of filters is still being done through sk_chk_filter() in classical BPF format, so filters from user- or kernel space are verified in the same way as we do now, and same restrictions/constraints hold as well.

We reuse current BPF JIT compilers in a way that this upgrade would even be fine as is, but nevertheless allows for a successive upgrade of BPF JIT compilers to the new format.

The internal instruction set migration is being done after the probing for JIT compilation, so in case JIT compilers are able to create a native opcode image, we're going to use that, and in all other cases we're doing a follow-up migration of the BPF program's instruction set, so that it can be transparently run in the new interpreter.

In short, the internal format extends BPF in the following way (more details can be taken from the appended documentation):

  - Number of registers increase from 2 to 10
  - Register width increases from 32-bit to 64-bit
  - Conditional jt/jf targets replaced with jt/fall-through
  - Adds signed > and >= insns
  - 16 4-byte stack slots for register spill-fill replaced with up to 512 bytes of multi-use stack space
  - Introduction of bpf_call insn and register passing convention for zero overhead calls from/to other kernel functions
  - Adds arithmetic right shift and endianness conversion insns
  - Adds atomic_add insn
  - Old tax/txa insns are replaced with 'mov dst,src' insn

Performance of two BPF filters generated by libpcap resp. bpf_asm was measured on x86_64, i386 and arm32 (other libpcap programs have similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
  tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
  tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call, smaller is better:

--x86_64--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF         90        101         192        202
new BPF         31         71          47         97
old BPF jit     12         34          17         44
new BPF jit    TBD

--i386--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF        107        136         227        252
new BPF         40        119          69        172

--arm32--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF        202        300         475        540
new BPF        180        270         330        470
old BPF jit     26        182          37        202
new BPF jit    TBD

Thus, without changing any userland BPF filters, applications on top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf classifier, netfilter's xt_bpf, team driver's load-balancing mode, and many more will have better interpreter filtering performance.

While we are replacing the internal BPF interpreter, we also need to convert seccomp BPF in the same step to make use of the new internal structure since it makes use of lower-level API details without being further decoupled through higher-level calls like sk_unattached_filter_{create,destroy}(), for example.

Just as for normal socket filtering, also seccomp BPF experiences a time-to-verdict speedup:

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...

  rc = seccomp_load(ctx);

  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

'short filter' has 2 rules
'large filter' has 200 rules

'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no difference on arm32

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.58 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter
old BPF: 8.6 seconds
 73.38%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.70%  bench  libc-2.15.so       [.] syscall
  5.09%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.97%  bench  [kernel.kallsyms]  [k] system_call
new BPF: 5.7 seconds
 66.20%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 16.75%  bench  libc-2.15.so       [.] syscall
  3.31%  bench  [kernel.kallsyms]  [k] system_call
  2.88%  bench  [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new internal BPF performance is better than the old one. Performance gains will be even higher when BPF JIT is committed for the new structure, which is planned in future work (as successive JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2014-03-31 00:45:09 -04:00
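
The two speedup reasons named in the commit message (fall-through jumps and a jump-threaded interpreter) can be illustrated with a small sketch. This is not the kernel's code: the opcodes, `struct toy_insn`, and `toy_run()` are hypothetical names invented for illustration, and the toy machine has one accumulator rather than ten registers. It assumes a compiler with the GCC "labels as values" extension (also supported by Clang), and it assumes the program ends with `OP_EXIT`, since nothing here plays the verifier role of sk_chk_filter().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes; not the kernel's internal BPF encoding. */
enum { OP_MOV_IMM, OP_ADD_IMM, OP_JEQ_IMM, OP_EXIT, OP_MAX };

struct toy_insn {
	uint8_t  op;   /* one of the OP_* values above */
	int16_t  off;  /* relative jump offset, used by OP_JEQ_IMM only */
	uint32_t imm;  /* immediate operand */
};

/*
 * Run a toy program on a single 64-bit accumulator and return its value
 * when OP_EXIT is reached. The caller must pass a well-formed program
 * that terminates with OP_EXIT; no bounds checking is done here.
 */
static uint64_t toy_run(const struct toy_insn *insn)
{
	/* One label address per opcode (GCC/Clang extension). */
	static const void *const jumptable[OP_MAX] = {
		[OP_MOV_IMM] = &&do_mov,
		[OP_ADD_IMM] = &&do_add,
		[OP_JEQ_IMM] = &&do_jeq,
		[OP_EXIT]    = &&do_exit,
	};
	uint64_t acc = 0;

/* Every handler ends in its own indirect jump instead of looping back
 * to a single switch, which is the "jump-threaded" part. */
#define DISPATCH() goto *jumptable[insn->op]

	DISPATCH();
do_mov:
	acc = insn->imm;
	insn++;
	DISPATCH();
do_add:
	acc += insn->imm;
	insn++;
	DISPATCH();
do_jeq:
	/* Single jump target: taken if equal, otherwise fall through. */
	if (acc == insn->imm)
		insn += insn->off;
	insn++;
	DISPATCH();
do_exit:
	return acc;
#undef DISPATCH
}
```

As a usage example, the five-instruction program `{OP_MOV_IMM, 0, 5}, {OP_ADD_IMM, 0, 3}, {OP_JEQ_IMM, 1, 8}, {OP_MOV_IMM, 0, 0}, {OP_EXIT, 0, 0}` takes the jump over the second `OP_MOV_IMM`, so `toy_run()` returns 8; changing the added immediate to 2 makes the comparison fail, execution falls through, and the result is 0.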
..
cpu	sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding	2014-01-13 17:38:55 +01:00
debug	kgdb/kdb: Fix no KDB config problem	2014-01-25 08:55:09 +01:00
events	perf: Fix hotplug splat	2014-02-27 12:38:03 +01:00
gcov	gcov: reuse kbasename helper	2013-11-13 12:09:34 +09:00
irq	genirq: Include missing header file in irqdomain.c	2014-02-27 13:29:35 +01:00
locking	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-01-20 10:42:08 -08:00
power	arm, pm, vmpressure: add missing slab.h includes	2014-02-03 13:24:01 -05:00
printk	printk: fix syslog() overflowing user buffer	2014-02-17 12:24:45 -08:00
rcu	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-01-28 08:38:04 -08:00
sched	sched/clock: Prevent tracing recursion in sched_clock_cpu()	2014-03-11 11:33:48 +01:00
time	sched_clock: Prevent callers from seeing half-updated data	2014-02-19 17:07:22 +01:00
trace	tracing: Fix traceon trigger condition to actually turn tracing on	2014-03-25 23:39:41 -04:00
.gitignore	Ignore generated file kernel/x509_certificate_list	2013-12-10 18:21:34 +00:00
Kconfig.freezer	…
Kconfig.hz	kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS	2013-11-15 09:32:22 +09:00
Kconfig.locks	locking: Fix copy/paste errors of "ARCH_INLINE_*_UNLOCK_BH"	2013-05-28 08:50:00 +02:00
Kconfig.preempt	…
Makefile	KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y	2013-12-13 15:59:11 +00:00
acct.c	fs: Fix hang with BSD accounting on frozen filesystem	2013-05-04 14:57:58 -04:00
async.c	async: rename and redefine async_func_ptr	2013-03-12 13:59:14 -07:00
audit.c	audit: Update kdoc for audit_send_reply and audit_list_rules_send	2014-03-08 15:31:54 -08:00
audit.h	audit: Use struct net not pid_t to remember the network namespce to reply in	2014-02-28 04:04:33 -08:00
audit_tree.c	inotify: Fix reporting of cookies for inotify events	2014-02-18 11:17:17 +01:00
audit_watch.c	inotify: Fix reporting of cookies for inotify events	2014-02-18 11:17:17 +01:00
auditfilter.c	audit: Update kdoc for audit_send_reply and audit_list_rules_send	2014-03-08 15:31:54 -08:00
auditsc.c	execve: use 'struct filename *' for executable name passing	2014-02-05 12:54:53 -08:00
backtracetest.c	…
bounds.c	mm: do not allocate page->ptl dynamically, if spinlock_t fits to long	2013-12-20 12:25:45 -08:00
capability.c	audit: Simplify and correct audit_log_capset	2014-01-13 22:26:48 -05:00
cgroup.c	cgroup: fix a failure path in create_css()	2014-03-18 17:15:36 -04:00
cgroup_freezer.c	cgroup: replace cftype->read_seq_string() with cftype->seq_show()	2013-12-05 12:28:04 -05:00
compat.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal	2013-05-01 07:21:43 -07:00
configs.c	proc: Supply PDE attribute setting accessor functions	2013-05-01 17:29:18 -04:00
context_tracking.c	context_tracking: Wrap static key check into more intuitive function name	2013-12-02 20:43:14 +01:00
cpu.c	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-11-14 16:55:11 +09:00
cpu_pm.c	…
cpuset.c	cpuset: fix a race condition in __cpuset_node_allowed_softwall()	2014-02-27 09:39:54 -05:00
crash_dump.c	…
cred.c	…
delayacct.c	kernel/delayacct.c: remove redundant checking in __delayacct_add_tsk()	2013-11-13 12:09:12 +09:00
dma.c	…
elfcore.c	switch elf_core_write_extra_phdrs() to dump_emit()	2013-11-09 00:16:23 -05:00
exec_domain.c	…
exit.c	introduce for_each_thread() to replace the buggy while_each_thread()	2014-01-21 16:19:46 -08:00
extable.c	kernel/extable: fix address-checks for core_kernel and init areas	2013-11-28 09:49:41 -08:00
fork.c	exec: kill task_struct->did_exec	2014-01-23 16:37:02 -08:00
freezer.c	libata, freezer: avoid block device removal while system is frozen	2013-12-19 13:50:32 -05:00
futex.c	futex: revert back to the explicit waiter counting code	2014-03-20 22:11:17 -07:00
futex_compat.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal	2013-02-23 18:50:11 -08:00
groups.c	userns: Kill nsown_capable it makes the wrong thing easy	2013-08-30 23:44:11 -07:00
hrtimer.c	sched/deadline: Add SCHED_DEADLINE structures & implementation	2014-01-13 13:41:06 +01:00
hung_task.c	hung_task: Display every hung task warning	2014-01-25 12:13:33 +01:00
irq_work.c	Merge branch 'nohz/printk-v8' into irq/core	2013-02-05 00:48:46 +01:00
itimer.c	…
jump_label.c	static_key: WARN on usage before jump_label_init was called	2013-10-19 19:45:35 -04:00
kallsyms.c	kernel: kallsyms: memory override issue, need check destination buffer length	2013-04-15 15:17:26 +09:30
kcmp.c	…
kexec.c	kernel/kexec.c: use vscnprintf() instead of vsnprintf() in vmcoreinfo_append_str()	2014-01-27 21:02:40 -08:00
kmod.c	execve: use 'struct filename *' for executable name passing	2014-02-05 12:54:53 -08:00
kprobes.c	kprobes: use KSYM_NAME_LEN to size identifier buffers	2013-11-13 12:09:26 +09:00
ksysfs.c	kdump: fix exported size of vmcoreinfo note	2014-01-23 16:37:03 -08:00
kthread.c	kthread: make kthread_create() killable	2013-11-13 12:08:59 +09:00
latencytop.c	…
module-internal.h	KEYS: Separate the kernel signature checking keyring from module signing	2013-09-25 17:17:01 +01:00
module.c	module: Add missing newline in printk call.	2014-01-21 09:59:16 +10:30
module_signing.c	keys: change asymmetric keys to use common hash definitions	2013-10-25 17:15:18 -04:00
notifier.c	…
nsproxy.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2013-09-07 14:35:32 -07:00
padata.c	padata: Fix wrong usage of rcu_dereference()	2013-12-05 21:28:42 +08:00
panic.c	panic: Make panic_timeout configurable	2013-11-26 12:12:26 +01:00
params.c	params: improve standard definitions	2013-12-04 14:09:46 +10:30
pid.c	pidns: fix free_pid() to handle the first fork failure	2013-09-30 14:31:03 -07:00
pid_namespace.c	pid_namespace: make freeing struct pid_namespace rcu-delayed	2013-10-24 23:43:29 -04:00
posix-cpu-timers.c	posix-timers: Convert abuses of BUG_ON to WARN_ON	2013-12-09 16:56:29 +01:00
posix-timers.c	posix-timers: Remove unused variable	2013-04-18 12:51:19 +02:00
profile.c	mm: fix GFP_THISNODE callers and clarify	2014-03-10 17:26:19 -07:00
ptrace.c	exec/ptrace: fix get_dumpable() incorrect tests	2013-11-13 12:09:33 +09:00
range.c	range: Do not add new blank slot with add_range_with_merge	2013-06-18 11:32:10 -05:00
reboot.c	kexec: migrate to reboot cpu	2013-12-18 19:04:50 -08:00
relay.c	kernel: delete __cpuinit usage from all core kernel files	2013-07-14 19:36:59 -04:00
res_counter.c	memcg: reduce function dereference	2013-09-12 15:38:02 -07:00
resource.c	kernel/resource.c: remove the unneeded assignment in function __find_resource	2013-07-03 16:08:06 -07:00
seccomp.c	net: filter: rework/optimize internal BPF interpreter's instruction set	2014-03-31 00:45:09 -04:00
signal.c	kernel/signal.c: change do_signal_stop/do_sigaction to use while_each_thread()	2014-01-23 16:37:02 -08:00
smp.c	kernel/smp.c: remove cpumask_ipi	2014-01-30 16:56:54 -08:00
smpboot.c	kernel: delete __cpuinit usage from all core kernel files	2013-07-14 19:36:59 -04:00
smpboot.h	…
softirq.c	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-01-31 09:02:51 -08:00
stacktrace.c	…
stop_machine.c	stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus()	2014-03-11 11:33:47 +01:00
sys.c	kernel/sys.c: k_getrusage() can use while_each_thread()	2014-01-23 16:37:02 -08:00
sys_ni.c	unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE	2013-05-09 13:46:38 -04:00
sysctl.c	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-01-31 08:59:46 -08:00
sysctl_binary.c	kernel/sysctl_binary.c: use scnprintf() instead of snprintf()	2013-11-13 12:09:33 +09:00
system_certificates.S	KEYS: correct alignment of system_certificate_list content in assembly file	2013-12-10 18:25:28 +00:00
system_keyring.c	KEYS: correct alignment of system_certificate_list content in assembly file	2013-12-10 18:25:28 +00:00
task_work.c	task_work: documentation	2013-09-11 15:58:27 -07:00
taskstats.c	genetlink: only pass array to genl_register_family_with_ops()	2013-11-19 16:39:05 -05:00
test_kprobes.c	kernel/: rename random32() to prandom_u32()	2013-04-29 18:28:42 -07:00
time.c	sched: Rename sched.c as sched/core.c in comments and Documentation	2013-06-19 12:58:42 +02:00
timeconst.bc	kernel: Replace timeconst.pl with a bc script	2013-02-16 23:17:25 +01:00
timer.c	timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)	2013-11-19 14:59:50 +01:00
tracepoint.c	tracing: Do not add event files for modules that fail tracepoints	2014-03-03 21:11:05 -05:00
tsacct.c	…
uid16.c	userns: Kill nsown_capable it makes the wrong thing easy	2013-08-30 23:44:11 -07:00
up.c	kernel: provide a __smp_call_function_single stub for !CONFIG_SMP	2013-11-15 09:32:22 +09:00
user-return-notifier.c	hlist: drop the node parameter from iterators	2013-02-27 19:10:24 -08:00
user.c	KEYS: fix uninitialized persistent_keyring_register_sem	2013-12-13 15:59:11 +00:00
user_namespace.c	user_namespace.c: Remove duplicated word in comment	2014-02-20 11:58:35 -08:00
utsname.c	userns: Kill nsown_capable it makes the wrong thing easy	2013-08-30 23:44:11 -07:00
utsname_sysctl.c	kernel/utsname_sysctl.c: put get/get_uts() into CONFIG_PROC_SYSCTL code block	2013-02-27 19:10:22 -08:00
watchdog.c	watchdog: update watchdog_thresh properly	2013-09-24 17:00:25 -07:00
workqueue.c	workqueue: ensure @task is valid across kthread_stop()	2014-02-18 16:35:20 -05:00
workqueue_internal.h	sched: Rename sched.c as sched/core.c in comments and Documentation	2013-06-19 12:58:42 +02:00