2014-07-23 14:01:58 +08:00
|
|
|
/*
|
|
|
|
* Linux Socket Filter - Kernel level socket filtering
|
|
|
|
*
|
|
|
|
* Based on the design of the Berkeley Packet Filter. The new
|
|
|
|
* internal format has been designed by PLUMgrid:
|
|
|
|
*
|
|
|
|
* Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
|
|
|
|
*
|
|
|
|
* Authors:
|
|
|
|
*
|
|
|
|
* Jay Schulist <jschlst@samba.org>
|
|
|
|
* Alexei Starovoitov <ast@plumgrid.com>
|
|
|
|
* Daniel Borkmann <dborkman@redhat.com>
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public License
|
|
|
|
* as published by the Free Software Foundation; either version
|
|
|
|
* 2 of the License, or (at your option) any later version.
|
|
|
|
*
|
|
|
|
* Andi Kleen - Fix a few bad bugs and races.
|
2014-07-31 11:34:14 +08:00
|
|
|
* Kris Katterjohn - Added many additional checks in bpf_check_classic()
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
2014-09-08 14:04:47 +08:00
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
#include <linux/filter.h>
|
|
|
|
#include <linux/skbuff.h>
|
2014-09-03 04:53:44 +08:00
|
|
|
#include <linux/vmalloc.h>
|
2014-09-08 14:04:47 +08:00
|
|
|
#include <linux/random.h>
|
|
|
|
#include <linux/moduleloader.h>
|
2014-09-26 15:17:00 +08:00
|
|
|
#include <linux/bpf.h>
|
2016-02-29 12:22:37 +08:00
|
|
|
#include <linux/frame.h>
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
#include <linux/rbtree_latch.h>
|
|
|
|
#include <linux/kallsyms.h>
|
|
|
|
#include <linux/rcupdate.h>
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
#include <asm/unaligned.h>
|
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
/* Registers */
|
|
|
|
#define BPF_R0 regs[BPF_REG_0]
|
|
|
|
#define BPF_R1 regs[BPF_REG_1]
|
|
|
|
#define BPF_R2 regs[BPF_REG_2]
|
|
|
|
#define BPF_R3 regs[BPF_REG_3]
|
|
|
|
#define BPF_R4 regs[BPF_REG_4]
|
|
|
|
#define BPF_R5 regs[BPF_REG_5]
|
|
|
|
#define BPF_R6 regs[BPF_REG_6]
|
|
|
|
#define BPF_R7 regs[BPF_REG_7]
|
|
|
|
#define BPF_R8 regs[BPF_REG_8]
|
|
|
|
#define BPF_R9 regs[BPF_REG_9]
|
|
|
|
#define BPF_R10 regs[BPF_REG_10]
|
|
|
|
|
|
|
|
/* Named registers */
|
|
|
|
#define DST regs[insn->dst_reg]
|
|
|
|
#define SRC regs[insn->src_reg]
|
|
|
|
#define FP regs[BPF_REG_FP]
|
|
|
|
#define ARG1 regs[BPF_REG_ARG1]
|
|
|
|
#define CTX regs[BPF_REG_CTX]
|
|
|
|
#define IMM insn->imm
|
|
|
|
|
|
|
|
/* No hurry in this branch
|
|
|
|
*
|
|
|
|
* Exported for the bpf jit load helper.
|
|
|
|
*/
|
|
|
|
void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
|
|
|
|
{
|
|
|
|
u8 *ptr = NULL;
|
|
|
|
|
|
|
|
if (k >= SKF_NET_OFF)
|
|
|
|
ptr = skb_network_header(skb) + k - SKF_NET_OFF;
|
|
|
|
else if (k >= SKF_LL_OFF)
|
|
|
|
ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
|
2015-05-30 05:23:07 +08:00
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
|
|
|
|
return ptr;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
|
|
|
|
{
|
2017-05-09 06:57:44 +08:00
|
|
|
gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
|
2014-09-26 15:17:00 +08:00
|
|
|
struct bpf_prog_aux *aux;
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *fp;
|
|
|
|
|
|
|
|
size = round_up(size, PAGE_SIZE);
|
|
|
|
fp = __vmalloc(size, gfp_flags, PAGE_KERNEL);
|
|
|
|
if (fp == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
2015-09-30 07:41:50 +08:00
|
|
|
kmemcheck_annotate_bitfield(fp, meta);
|
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
aux = kzalloc(sizeof(*aux), GFP_KERNEL | gfp_extra_flags);
|
|
|
|
if (aux == NULL) {
|
2014-09-03 04:53:44 +08:00
|
|
|
vfree(fp);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
fp->pages = size / PAGE_SIZE;
|
2014-09-26 15:17:00 +08:00
|
|
|
fp->aux = aux;
|
2015-10-29 21:58:08 +08:00
|
|
|
fp->aux->prog = fp;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode);
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
return fp;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(bpf_prog_alloc);
|
|
|
|
|
|
|
|
struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
|
|
|
|
gfp_t gfp_extra_flags)
|
|
|
|
{
|
2017-05-09 06:57:44 +08:00
|
|
|
gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
|
2014-09-03 04:53:44 +08:00
|
|
|
struct bpf_prog *fp;
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
u32 pages, delta;
|
|
|
|
int ret;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
|
|
|
BUG_ON(fp_old == NULL);
|
|
|
|
|
|
|
|
size = round_up(size, PAGE_SIZE);
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
pages = size / PAGE_SIZE;
|
|
|
|
if (pages <= fp_old->pages)
|
2014-09-03 04:53:44 +08:00
|
|
|
return fp_old;
|
|
|
|
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
delta = pages - fp_old->pages;
|
|
|
|
ret = __bpf_prog_charge(fp_old->aux->user, delta);
|
|
|
|
if (ret)
|
|
|
|
return NULL;
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
fp = __vmalloc(size, gfp_flags, PAGE_KERNEL);
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
if (fp == NULL) {
|
|
|
|
__bpf_prog_uncharge(fp_old->aux->user, delta);
|
|
|
|
} else {
|
2015-09-30 07:41:50 +08:00
|
|
|
kmemcheck_annotate_bitfield(fp, meta);
|
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
|
bpf: fix overflow in prog accounting
Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-18 08:52:58 +08:00
|
|
|
fp->pages = pages;
|
2015-10-29 21:58:08 +08:00
|
|
|
fp->aux->prog = fp;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
/* We keep fp->aux from fp_old around in the new
|
2014-09-03 04:53:44 +08:00
|
|
|
* reallocated structure.
|
|
|
|
*/
|
2014-09-26 15:17:00 +08:00
|
|
|
fp_old->aux = NULL;
|
2014-09-03 04:53:44 +08:00
|
|
|
__bpf_prog_free(fp_old);
|
|
|
|
}
|
|
|
|
|
|
|
|
return fp;
|
|
|
|
}
|
|
|
|
|
|
|
|
void __bpf_prog_free(struct bpf_prog *fp)
|
|
|
|
{
|
2014-09-26 15:17:00 +08:00
|
|
|
kfree(fp->aux);
|
2014-09-03 04:53:44 +08:00
|
|
|
vfree(fp);
|
|
|
|
}
|
|
|
|
|
2017-01-14 06:38:15 +08:00
|
|
|
int bpf_prog_calc_tag(struct bpf_prog *fp)
|
2016-12-05 06:19:41 +08:00
|
|
|
{
|
|
|
|
const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
|
2017-01-14 06:38:15 +08:00
|
|
|
u32 raw_size = bpf_prog_tag_scratch_size(fp);
|
|
|
|
u32 digest[SHA_DIGEST_WORDS];
|
2016-12-18 08:52:57 +08:00
|
|
|
u32 ws[SHA_WORKSPACE_WORDS];
|
2016-12-05 06:19:41 +08:00
|
|
|
u32 i, bsize, psize, blocks;
|
2016-12-18 08:52:57 +08:00
|
|
|
struct bpf_insn *dst;
|
2016-12-05 06:19:41 +08:00
|
|
|
bool was_ld_map;
|
2016-12-18 08:52:57 +08:00
|
|
|
u8 *raw, *todo;
|
2016-12-05 06:19:41 +08:00
|
|
|
__be32 *result;
|
|
|
|
__be64 *bits;
|
|
|
|
|
2016-12-18 08:52:57 +08:00
|
|
|
raw = vmalloc(raw_size);
|
|
|
|
if (!raw)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2017-01-14 06:38:15 +08:00
|
|
|
sha_init(digest);
|
2016-12-05 06:19:41 +08:00
|
|
|
memset(ws, 0, sizeof(ws));
|
|
|
|
|
|
|
|
/* We need to take out the map fd for the digest calculation
|
|
|
|
* since they are unstable from user space side.
|
|
|
|
*/
|
2016-12-18 08:52:57 +08:00
|
|
|
dst = (void *)raw;
|
2016-12-05 06:19:41 +08:00
|
|
|
for (i = 0, was_ld_map = false; i < fp->len; i++) {
|
|
|
|
dst[i] = fp->insnsi[i];
|
|
|
|
if (!was_ld_map &&
|
|
|
|
dst[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
|
|
|
|
dst[i].src_reg == BPF_PSEUDO_MAP_FD) {
|
|
|
|
was_ld_map = true;
|
|
|
|
dst[i].imm = 0;
|
|
|
|
} else if (was_ld_map &&
|
|
|
|
dst[i].code == 0 &&
|
|
|
|
dst[i].dst_reg == 0 &&
|
|
|
|
dst[i].src_reg == 0 &&
|
|
|
|
dst[i].off == 0) {
|
|
|
|
was_ld_map = false;
|
|
|
|
dst[i].imm = 0;
|
|
|
|
} else {
|
|
|
|
was_ld_map = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-12-18 08:52:57 +08:00
|
|
|
psize = bpf_prog_insn_size(fp);
|
|
|
|
memset(&raw[psize], 0, raw_size - psize);
|
2016-12-05 06:19:41 +08:00
|
|
|
raw[psize++] = 0x80;
|
|
|
|
|
|
|
|
bsize = round_up(psize, SHA_MESSAGE_BYTES);
|
|
|
|
blocks = bsize / SHA_MESSAGE_BYTES;
|
2016-12-18 08:52:57 +08:00
|
|
|
todo = raw;
|
2016-12-05 06:19:41 +08:00
|
|
|
if (bsize - psize >= sizeof(__be64)) {
|
|
|
|
bits = (__be64 *)(todo + bsize - sizeof(__be64));
|
|
|
|
} else {
|
|
|
|
bits = (__be64 *)(todo + bsize + bits_offset);
|
|
|
|
blocks++;
|
|
|
|
}
|
|
|
|
*bits = cpu_to_be64((psize - 1) << 3);
|
|
|
|
|
|
|
|
while (blocks--) {
|
2017-01-14 06:38:15 +08:00
|
|
|
sha_transform(digest, todo, ws);
|
2016-12-05 06:19:41 +08:00
|
|
|
todo += SHA_MESSAGE_BYTES;
|
|
|
|
}
|
|
|
|
|
2017-01-14 06:38:15 +08:00
|
|
|
result = (__force __be32 *)digest;
|
2016-12-05 06:19:41 +08:00
|
|
|
for (i = 0; i < SHA_DIGEST_WORDS; i++)
|
2017-01-14 06:38:15 +08:00
|
|
|
result[i] = cpu_to_be32(digest[i]);
|
|
|
|
memcpy(fp->tag, result, sizeof(fp->tag));
|
2016-12-18 08:52:57 +08:00
|
|
|
|
|
|
|
vfree(raw);
|
|
|
|
return 0;
|
2016-12-05 06:19:41 +08:00
|
|
|
}
|
|
|
|
|
2016-05-14 01:08:30 +08:00
|
|
|
static bool bpf_is_jmp_and_has_target(const struct bpf_insn *insn)
|
|
|
|
{
|
|
|
|
return BPF_CLASS(insn->code) == BPF_JMP &&
|
|
|
|
/* Call and Exit are both special jumps with no
|
|
|
|
* target inside the BPF instruction image.
|
|
|
|
*/
|
|
|
|
BPF_OP(insn->code) != BPF_CALL &&
|
|
|
|
BPF_OP(insn->code) != BPF_EXIT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta)
|
|
|
|
{
|
|
|
|
struct bpf_insn *insn = prog->insnsi;
|
|
|
|
u32 i, insn_cnt = prog->len;
|
|
|
|
|
|
|
|
for (i = 0; i < insn_cnt; i++, insn++) {
|
|
|
|
if (!bpf_is_jmp_and_has_target(insn))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Adjust offset of jmps if we cross boundaries. */
|
|
|
|
if (i < pos && i + insn->off + 1 > pos)
|
|
|
|
insn->off += delta;
|
|
|
|
else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
|
|
|
|
insn->off -= delta;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
|
|
|
|
const struct bpf_insn *patch, u32 len)
|
|
|
|
{
|
|
|
|
u32 insn_adj_cnt, insn_rest, insn_delta = len - 1;
|
|
|
|
struct bpf_prog *prog_adj;
|
|
|
|
|
|
|
|
/* Since our patchlet doesn't expand the image, we're done. */
|
|
|
|
if (insn_delta == 0) {
|
|
|
|
memcpy(prog->insnsi + off, patch, sizeof(*patch));
|
|
|
|
return prog;
|
|
|
|
}
|
|
|
|
|
|
|
|
insn_adj_cnt = prog->len + insn_delta;
|
|
|
|
|
|
|
|
/* Several new instructions need to be inserted. Make room
|
|
|
|
* for them. Likely, there's no need for a new allocation as
|
|
|
|
* last page could have large enough tailroom.
|
|
|
|
*/
|
|
|
|
prog_adj = bpf_prog_realloc(prog, bpf_prog_size(insn_adj_cnt),
|
|
|
|
GFP_USER);
|
|
|
|
if (!prog_adj)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
prog_adj->len = insn_adj_cnt;
|
|
|
|
|
|
|
|
/* Patching happens in 3 steps:
|
|
|
|
*
|
|
|
|
* 1) Move over tail of insnsi from next instruction onwards,
|
|
|
|
* so we can patch the single target insn with one or more
|
|
|
|
* new ones (patching is always from 1 to n insns, n > 0).
|
|
|
|
* 2) Inject new instructions at the target location.
|
|
|
|
* 3) Adjust branch offsets if necessary.
|
|
|
|
*/
|
|
|
|
insn_rest = insn_adj_cnt - off - len;
|
|
|
|
|
|
|
|
memmove(prog_adj->insnsi + off + len, prog_adj->insnsi + off + 1,
|
|
|
|
sizeof(*patch) * insn_rest);
|
|
|
|
memcpy(prog_adj->insnsi + off, patch, sizeof(*patch) * len);
|
|
|
|
|
|
|
|
bpf_adj_branches(prog_adj, off, insn_delta);
|
|
|
|
|
|
|
|
return prog_adj;
|
|
|
|
}
|
|
|
|
|
2014-09-10 21:01:02 +08:00
|
|
|
#ifdef CONFIG_BPF_JIT
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
static __always_inline void
|
|
|
|
bpf_get_prog_addr_region(const struct bpf_prog *prog,
|
|
|
|
unsigned long *symbol_start,
|
|
|
|
unsigned long *symbol_end)
|
|
|
|
{
|
|
|
|
const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(prog);
|
|
|
|
unsigned long addr = (unsigned long)hdr;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!bpf_prog_ebpf_jited(prog));
|
|
|
|
|
|
|
|
*symbol_start = addr;
|
|
|
|
*symbol_end = addr + hdr->pages * PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
|
|
|
|
{
|
|
|
|
BUILD_BUG_ON(sizeof("bpf_prog_") +
|
|
|
|
sizeof(prog->tag) * 2 + 1 > KSYM_NAME_LEN);
|
|
|
|
|
|
|
|
sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
|
|
|
|
sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
|
|
|
|
*sym = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline unsigned long
|
|
|
|
bpf_get_prog_addr_start(struct latch_tree_node *n)
|
|
|
|
{
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
const struct bpf_prog_aux *aux;
|
|
|
|
|
|
|
|
aux = container_of(n, struct bpf_prog_aux, ksym_tnode);
|
|
|
|
bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
|
|
|
|
|
|
|
|
return symbol_start;
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline bool bpf_tree_less(struct latch_tree_node *a,
|
|
|
|
struct latch_tree_node *b)
|
|
|
|
{
|
|
|
|
return bpf_get_prog_addr_start(a) < bpf_get_prog_addr_start(b);
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline int bpf_tree_comp(void *key, struct latch_tree_node *n)
|
|
|
|
{
|
|
|
|
unsigned long val = (unsigned long)key;
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
const struct bpf_prog_aux *aux;
|
|
|
|
|
|
|
|
aux = container_of(n, struct bpf_prog_aux, ksym_tnode);
|
|
|
|
bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
|
|
|
|
|
|
|
|
if (val < symbol_start)
|
|
|
|
return -1;
|
|
|
|
if (val >= symbol_end)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct latch_tree_ops bpf_tree_ops = {
|
|
|
|
.less = bpf_tree_less,
|
|
|
|
.comp = bpf_tree_comp,
|
|
|
|
};
|
|
|
|
|
|
|
|
static DEFINE_SPINLOCK(bpf_lock);
|
|
|
|
static LIST_HEAD(bpf_kallsyms);
|
|
|
|
static struct latch_tree_root bpf_tree __cacheline_aligned;
|
|
|
|
|
|
|
|
int bpf_jit_kallsyms __read_mostly;
|
|
|
|
|
|
|
|
static void bpf_prog_ksym_node_add(struct bpf_prog_aux *aux)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!list_empty(&aux->ksym_lnode));
|
|
|
|
list_add_tail_rcu(&aux->ksym_lnode, &bpf_kallsyms);
|
|
|
|
latch_tree_insert(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux)
|
|
|
|
{
|
|
|
|
if (list_empty(&aux->ksym_lnode))
|
|
|
|
return;
|
|
|
|
|
|
|
|
latch_tree_erase(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
|
|
|
|
list_del_rcu(&aux->ksym_lnode);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
return fp->jited && !bpf_prog_was_classic(fp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
return list_empty(&fp->aux->ksym_lnode) ||
|
|
|
|
fp->aux->ksym_lnode.prev == LIST_POISON2;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_kallsyms_add(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
if (!bpf_prog_kallsyms_candidate(fp) ||
|
|
|
|
!capable(CAP_SYS_ADMIN))
|
|
|
|
return;
|
|
|
|
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_lock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
bpf_prog_ksym_node_add(fp->aux);
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_unlock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_prog_kallsyms_del(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
if (!bpf_prog_kallsyms_candidate(fp))
|
|
|
|
return;
|
|
|
|
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_lock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
bpf_prog_ksym_node_del(fp->aux);
|
2017-04-27 07:39:33 +08:00
|
|
|
spin_unlock_bh(&bpf_lock);
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
|
|
|
|
{
|
|
|
|
struct latch_tree_node *n;
|
|
|
|
|
|
|
|
if (!bpf_jit_kallsyms_enabled())
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops);
|
|
|
|
return n ?
|
|
|
|
container_of(n, struct bpf_prog_aux, ksym_tnode)->prog :
|
|
|
|
NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
|
|
|
|
unsigned long *off, char *sym)
|
|
|
|
{
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
struct bpf_prog *prog;
|
|
|
|
char *ret = NULL;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
prog = bpf_prog_kallsyms_find(addr);
|
|
|
|
if (prog) {
|
|
|
|
bpf_get_prog_addr_region(prog, &symbol_start, &symbol_end);
|
|
|
|
bpf_get_prog_name(prog, sym);
|
|
|
|
|
|
|
|
ret = sym;
|
|
|
|
if (size)
|
|
|
|
*size = symbol_end - symbol_start;
|
|
|
|
if (off)
|
|
|
|
*off = addr - symbol_start;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool is_bpf_text_address(unsigned long addr)
|
|
|
|
{
|
|
|
|
bool ret;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
ret = bpf_prog_kallsyms_find(addr) != NULL;
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
|
|
|
|
char *sym)
|
|
|
|
{
|
|
|
|
unsigned long symbol_start, symbol_end;
|
|
|
|
struct bpf_prog_aux *aux;
|
|
|
|
unsigned int it = 0;
|
|
|
|
int ret = -ERANGE;
|
|
|
|
|
|
|
|
if (!bpf_jit_kallsyms_enabled())
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(aux, &bpf_kallsyms, ksym_lnode) {
|
|
|
|
if (it++ != symnum)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
|
|
|
|
bpf_get_prog_name(aux->prog, sym);
|
|
|
|
|
|
|
|
*value = symbol_start;
|
|
|
|
*type = BPF_SYM_ELF_TYPE;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-09-08 14:04:47 +08:00
|
|
|
struct bpf_binary_header *
|
|
|
|
bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
|
|
|
|
unsigned int alignment,
|
|
|
|
bpf_jit_fill_hole_t bpf_fill_ill_insns)
|
|
|
|
{
|
|
|
|
struct bpf_binary_header *hdr;
|
|
|
|
unsigned int size, hole, start;
|
|
|
|
|
|
|
|
/* Most of BPF filters are really small, but if some of them
|
|
|
|
* fill a page, allow at least 128 extra bytes to insert a
|
|
|
|
* random section of illegal instructions.
|
|
|
|
*/
|
|
|
|
size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
|
|
|
|
hdr = module_alloc(size);
|
|
|
|
if (hdr == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* Fill space with illegal/arch-dep instructions. */
|
|
|
|
bpf_fill_ill_insns(hdr, size);
|
|
|
|
|
|
|
|
hdr->pages = size / PAGE_SIZE;
|
|
|
|
hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
|
|
|
|
PAGE_SIZE - sizeof(*hdr));
|
2016-05-18 20:14:28 +08:00
|
|
|
start = (get_random_int() % hole) & ~(alignment - 1);
|
2014-09-08 14:04:47 +08:00
|
|
|
|
|
|
|
/* Leave a random number of instructions before BPF code. */
|
|
|
|
*image_ptr = &hdr->image[start];
|
|
|
|
|
|
|
|
return hdr;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_jit_binary_free(struct bpf_binary_header *hdr)
|
|
|
|
{
|
2015-01-20 06:37:05 +08:00
|
|
|
module_memfree(hdr);
|
2014-09-08 14:04:47 +08:00
|
|
|
}
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
|
bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 05:24:50 +08:00
|
|
|
/* This symbol is only overridden by archs that have different
|
|
|
|
* requirements than the usual eBPF JITs, f.e. when they only
|
|
|
|
* implement cBPF JIT, do not set images read-only, etc.
|
|
|
|
*/
|
|
|
|
void __weak bpf_jit_free(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
if (fp->jited) {
|
|
|
|
struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
|
|
|
|
|
|
|
|
bpf_jit_binary_unlock_ro(hdr);
|
|
|
|
bpf_jit_binary_free(hdr);
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
|
|
|
|
}
|
|
|
|
|
|
|
|
bpf_prog_unlock_free(fp);
|
|
|
|
}
|
|
|
|
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
int bpf_jit_harden __read_mostly;
|
|
|
|
|
|
|
|
static int bpf_jit_blind_insn(const struct bpf_insn *from,
|
|
|
|
const struct bpf_insn *aux,
|
|
|
|
struct bpf_insn *to_buff)
|
|
|
|
{
|
|
|
|
struct bpf_insn *to = to_buff;
|
2016-05-18 20:14:28 +08:00
|
|
|
u32 imm_rnd = get_random_int();
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
s16 off;
|
|
|
|
|
|
|
|
BUILD_BUG_ON(BPF_REG_AX + 1 != MAX_BPF_JIT_REG);
|
|
|
|
BUILD_BUG_ON(MAX_BPF_REG + 1 != MAX_BPF_JIT_REG);
|
|
|
|
|
|
|
|
if (from->imm == 0 &&
|
|
|
|
(from->code == (BPF_ALU | BPF_MOV | BPF_K) ||
|
|
|
|
from->code == (BPF_ALU64 | BPF_MOV | BPF_K))) {
|
|
|
|
*to++ = BPF_ALU64_REG(BPF_XOR, from->dst_reg, from->dst_reg);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (from->code) {
|
|
|
|
case BPF_ALU | BPF_ADD | BPF_K:
|
|
|
|
case BPF_ALU | BPF_SUB | BPF_K:
|
|
|
|
case BPF_ALU | BPF_AND | BPF_K:
|
|
|
|
case BPF_ALU | BPF_OR | BPF_K:
|
|
|
|
case BPF_ALU | BPF_XOR | BPF_K:
|
|
|
|
case BPF_ALU | BPF_MUL | BPF_K:
|
|
|
|
case BPF_ALU | BPF_MOV | BPF_K:
|
|
|
|
case BPF_ALU | BPF_DIV | BPF_K:
|
|
|
|
case BPF_ALU | BPF_MOD | BPF_K:
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_ALU64 | BPF_ADD | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_SUB | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_AND | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_OR | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_XOR | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MUL | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MOV | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_DIV | BPF_K:
|
|
|
|
case BPF_ALU64 | BPF_MOD | BPF_K:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU64_REG(from->code, from->dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_JMP | BPF_JEQ | BPF_K:
|
|
|
|
case BPF_JMP | BPF_JNE | BPF_K:
|
|
|
|
case BPF_JMP | BPF_JGT | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JLT | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JGE | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JLE | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JSGT | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JSLT | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JSGE | BPF_K:
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
case BPF_JMP | BPF_JSLE | BPF_K:
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
case BPF_JMP | BPF_JSET | BPF_K:
|
|
|
|
/* Accommodate for extra offset in case of a backjump. */
|
|
|
|
off = from->off;
|
|
|
|
if (off < 0)
|
|
|
|
off -= 2;
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_JMP_REG(from->code, from->dst_reg, BPF_REG_AX, off);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_LD | BPF_ABS | BPF_W:
|
|
|
|
case BPF_LD | BPF_ABS | BPF_H:
|
|
|
|
case BPF_LD | BPF_ABS | BPF_B:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_LD_IND(from->code, BPF_REG_AX, 0);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_LD | BPF_IND | BPF_W:
|
|
|
|
case BPF_LD | BPF_IND | BPF_H:
|
|
|
|
case BPF_LD | BPF_IND | BPF_B:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU32_REG(BPF_ADD, BPF_REG_AX, from->src_reg);
|
|
|
|
*to++ = BPF_LD_IND(from->code, BPF_REG_AX, 0);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_LD | BPF_IMM | BPF_DW:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[1].imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32);
|
|
|
|
*to++ = BPF_ALU64_REG(BPF_MOV, aux[0].dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm);
|
|
|
|
*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_ALU64_REG(BPF_OR, aux[0].dst_reg, BPF_REG_AX);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case BPF_ST | BPF_MEM | BPF_DW:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_W:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_H:
|
|
|
|
case BPF_ST | BPF_MEM | BPF_B:
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
|
|
|
|
*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
|
|
|
|
*to++ = BPF_STX_MEM(from->code, from->dst_reg, BPF_REG_AX, from->off);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return to - to_buff;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct bpf_prog *bpf_prog_clone_create(struct bpf_prog *fp_other,
|
|
|
|
gfp_t gfp_extra_flags)
|
|
|
|
{
|
2017-05-09 06:57:44 +08:00
|
|
|
gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
|
bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 01:08:32 +08:00
|
|
|
struct bpf_prog *fp;
|
|
|
|
|
|
|
|
fp = __vmalloc(fp_other->pages * PAGE_SIZE, gfp_flags, PAGE_KERNEL);
|
|
|
|
if (fp != NULL) {
|
|
|
|
kmemcheck_annotate_bitfield(fp, meta);
|
|
|
|
|
|
|
|
/* aux->prog still points to the fp_other one, so
|
|
|
|
* when promoting the clone to the real program,
|
|
|
|
* this still needs to be adapted.
|
|
|
|
*/
|
|
|
|
memcpy(fp, fp_other, fp_other->pages * PAGE_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return fp;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_prog_clone_free(struct bpf_prog *fp)
|
|
|
|
{
|
|
|
|
/* aux was stolen by the other clone, so we cannot free
|
|
|
|
* it from this path! It will be freed eventually by the
|
|
|
|
* other program on release.
|
|
|
|
*
|
|
|
|
* At this point, we don't need a deferred release since
|
|
|
|
* clone is guaranteed to not be locked.
|
|
|
|
*/
|
|
|
|
fp->aux = NULL;
|
|
|
|
__bpf_prog_free(fp);
|
|
|
|
}
|
|
|
|
|
|
|
|
void bpf_jit_prog_release_other(struct bpf_prog *fp, struct bpf_prog *fp_other)
|
|
|
|
{
|
|
|
|
/* We have to repoint aux->prog to self, as we don't
|
|
|
|
* know whether fp here is the clone or the original.
|
|
|
|
*/
|
|
|
|
fp->aux->prog = fp;
|
|
|
|
bpf_prog_clone_free(fp_other);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
struct bpf_insn insn_buff[16], aux[2];
|
|
|
|
struct bpf_prog *clone, *tmp;
|
|
|
|
int insn_delta, insn_cnt;
|
|
|
|
struct bpf_insn *insn;
|
|
|
|
int i, rewritten;
|
|
|
|
|
|
|
|
if (!bpf_jit_blinding_enabled())
|
|
|
|
return prog;
|
|
|
|
|
|
|
|
clone = bpf_prog_clone_create(prog, GFP_USER);
|
|
|
|
if (!clone)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
insn_cnt = clone->len;
|
|
|
|
insn = clone->insnsi;
|
|
|
|
|
|
|
|
for (i = 0; i < insn_cnt; i++, insn++) {
|
|
|
|
/* We temporarily need to hold the original ld64 insn
|
|
|
|
* so that we can still access the first part in the
|
|
|
|
* second blinding run.
|
|
|
|
*/
|
|
|
|
if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW) &&
|
|
|
|
insn[1].code == 0)
|
|
|
|
memcpy(aux, insn, sizeof(aux));
|
|
|
|
|
|
|
|
rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
|
|
|
|
if (!rewritten)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
tmp = bpf_patch_insn_single(clone, i, insn_buff, rewritten);
|
|
|
|
if (!tmp) {
|
|
|
|
/* Patching may have repointed aux->prog during
|
|
|
|
* realloc from the original one, so we need to
|
|
|
|
* fix it up here on error.
|
|
|
|
*/
|
|
|
|
bpf_jit_prog_release_other(prog, clone);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
|
|
|
clone = tmp;
|
|
|
|
insn_delta = rewritten - 1;
|
|
|
|
|
|
|
|
/* Walk new program and skip insns we just inserted. */
|
|
|
|
insn = clone->insnsi + i + insn_delta;
|
|
|
|
insn_cnt += insn_delta;
|
|
|
|
i += insn_delta;
|
|
|
|
}
|
|
|
|
|
|
|
|
return clone;
|
|
|
|
}
|
2014-09-10 21:01:02 +08:00
|
|
|
#endif /* CONFIG_BPF_JIT */
|
2014-09-08 14:04:47 +08:00
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
/* Base function for offset calculation. Needs to go into .text section,
|
|
|
|
* therefore keeping it non-static as well; will also be used by JITs
|
|
|
|
* anyway later on, so do not let the compiler omit it.
|
|
|
|
*/
|
|
|
|
noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2015-07-21 11:34:19 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__bpf_call_base);
|
2014-07-23 14:01:58 +08:00
|
|
|
|
|
|
|
/**
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
* __bpf_prog_run - run eBPF program on a given context
|
|
|
|
* @ctx: is the data we are operating on
|
|
|
|
* @insn: is the array of eBPF instructions
|
2014-07-23 14:01:58 +08:00
|
|
|
*
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
* Decode and execute eBPF instructions.
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
2017-05-31 04:31:28 +08:00
|
|
|
static unsigned int ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn,
|
|
|
|
u64 *stack)
|
2014-07-23 14:01:58 +08:00
|
|
|
{
|
2017-05-31 04:31:28 +08:00
|
|
|
u64 tmp;
|
2014-07-23 14:01:58 +08:00
|
|
|
static const void *jumptable[256] = {
|
|
|
|
[0 ... 255] = &&default_label,
|
|
|
|
/* Now overwrite non-defaults ... */
|
|
|
|
/* 32 bit ALU operations */
|
|
|
|
[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
|
|
|
|
[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
|
|
|
|
[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
|
|
|
|
[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
|
|
|
|
[BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
|
|
|
|
[BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
|
|
|
|
[BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X,
|
|
|
|
[BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K,
|
|
|
|
[BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
|
|
|
|
[BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
|
|
|
|
[BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
|
|
|
|
[BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
|
|
|
|
[BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
|
|
|
|
[BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
|
|
|
|
[BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
|
|
|
|
[BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
|
|
|
|
[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
|
|
|
|
[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
|
|
|
|
[BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
|
|
|
|
[BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
|
|
|
|
[BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
|
|
|
|
[BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
|
|
|
|
[BPF_ALU | BPF_NEG] = &&ALU_NEG,
|
|
|
|
[BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
|
|
|
|
[BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
|
|
|
|
/* 64 bit ALU operations */
|
|
|
|
[BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
|
|
|
|
[BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
|
|
|
|
[BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
|
|
|
|
[BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
|
|
|
|
[BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
|
|
|
|
[BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
|
|
|
|
[BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
|
|
|
|
[BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
|
|
|
|
[BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
|
|
|
|
[BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
|
|
|
|
[BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
|
|
|
|
[BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
|
|
|
|
[BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
|
|
|
|
[BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
|
|
|
|
[BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
|
|
|
|
[BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
|
|
|
|
[BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
|
|
|
|
[BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
|
|
|
|
[BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
|
|
|
|
[BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
|
|
|
|
[BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
|
|
|
|
[BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
|
|
|
|
[BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
|
|
|
|
[BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
|
|
|
|
[BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
|
|
|
|
/* Call instruction */
|
|
|
|
[BPF_JMP | BPF_CALL] = &&JMP_CALL,
|
2017-05-31 04:31:27 +08:00
|
|
|
[BPF_JMP | BPF_TAIL_CALL] = &&JMP_TAIL_CALL,
|
2014-07-23 14:01:58 +08:00
|
|
|
/* Jumps */
|
|
|
|
[BPF_JMP | BPF_JA] = &&JMP_JA,
|
|
|
|
[BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
|
|
|
|
[BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
|
|
|
|
[BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
|
|
|
|
[BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
|
|
|
|
[BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
|
|
|
|
[BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
[BPF_JMP | BPF_JLT | BPF_X] = &&JMP_JLT_X,
|
|
|
|
[BPF_JMP | BPF_JLT | BPF_K] = &&JMP_JLT_K,
|
2014-07-23 14:01:58 +08:00
|
|
|
[BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
|
|
|
|
[BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
[BPF_JMP | BPF_JLE | BPF_X] = &&JMP_JLE_X,
|
|
|
|
[BPF_JMP | BPF_JLE | BPF_K] = &&JMP_JLE_K,
|
2014-07-23 14:01:58 +08:00
|
|
|
[BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
|
|
|
|
[BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
[BPF_JMP | BPF_JSLT | BPF_X] = &&JMP_JSLT_X,
|
|
|
|
[BPF_JMP | BPF_JSLT | BPF_K] = &&JMP_JSLT_K,
|
2014-07-23 14:01:58 +08:00
|
|
|
[BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
|
|
|
|
[BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
[BPF_JMP | BPF_JSLE | BPF_X] = &&JMP_JSLE_X,
|
|
|
|
[BPF_JMP | BPF_JSLE | BPF_K] = &&JMP_JSLE_K,
|
2014-07-23 14:01:58 +08:00
|
|
|
[BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
|
|
|
|
[BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
|
|
|
|
/* Program return */
|
|
|
|
[BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
|
|
|
|
/* Store instructions */
|
|
|
|
[BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
|
|
|
|
[BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
|
|
|
|
[BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
|
|
|
|
[BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
|
|
|
|
[BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
|
|
|
|
[BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
|
|
|
|
[BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
|
|
|
|
[BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
|
|
|
|
[BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
|
|
|
|
[BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
|
|
|
|
/* Load instructions */
|
|
|
|
[BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
|
|
|
|
[BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
|
|
|
|
[BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
|
|
|
|
[BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
|
|
|
|
[BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
|
|
|
|
[BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
|
|
|
|
[BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
|
|
|
|
[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
|
|
|
|
[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
|
|
|
|
[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
|
net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.
x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn
Note that old eBPF programs are binary compatible with new interpreter.
It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)
User space generated programs will use this instruction to load constants only.
To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 13:17:17 +08:00
|
|
|
[BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,
|
2014-07-23 14:01:58 +08:00
|
|
|
};
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
u32 tail_call_cnt = 0;
|
2014-07-23 14:01:58 +08:00
|
|
|
void *ptr;
|
|
|
|
int off;
|
|
|
|
|
|
|
|
#define CONT ({ insn++; goto select_insn; })
|
|
|
|
#define CONT_JMP ({ insn++; goto select_insn; })
|
|
|
|
|
|
|
|
select_insn:
|
|
|
|
goto *jumptable[insn->code];
|
|
|
|
|
|
|
|
/* ALU */
|
|
|
|
#define ALU(OPCODE, OP) \
|
|
|
|
ALU64_##OPCODE##_X: \
|
|
|
|
DST = DST OP SRC; \
|
|
|
|
CONT; \
|
|
|
|
ALU_##OPCODE##_X: \
|
|
|
|
DST = (u32) DST OP (u32) SRC; \
|
|
|
|
CONT; \
|
|
|
|
ALU64_##OPCODE##_K: \
|
|
|
|
DST = DST OP IMM; \
|
|
|
|
CONT; \
|
|
|
|
ALU_##OPCODE##_K: \
|
|
|
|
DST = (u32) DST OP (u32) IMM; \
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
ALU(ADD, +)
|
|
|
|
ALU(SUB, -)
|
|
|
|
ALU(AND, &)
|
|
|
|
ALU(OR, |)
|
|
|
|
ALU(LSH, <<)
|
|
|
|
ALU(RSH, >>)
|
|
|
|
ALU(XOR, ^)
|
|
|
|
ALU(MUL, *)
|
|
|
|
#undef ALU
|
|
|
|
ALU_NEG:
|
|
|
|
DST = (u32) -DST;
|
|
|
|
CONT;
|
|
|
|
ALU64_NEG:
|
|
|
|
DST = -DST;
|
|
|
|
CONT;
|
|
|
|
ALU_MOV_X:
|
|
|
|
DST = (u32) SRC;
|
|
|
|
CONT;
|
|
|
|
ALU_MOV_K:
|
|
|
|
DST = (u32) IMM;
|
|
|
|
CONT;
|
|
|
|
ALU64_MOV_X:
|
|
|
|
DST = SRC;
|
|
|
|
CONT;
|
|
|
|
ALU64_MOV_K:
|
|
|
|
DST = IMM;
|
|
|
|
CONT;
|
net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.
x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn
Note that old eBPF programs are binary compatible with new interpreter.
It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)
User space generated programs will use this instruction to load constants only.
To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 13:17:17 +08:00
|
|
|
LD_IMM_DW:
|
|
|
|
DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
|
|
|
|
insn++;
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
ALU64_ARSH_X:
|
|
|
|
(*(s64 *) &DST) >>= SRC;
|
|
|
|
CONT;
|
|
|
|
ALU64_ARSH_K:
|
|
|
|
(*(s64 *) &DST) >>= IMM;
|
|
|
|
CONT;
|
|
|
|
ALU64_MOD_X:
|
|
|
|
if (unlikely(SRC == 0))
|
|
|
|
return 0;
|
2015-04-28 05:40:37 +08:00
|
|
|
div64_u64_rem(DST, SRC, &tmp);
|
|
|
|
DST = tmp;
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_MOD_X:
|
|
|
|
if (unlikely(SRC == 0))
|
|
|
|
return 0;
|
|
|
|
tmp = (u32) DST;
|
|
|
|
DST = do_div(tmp, (u32) SRC);
|
|
|
|
CONT;
|
|
|
|
ALU64_MOD_K:
|
2015-04-28 05:40:37 +08:00
|
|
|
div64_u64_rem(DST, IMM, &tmp);
|
|
|
|
DST = tmp;
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_MOD_K:
|
|
|
|
tmp = (u32) DST;
|
|
|
|
DST = do_div(tmp, (u32) IMM);
|
|
|
|
CONT;
|
|
|
|
ALU64_DIV_X:
|
|
|
|
if (unlikely(SRC == 0))
|
|
|
|
return 0;
|
2015-04-28 05:40:37 +08:00
|
|
|
DST = div64_u64(DST, SRC);
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_DIV_X:
|
|
|
|
if (unlikely(SRC == 0))
|
|
|
|
return 0;
|
|
|
|
tmp = (u32) DST;
|
|
|
|
do_div(tmp, (u32) SRC);
|
|
|
|
DST = (u32) tmp;
|
|
|
|
CONT;
|
|
|
|
ALU64_DIV_K:
|
2015-04-28 05:40:37 +08:00
|
|
|
DST = div64_u64(DST, IMM);
|
2014-07-23 14:01:58 +08:00
|
|
|
CONT;
|
|
|
|
ALU_DIV_K:
|
|
|
|
tmp = (u32) DST;
|
|
|
|
do_div(tmp, (u32) IMM);
|
|
|
|
DST = (u32) tmp;
|
|
|
|
CONT;
|
|
|
|
ALU_END_TO_BE:
|
|
|
|
switch (IMM) {
|
|
|
|
case 16:
|
|
|
|
DST = (__force u16) cpu_to_be16(DST);
|
|
|
|
break;
|
|
|
|
case 32:
|
|
|
|
DST = (__force u32) cpu_to_be32(DST);
|
|
|
|
break;
|
|
|
|
case 64:
|
|
|
|
DST = (__force u64) cpu_to_be64(DST);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
ALU_END_TO_LE:
|
|
|
|
switch (IMM) {
|
|
|
|
case 16:
|
|
|
|
DST = (__force u16) cpu_to_le16(DST);
|
|
|
|
break;
|
|
|
|
case 32:
|
|
|
|
DST = (__force u32) cpu_to_le32(DST);
|
|
|
|
break;
|
|
|
|
case 64:
|
|
|
|
DST = (__force u64) cpu_to_le64(DST);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
/* CALL */
|
|
|
|
JMP_CALL:
|
|
|
|
/* Function call scratches BPF_R1-BPF_R5 registers,
|
|
|
|
* preserves BPF_R6-BPF_R9, and stores return value
|
|
|
|
* into BPF_R0.
|
|
|
|
*/
|
|
|
|
BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
|
|
|
|
BPF_R4, BPF_R5);
|
|
|
|
CONT;
|
|
|
|
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
JMP_TAIL_CALL: {
|
|
|
|
struct bpf_map *map = (struct bpf_map *) (unsigned long) BPF_R2;
|
|
|
|
struct bpf_array *array = container_of(map, struct bpf_array, map);
|
|
|
|
struct bpf_prog *prog;
|
2017-10-04 06:37:20 +08:00
|
|
|
u32 index = BPF_R3;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
|
|
|
|
if (unlikely(index >= array->map.max_entries))
|
|
|
|
goto out;
|
|
|
|
if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
tail_call_cnt++;
|
|
|
|
|
2015-08-06 15:02:33 +08:00
|
|
|
prog = READ_ONCE(array->ptrs[index]);
|
2016-06-28 18:18:23 +08:00
|
|
|
if (!prog)
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
goto out;
|
|
|
|
|
2015-07-14 02:49:32 +08:00
|
|
|
/* ARG1 at this point is guaranteed to point to CTX from
|
|
|
|
* the verifier side due to the fact that the tail call is
|
|
|
|
* handeled like a helper, that is, bpf_tail_call_proto,
|
|
|
|
* where arg1_type is ARG_PTR_TO_CTX.
|
|
|
|
*/
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
insn = prog->insnsi;
|
|
|
|
goto select_insn;
|
|
|
|
out:
|
|
|
|
CONT;
|
|
|
|
}
|
2014-07-23 14:01:58 +08:00
|
|
|
/* JMP */
|
|
|
|
JMP_JA:
|
|
|
|
insn += insn->off;
|
|
|
|
CONT;
|
|
|
|
JMP_JEQ_X:
|
|
|
|
if (DST == SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JEQ_K:
|
|
|
|
if (DST == IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JNE_X:
|
|
|
|
if (DST != SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JNE_K:
|
|
|
|
if (DST != IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JGT_X:
|
|
|
|
if (DST > SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JGT_K:
|
|
|
|
if (DST > IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JLT_X:
|
|
|
|
if (DST < SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JLT_K:
|
|
|
|
if (DST < IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JGE_X:
|
|
|
|
if (DST >= SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JGE_K:
|
|
|
|
if (DST >= IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JLE_X:
|
|
|
|
if (DST <= SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JLE_K:
|
|
|
|
if (DST <= IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JSGT_X:
|
|
|
|
if (((s64) DST) > ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSGT_K:
|
|
|
|
if (((s64) DST) > ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JSLT_X:
|
|
|
|
if (((s64) DST) < ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSLT_K:
|
|
|
|
if (((s64) DST) < ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JSGE_X:
|
|
|
|
if (((s64) DST) >= ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSGE_K:
|
|
|
|
if (((s64) DST) >= ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 07:39:55 +08:00
|
|
|
JMP_JSLE_X:
|
|
|
|
if (((s64) DST) <= ((s64) SRC)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSLE_K:
|
|
|
|
if (((s64) DST) <= ((s64) IMM)) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
2014-07-23 14:01:58 +08:00
|
|
|
JMP_JSET_X:
|
|
|
|
if (DST & SRC) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_JSET_K:
|
|
|
|
if (DST & IMM) {
|
|
|
|
insn += insn->off;
|
|
|
|
CONT_JMP;
|
|
|
|
}
|
|
|
|
CONT;
|
|
|
|
JMP_EXIT:
|
|
|
|
return BPF_R0;
|
|
|
|
|
|
|
|
/* STX and ST and LDX*/
|
|
|
|
#define LDST(SIZEOP, SIZE) \
|
|
|
|
STX_MEM_##SIZEOP: \
|
|
|
|
*(SIZE *)(unsigned long) (DST + insn->off) = SRC; \
|
|
|
|
CONT; \
|
|
|
|
ST_MEM_##SIZEOP: \
|
|
|
|
*(SIZE *)(unsigned long) (DST + insn->off) = IMM; \
|
|
|
|
CONT; \
|
|
|
|
LDX_MEM_##SIZEOP: \
|
|
|
|
DST = *(SIZE *)(unsigned long) (SRC + insn->off); \
|
|
|
|
CONT;
|
|
|
|
|
|
|
|
LDST(B, u8)
|
|
|
|
LDST(H, u16)
|
|
|
|
LDST(W, u32)
|
|
|
|
LDST(DW, u64)
|
|
|
|
#undef LDST
|
|
|
|
STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
|
|
|
|
atomic_add((u32) SRC, (atomic_t *)(unsigned long)
|
|
|
|
(DST + insn->off));
|
|
|
|
CONT;
|
|
|
|
STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
|
|
|
|
atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
|
|
|
|
(DST + insn->off));
|
|
|
|
CONT;
|
|
|
|
LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
|
|
|
|
off = IMM;
|
|
|
|
load_word:
|
2017-04-11 18:10:58 +08:00
|
|
|
/* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are only
|
|
|
|
* appearing in the programs where ctx == skb
|
|
|
|
* (see may_access_skb() in the verifier). All programs
|
|
|
|
* keep 'ctx' in regs[BPF_REG_CTX] == BPF_R6,
|
|
|
|
* bpf_convert_filter() saves it in BPF_R6, internal BPF
|
|
|
|
* verifier will check that BPF_R6 == ctx.
|
2014-07-23 14:01:58 +08:00
|
|
|
*
|
|
|
|
* BPF_ABS and BPF_IND are wrappers of function calls,
|
|
|
|
* so they scratch BPF_R1-BPF_R5 registers, preserve
|
|
|
|
* BPF_R6-BPF_R9, and store return value into BPF_R0.
|
|
|
|
*
|
|
|
|
* Implicit input:
|
|
|
|
* ctx == skb == BPF_R6 == CTX
|
|
|
|
*
|
|
|
|
* Explicit input:
|
|
|
|
* SRC == any register
|
|
|
|
* IMM == 32-bit immediate
|
|
|
|
*
|
|
|
|
* Output:
|
|
|
|
* BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
|
|
|
|
*/
|
|
|
|
|
|
|
|
ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
|
|
|
|
if (likely(ptr != NULL)) {
|
|
|
|
BPF_R0 = get_unaligned_be32(ptr);
|
|
|
|
CONT;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
|
|
|
|
off = IMM;
|
|
|
|
load_half:
|
|
|
|
ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
|
|
|
|
if (likely(ptr != NULL)) {
|
|
|
|
BPF_R0 = get_unaligned_be16(ptr);
|
|
|
|
CONT;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
|
|
|
|
off = IMM;
|
|
|
|
load_byte:
|
|
|
|
ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
|
|
|
|
if (likely(ptr != NULL)) {
|
|
|
|
BPF_R0 = *(u8 *)ptr;
|
|
|
|
CONT;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
|
|
|
|
off = IMM + SRC;
|
|
|
|
goto load_word;
|
|
|
|
LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
|
|
|
|
off = IMM + SRC;
|
|
|
|
goto load_half;
|
|
|
|
LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
|
|
|
|
off = IMM + SRC;
|
|
|
|
goto load_byte;
|
|
|
|
|
|
|
|
default_label:
|
|
|
|
/* If we ever reach this, we have a bug somewhere. */
|
|
|
|
WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
|
|
|
|
return 0;
|
|
|
|
}
|
2017-05-31 04:31:28 +08:00
|
|
|
STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
|
|
|
|
|
2017-05-31 04:31:33 +08:00
|
|
|
#define PROG_NAME(stack_size) __bpf_prog_run##stack_size
|
|
|
|
#define DEFINE_BPF_PROG_RUN(stack_size) \
|
|
|
|
static unsigned int PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \
|
|
|
|
{ \
|
|
|
|
u64 stack[stack_size / sizeof(u64)]; \
|
|
|
|
u64 regs[MAX_BPF_REG]; \
|
|
|
|
\
|
|
|
|
FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \
|
|
|
|
ARG1 = (u64) (unsigned long) ctx; \
|
|
|
|
return ___bpf_prog_run(regs, insn, stack); \
|
2017-05-31 04:31:28 +08:00
|
|
|
}
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2017-05-31 04:31:33 +08:00
|
|
|
#define EVAL1(FN, X) FN(X)
|
|
|
|
#define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
|
|
|
|
#define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
|
|
|
|
#define EVAL4(FN, X, Y...) FN(X) EVAL3(FN, Y)
|
|
|
|
#define EVAL5(FN, X, Y...) FN(X) EVAL4(FN, Y)
|
|
|
|
#define EVAL6(FN, X, Y...) FN(X) EVAL5(FN, Y)
|
|
|
|
|
|
|
|
EVAL6(DEFINE_BPF_PROG_RUN, 32, 64, 96, 128, 160, 192);
|
|
|
|
EVAL6(DEFINE_BPF_PROG_RUN, 224, 256, 288, 320, 352, 384);
|
|
|
|
EVAL4(DEFINE_BPF_PROG_RUN, 416, 448, 480, 512);
|
|
|
|
|
|
|
|
#define PROG_NAME_LIST(stack_size) PROG_NAME(stack_size),
|
|
|
|
|
|
|
|
static unsigned int (*interpreters[])(const void *ctx,
|
|
|
|
const struct bpf_insn *insn) = {
|
|
|
|
EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
|
|
|
|
EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
|
|
|
|
EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
|
|
|
|
};
|
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
bool bpf_prog_array_compatible(struct bpf_array *array,
|
|
|
|
const struct bpf_prog *fp)
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
{
|
2015-05-30 05:23:07 +08:00
|
|
|
if (!array->owner_prog_type) {
|
|
|
|
/* There's no owner yet where we could check for
|
|
|
|
* compatibility.
|
|
|
|
*/
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
array->owner_prog_type = fp->type;
|
|
|
|
array->owner_jited = fp->jited;
|
2015-05-30 05:23:07 +08:00
|
|
|
|
|
|
|
return true;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
}
|
2015-05-30 05:23:07 +08:00
|
|
|
|
|
|
|
return array->owner_prog_type == fp->type &&
|
|
|
|
array->owner_jited == fp->jited;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
}
|
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
static int bpf_check_tail_call(const struct bpf_prog *fp)
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
{
|
|
|
|
struct bpf_prog_aux *aux = fp->aux;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < aux->used_map_cnt; i++) {
|
2015-05-30 05:23:07 +08:00
|
|
|
struct bpf_map *map = aux->used_maps[i];
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
struct bpf_array *array;
|
|
|
|
|
|
|
|
if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
|
|
|
|
continue;
|
2015-05-30 05:23:07 +08:00
|
|
|
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
array = container_of(map, struct bpf_array, map);
|
|
|
|
if (!bpf_prog_array_compatible(array, fp))
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-07-23 14:01:58 +08:00
|
|
|
/**
|
2015-05-30 05:23:07 +08:00
|
|
|
* bpf_prog_select_runtime - select exec runtime for BPF program
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
* @fp: bpf_prog populated with internal BPF program
|
2016-05-14 01:08:31 +08:00
|
|
|
* @err: pointer to error variable
|
2014-07-23 14:01:58 +08:00
|
|
|
*
|
2015-05-30 05:23:07 +08:00
|
|
|
* Try to JIT eBPF program, if JIT is not available, use interpreter.
|
|
|
|
* The BPF program will be executed via BPF_PROG_RUN() macro.
|
2014-07-23 14:01:58 +08:00
|
|
|
*/
|
2016-05-14 01:08:31 +08:00
|
|
|
struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
|
2014-07-23 14:01:58 +08:00
|
|
|
{
|
2017-06-29 01:41:24 +08:00
|
|
|
u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
|
|
|
|
|
|
|
|
fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2016-05-14 01:08:31 +08:00
|
|
|
/* eBPF JITs can rewrite the program in case constant
|
|
|
|
* blinding is active. However, in case of error during
|
|
|
|
* blinding, bpf_int_jit_compile() must always return a
|
|
|
|
* valid program, which in this case would simply not
|
|
|
|
* be JITed, but falls back to the interpreter.
|
|
|
|
*/
|
|
|
|
fp = bpf_int_jit_compile(fp);
|
2014-09-03 04:53:44 +08:00
|
|
|
bpf_prog_lock_ro(fp);
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 07:59:03 +08:00
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
/* The tail call compatibility check can only be done at
|
|
|
|
* this late stage as we need to determine, if we deal
|
|
|
|
* with JITed or non JITed program concatenations and not
|
|
|
|
* all eBPF JITs might immediately support all features.
|
|
|
|
*/
|
2016-05-14 01:08:31 +08:00
|
|
|
*err = bpf_check_tail_call(fp);
|
|
|
|
|
|
|
|
return fp;
|
2014-07-23 14:01:58 +08:00
|
|
|
}
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
|
2014-07-23 14:01:58 +08:00
|
|
|
|
2014-09-03 04:53:44 +08:00
|
|
|
static void bpf_prog_free_deferred(struct work_struct *work)
|
|
|
|
{
|
2014-09-26 15:17:00 +08:00
|
|
|
struct bpf_prog_aux *aux;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
aux = container_of(work, struct bpf_prog_aux, work);
|
|
|
|
bpf_jit_free(aux->prog);
|
2014-09-03 04:53:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Free internal BPF program */
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
void bpf_prog_free(struct bpf_prog *fp)
|
2014-07-23 14:01:58 +08:00
|
|
|
{
|
2014-09-26 15:17:00 +08:00
|
|
|
struct bpf_prog_aux *aux = fp->aux;
|
2014-09-03 04:53:44 +08:00
|
|
|
|
2014-09-26 15:17:00 +08:00
|
|
|
INIT_WORK(&aux->work, bpf_prog_free_deferred);
|
|
|
|
schedule_work(&aux->work);
|
2014-07-23 14:01:58 +08:00
|
|
|
}
|
net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 11:34:16 +08:00
|
|
|
EXPORT_SYMBOL_GPL(bpf_prog_free);
|
2014-10-24 09:41:08 +08:00
|
|
|
|
bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 07:20:39 +08:00
|
|
|
/* RNG for unpriviledged user space with separated state from prandom_u32(). */
|
|
|
|
static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);
|
|
|
|
|
|
|
|
void bpf_user_rnd_init_once(void)
|
|
|
|
{
|
|
|
|
prandom_init_once(&bpf_user_rnd_state);
|
|
|
|
}
|
|
|
|
|
bpf: add BPF_CALL_x macros for declaring helpers
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.
This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.
BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.
I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 08:45:31 +08:00
|
|
|
BPF_CALL_0(bpf_user_rnd_u32)
|
bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 07:20:39 +08:00
|
|
|
{
|
|
|
|
/* Should someone ever have the rather unwise idea to use some
|
|
|
|
* of the registers passed into this function, then note that
|
|
|
|
* this function is called from native eBPF and classic-to-eBPF
|
|
|
|
* transformations. Register assignments from both sides are
|
|
|
|
* different, f.e. classic always sets fn(ctx, A, X) here.
|
|
|
|
*/
|
|
|
|
struct rnd_state *state;
|
|
|
|
u32 res;
|
|
|
|
|
|
|
|
state = &get_cpu_var(bpf_user_rnd_state);
|
|
|
|
res = prandom_u32_state(state);
|
2016-09-27 23:42:41 +08:00
|
|
|
put_cpu_var(bpf_user_rnd_state);
|
bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 07:20:39 +08:00
|
|
|
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2015-03-06 06:27:51 +08:00
|
|
|
/* Weak definitions of helper functions in case we don't have bpf syscall. */
|
|
|
|
const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_map_update_elem_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
|
|
|
|
|
2015-03-14 09:27:16 +08:00
|
|
|
const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
|
2015-03-14 09:27:17 +08:00
|
|
|
const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
|
2016-10-21 18:46:33 +08:00
|
|
|
const struct bpf_func_proto bpf_get_numa_node_id_proto __weak;
|
2015-05-30 05:23:06 +08:00
|
|
|
const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
|
2015-06-13 10:39:12 +08:00
|
|
|
const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
|
|
|
|
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
|
2017-08-17 06:02:32 +08:00
|
|
|
const struct bpf_func_proto bpf_sock_map_update_proto __weak;
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
|
2015-06-13 10:39:13 +08:00
|
|
|
const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
2015-03-14 09:27:16 +08:00
|
|
|
|
2016-07-15 00:08:05 +08:00
|
|
|
u64 __weak
|
|
|
|
bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
|
|
|
|
void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
{
|
2016-07-15 00:08:05 +08:00
|
|
|
return -ENOTSUPP;
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 03:01:24 +08:00
|
|
|
}
|
|
|
|
|
2015-05-30 05:23:07 +08:00
|
|
|
/* Always built-in helper functions. */
|
|
|
|
const struct bpf_func_proto bpf_tail_call_proto = {
|
|
|
|
.func = NULL,
|
|
|
|
.gpl_only = false,
|
|
|
|
.ret_type = RET_VOID,
|
|
|
|
.arg1_type = ARG_PTR_TO_CTX,
|
|
|
|
.arg2_type = ARG_CONST_MAP_PTR,
|
|
|
|
.arg3_type = ARG_ANYTHING,
|
|
|
|
};
|
|
|
|
|
2017-02-17 05:24:49 +08:00
|
|
|
/* Stub for JITs that only support cBPF. eBPF programs are interpreted.
|
|
|
|
* It is encouraged to implement bpf_int_jit_compile() instead, so that
|
|
|
|
* eBPF and implicitly also cBPF can get JITed!
|
|
|
|
*/
|
2016-05-14 01:08:31 +08:00
|
|
|
struct bpf_prog * __weak bpf_int_jit_compile(struct bpf_prog *prog)
|
2015-05-30 05:23:07 +08:00
|
|
|
{
|
2016-05-14 01:08:31 +08:00
|
|
|
return prog;
|
2015-05-30 05:23:07 +08:00
|
|
|
}
|
|
|
|
|
2017-02-17 05:24:49 +08:00
|
|
|
/* Stub for JITs that support eBPF. All cBPF code gets transformed into
|
|
|
|
* eBPF by the kernel and is later compiled by bpf_int_jit_compile().
|
|
|
|
*/
|
|
|
|
void __weak bpf_jit_compile(struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2016-12-08 07:53:11 +08:00
|
|
|
bool __weak bpf_helper_changes_pkt_data(void *func)
|
2016-05-06 10:49:10 +08:00
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2014-10-24 09:41:08 +08:00
|
|
|
/* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
|
|
|
|
* skb_copy_bits(), so provide a weak definition of it for NET-less config.
|
|
|
|
*/
|
|
|
|
int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
|
|
|
|
int len)
|
|
|
|
{
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.
For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.
Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:
# ./perf record -a -e bpf:* sleep 10
# ./perf script
sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
[1] https://lwn.net/Articles/705270/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 09:28:18 +08:00
|
|
|
|
|
|
|
/* All definitions of tracepoints related to BPF. */
|
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <linux/bpf_trace.h>
|
|
|
|
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
|
|
|
|
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(bpf_prog_get_type);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(bpf_prog_put_rcu);
|