linux-sg2042/kernel
Daniel Borkmann f6b1b3bf0d bpf: fix subprog verifier bypass by div/mod by 0 exception
One of the ugly leftovers from the early eBPF days is that div/mod
operations based on registers have a hard-coded src_reg == 0 test
in the interpreter as well as in JIT code generators that would
return from the BPF program with exit code 0. This was basically
adopted from cBPF interpreter for historical reasons.

There are multiple reasons why this is very suboptimal and prone
to bugs. To name one: the return code mapping for such abnormal
program exit of 0 does not always match with a suitable program
type's exit code mapping. For example, '0' in tc means action 'ok'
where the packet gets passed further up the stack, which is just
undesirable for such cases (e.g. when implementing policy) and
also does not match with other program types.

While trying to work out an exception handling scheme, I also
noticed that programs crafted like the following will currently
pass the verifier:

  0: (bf) r6 = r1
  1: (85) call pc+8
  caller:
   R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  callee:
   frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
  10: (b4) (u32) r2 = (u32) 0
  11: (b4) (u32) r3 = (u32) 1
  12: (3c) (u32) r3 /= (u32) r2
  13: (61) r0 = *(u32 *)(r1 +76)
  14: (95) exit
  returning from callee:
   frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
           R1=ctx(id=0,off=0,imm=0) R2_w=inv0
           R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
           R10=fp0,call_1
  to caller at 2:
   R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
   R10=fp0,call_-1

  from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  2: (bf) r1 = r6
  3: (61) r1 = *(u32 *)(r1 +80)
  4: (bf) r2 = r0
  5: (07) r2 += 8
  6: (2d) if r2 > r1 goto pc+1
   R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
   R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
   R10=fp0,call_-1
  7: (71) r0 = *(u8 *)(r0 +0)
  8: (b7) r0 = 1
  9: (95) exit

  from 6 to 8: safe
  processed 16 insns (limit 131072), stack depth 0+0

Basically what happens is that in the subprog we make use of a
div/mod by 0 exception and in the 'normal' subprog's exit path
we just return skb->data back to the main prog. This has the
implication that the verifier thinks we always get a pkt pointer
in R0 while we still have the implicit 'return 0' from the div
as an alternative unconditional return path earlier. Thus, R0
then contains 0, meaning back in the parent prog we get the
address range of [0x0, skb->data_end] as read and writeable.
Similar can be crafted with other pointer register types.

Since i) BPF_ABS/IND is not allowed in programs that contain
BPF to BPF calls (and generally it's also disadvised to use in
native eBPF context), ii) unknown opcodes don't return zero
anymore, iii) we don't return an exception code in dead branches,
the only last missing case affected and to fix is the div/mod
handling.

What we would really need is some infrastructure to propagate
exceptions all the way to the original prog unwinding the
current stack and returning that code to the caller of the
BPF program. In user space such exception handling for similar
runtimes is typically implemented with setjmp(3) and longjmp(3)
as one possibility which is not available in the kernel,
though (kgdb used to implement it in kernel long time ago). I
implemented a PoC exception handling mechanism into the BPF
interpreter with porting setjmp()/longjmp() into x86_64 and
adding a new internal BPF_ABRT opcode that can use a program
specific exception code for all exception cases we have (e.g.
div/mod by 0, unknown opcodes, etc). While this seems to work
in the constrained BPF environment (meaning, here, we don't
need to deal with state e.g. from memory allocations that we
would need to undo before going into exception state), it still
has various drawbacks: i) we would need to implement the
setjmp()/longjmp() for every arch supported in the kernel and
for x86_64, arm64, sparc64 JITs currently supporting calls,
ii) it has unconditional additional cost on main program
entry to store CPU register state in initial setjmp() call,
and we would need some way to pass the jmp_buf down into
___bpf_prog_run() for main prog and all subprogs, but also
storing on stack is not really nice (other option would be
per-cpu storage for this, but it also has the drawback that
we need to disable preemption for every BPF program types).
All in all this approach would add a lot of complexity.

Another poor-man's solution would be to have some sort of
additional shared register or scratch buffer to hold state
for exceptions, and test that after every call return to
chain returns and pass R0 all the way down to BPF prog caller.
This is also problematic in various ways: i) an additional
register doesn't map well into JITs, and some other scratch
space could only be on per-cpu storage, which, again has the
side-effect that this only works when we disable preemption,
or somewhere in the input context which is not available
everywhere either, and ii) this adds significant runtime
overhead by putting conditionals after each and every call,
as well as implementation complexity.

Yet another option is to teach verifier that div/mod can
return an integer, which however is also complex to implement
as verifier would need to walk such fake 'mov r0,<code>; exit;'
sequeuence and there would still be no guarantee for having
propagation of this further down to the BPF caller as proper
exception code. For parent prog, it is also is not distinguishable
from a normal return of a constant scalar value.

The approach taken here is a completely different one with
little complexity and no additional overhead involved in
that we make use of the fact that a div/mod by 0 is undefined
behavior. Instead of bailing out, we adapt the same behavior
as on some major archs like ARMv8 [0] into eBPF as well:
X div 0 results in 0, and X mod 0 results in X. aarch64 and
aarch32 ISA do not generate any traps or otherwise aborts
of program execution for unsigned divides. I verified this
also with a test program compiled by gcc and clang, and the
behavior matches with the spec. Going forward we adapt the
eBPF verifier to emit such rewrites once div/mod by register
was seen. cBPF is not touched and will keep existing 'return 0'
semantics. Given the options, it seems the most suitable from
all of them, also since major archs have similar schemes in
place. Given this is all in the realm of undefined behavior,
we still have the option to adapt if deemed necessary and
this way we would also have the option of more flexibility
from LLVM code generation side (which is then fully visible
to verifier). Thus, this patch i) fixes the panic seen in
above program and ii) doesn't bypass the verifier observations.

  [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
      http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
      1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
         "A division by zero results in a zero being written to
          the destination register, without any indication that
          the division by zero occurred."
      2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
         "For the SDIV and UDIV instructions, division by zero
          always returns a zero result."

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-26 16:42:05 -08:00
..
bpf bpf: fix subprog verifier bypass by div/mod by 0 exception 2018-01-26 16:42:05 -08:00
cgroup cgroup: make cgroup.threads delegatable 2018-01-10 09:42:32 -08:00
configs ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES. 2017-08-22 18:43:23 -07:00
debug kdb: Fix handling of kallsyms_symbol_next() return value 2017-12-06 16:12:43 -06:00
events Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2017-12-18 10:51:06 -05:00
gcov License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
irq genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI 2017-12-29 21:13:05 +01:00
livepatch Merge branch 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching 2017-11-15 10:21:58 -08:00
locking futex: Avoid violating the 10th rule of futex 2018-01-14 18:49:16 +01:00
power Revert "cpuset: Make cpuset hotplug synchronous" 2017-12-04 14:41:11 -08:00
printk remove task and stack pointer printout from oops dump 2017-12-05 08:23:20 -08:00
rcu Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-11-13 17:56:58 -08:00
sched delayacct: Account blkio completion on the correct task 2018-01-16 03:29:36 +01:00
time timers: Unconditionally check deferrable base 2018-01-14 23:25:33 +01:00
trace Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2018-01-20 22:03:46 -05:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
Makefile error-injection: Support fault injection framework 2018-01-12 17:33:38 -08:00
acct.c kernel/acct.c: fix the acct->needcheck check in check_free_space() 2018-01-04 16:45:09 -08:00
async.c async: Adjust system_state checks 2017-05-23 10:01:37 +02:00
audit.c Audit: remove unused audit_log_secctx function 2017-11-10 16:08:47 -05:00
audit.h audit/stable-4.15 PR 20171113 2017-11-15 13:28:48 -08:00
audit_fsnotify.c Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2017-05-03 11:05:15 -07:00
audit_tree.c Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2017-11-14 14:08:20 -08:00
audit_watch.c audit/stable-4.13 PR 20170816 2017-08-16 16:48:34 -07:00
auditfilter.c audit: filter PATH records keyed on filesystem magic 2017-11-10 16:08:56 -05:00
auditsc.c audit/stable-4.15 PR 20171113 2017-11-15 13:28:48 -08:00
backtracetest.c
bounds.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
capability.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compat.c sched_rr_get_interval(): move compat to native, get rid of set_fs() 2017-09-20 00:30:57 -04:00
configs.c
context_tracking.c
cpu.c Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-12-31 12:30:34 -08:00
cpu_pm.c PM / CPU: replace raw_notifier with atomic_notifier 2017-07-31 13:09:49 +02:00
crash_core.c kdump: write correct address of mem_section into vmcoreinfo 2018-01-13 10:42:48 -08:00
crash_dump.c
cred.c doc: ReSTify credentials.txt 2017-05-18 10:30:19 -06:00
delayacct.c delayacct: Account blkio completion on the correct task 2018-01-16 03:29:36 +01:00
dma.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
elfcore.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
exec_domain.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
exit.c kernel/exit.c: export abort() to modules 2018-01-04 16:45:09 -08:00
extable.c kprobes, x86/alternatives: Use text_mutex to protect smp_alt_modules 2017-11-07 12:20:09 +01:00
fail_function.c error-injection: Support fault injection framework 2018-01-12 17:33:38 -08:00
fork.c Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-12-23 11:53:04 -08:00
freezer.c
futex.c futex: Prevent overflow by strengthen input validation 2018-01-14 18:55:03 +01:00
futex_compat.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
groups.c kernel: make groups_sort calling a responsibility group_info allocators 2017-12-14 16:00:49 -08:00
hung_task.c kernel/hung_task.c: defer showing held locks 2017-05-08 17:15:10 -07:00
irq_work.c Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-11-13 17:33:11 -08:00
jump_label.c jump_label: Invoke jump_label_test() via early_initcall() 2017-11-14 08:41:41 +01:00
kallsyms.c kallsyms: take advantage of the new '%px' format 2017-11-29 10:30:13 -08:00
kcmp.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
kcov.c kcov: fix comparison callback signature 2017-12-14 16:00:48 -08:00
kexec.c kdump: protect vmcoreinfo data under the crash memory 2017-07-12 16:26:00 -07:00
kexec_core.c x86/mm, kexec: Allow kexec to be used with SME 2017-07-18 11:38:04 +02:00
kexec_file.c resource: Provide resource struct in resource walk callback 2017-11-07 15:35:57 +01:00
kexec_internal.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
kmod.c kmod: move #ifdef CONFIG_MODULES wrapper to Makefile 2017-09-08 18:26:51 -07:00
kprobes.c error-injection: Separate error-injection from kprobe 2018-01-12 17:33:38 -08:00
ksysfs.c kexec: move vmcoreinfo out of the kernel's .bss section 2017-07-12 16:25:59 -07:00
kthread.c treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts 2017-11-21 16:35:54 -08:00
latencytop.c
memremap.c memremap: add scheduling point to devm_memremap_pages 2017-10-03 17:54:25 -07:00
module-internal.h
module.c error-injection: Separate error-injection from kprobe 2018-01-12 17:33:38 -08:00
module_signing.c
notifier.c
nsproxy.c
padata.c treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
panic.c kernel/panic.c: add TAINT_AUX 2017-11-17 16:10:04 -08:00
params.c kernel/params.c: improve STANDARD_PARAM_DEF readability 2017-10-03 17:54:26 -07:00
pid.c pid: Handle failure to allocate the first pid in a pid namespace 2017-12-23 21:00:09 -06:00
pid_namespace.c pid: remove pidhash 2017-11-17 16:10:04 -08:00
profile.c
ptrace.c signal: Remove kernel interal si_code magic 2017-07-24 14:30:28 -05:00
range.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
reboot.c kernel/reboot.c: add devm_register_reboot_notifier() 2017-11-17 16:10:04 -08:00
relay.c Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-05-02 11:38:06 -07:00
resource.c x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages 2017-11-07 15:35:58 +01:00
seccomp.c locking/barriers: Convert users of lockless_dereference() to READ_ONCE() 2017-12-17 13:57:15 +01:00
signal.c Merge branch 'akpm' (patches from Andrew) 2017-11-17 16:56:17 -08:00
smp.c smp/core: Use lockdep to assert IRQs are disabled/enabled 2017-11-08 11:13:50 +01:00
smpboot.c watchdog/core, powerpc: Lock cpus across reconfiguration 2017-10-04 10:53:54 +02:00
smpboot.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
softirq.c kmemcheck: rip it out 2017-11-15 18:21:05 -08:00
stacktrace.c
stop_machine.c stop_machine: Provide stop_machine_cpuslocked() 2017-05-26 10:10:36 +02:00
sys.c arm64 updates for 4.15 2017-11-15 10:56:56 -08:00
sys_ni.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
sysctl.c kernel/sysctl.c: code cleanups 2017-11-17 16:10:03 -08:00
sysctl_binary.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
task_work.c locking/barriers: Convert users of lockless_dereference() to READ_ONCE() 2017-12-17 13:57:15 +01:00
taskstats.c taskstats: add e/u/stime for TGID command 2017-05-08 17:15:12 -07:00
test_kprobes.c kprobes: Disable the jprobes test code 2017-10-20 11:02:54 +02:00
torture.c torture: Fix typo suppressing CPU-hotplug statistics 2017-07-25 13:04:45 -07:00
tracepoint.c
tsacct.c
ucount.c
uid16.c kernel: make groups_sort calling a responsibility group_info allocators 2017-12-14 16:00:49 -08:00
umh.c kernel/umh.c: optimize 'proc_cap_handler()' 2017-11-17 16:10:01 -08:00
up.c smp: Avoid using two cache lines for struct call_single_data 2017-08-29 15:14:38 +02:00
user-return-notifier.c
user.c userns: use union in {g,u}idmap struct 2017-10-31 17:22:58 -05:00
user_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2017-11-16 12:20:15 -08:00
utsname.c
utsname_sysctl.c
watchdog.c Merge branch 'linus' into sched/core, to pick up fixes 2017-11-08 10:17:15 +01:00
watchdog_hld.c Merge branch 'linus' into core/urgent, to pick up dependent commits 2017-11-04 08:53:04 +01:00
workqueue.c workqueue: avoid hard lockups in show_workqueue_state() 2018-01-12 11:39:49 -08:00
workqueue_internal.h Merge branch 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2017-11-06 12:26:49 -08:00