In the current code, the actual max tail call count is 33 which is greater
than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
We can see the historical evolution from commit 04fd61ab36 ("bpf: allow
bpf programs to tail-call other bpf programs") and commit f9dabe016b
("bpf: Undo off-by-one in interpreter tail call count limit"). In order
to avoid changing existing behavior, the actual limit is 33 now, this is
reasonable.
After commit 874be05f52 ("bpf, tests: Add tail call test suite"), we can
see there exists failed testcase.
On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
On some archs:
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
Although the above failed testcase has been fixed in commit 18935a72eb
("bpf/tests: Fix error in tail call limit tests"), it would still be good
to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
more readable.
The 32-bit x86 JIT was using a limit of 32, just fix the wrong comments and
limit to 33 tail calls as the constant MAX_TAIL_CALL_CNT updated. For the
mips64 JIT, use "ori" instead of "addiu" as suggested by Johan Almbladh.
For the riscv JIT, use RV_REG_TCC directly to save one register move as
suggested by Björn Töpel. For the other implementations, no function changes,
it does not change the current limit 33, the new value of MAX_TAIL_CALL_CNT
can reflect the actual max tail call count, the related tail call testcases
in test_bpf module and selftests can work well for the interpreter and the
JIT.
Here are the test results on x86_64:
# uname -m
x86_64
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
# rmmod test_bpf
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
# rmmod test_bpf
# ./test_progs -t tailcalls
#142 tailcalls:OK
Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
Current BPF codegen doesn't respect X86_FEATURE_RETPOLINE* flags and
unconditionally emits a thunk call, this is sub-optimal and doesn't
match the regular, compiler generated, code.
Update the i386 JIT to emit code equal to what the compiler emits for
the regular kernel text (IOW. a plain THUNK call).
Update the x86_64 JIT to emit code similar to the result of compiler
and kernel rewrites as according to X86_FEATURE_RETPOLINE* flags.
Inlining RETPOLINE_AMD (lfence; jmp *%reg) and !RETPOLINE (jmp *%reg),
while doing a THUNK call for RETPOLINE.
This removes the hard-coded retpoline thunks and shrinks the generated
code. Leaving a single retpoline thunk definition in the kernel.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.614772675@infradead.org
In case of JITs, each of the JIT backends compiles the BPF nospec instruction
/either/ to a machine instruction which emits a speculation barrier /or/ to
/no/ machine instruction in case the underlying architecture is not affected
by Speculative Store Bypass or has different mitigations in place already.
This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
instruction for mitigation. In case of arm64, we rely on the firmware mitigation
as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
it works for all of the kernel code with no need to provide any additional
instructions here (hence only comment in arm64 JIT). Other archs can follow
as needed. The BPF nospec instruction is specifically targeting Spectre v4
since i) we don't use a serialization barrier for the Spectre v1 case, and
ii) mitigation instructions for v1 and v4 might be different on some archs.
The BPF nospec is required for a future commit, where the BPF verifier does
annotate intermediate BPF programs with speculation barriers.
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The branch displacement logic in the BPF JIT compilers for x86 assumes
that, for any generated branch instruction, the distance cannot
increase between optimization passes.
But this assumption can be violated due to how the distances are
computed. Specifically, whenever a backward branch is processed in
do_jit(), the distance is computed by subtracting the positions in the
machine code from different optimization passes. This is because part
of addrs[] is already updated for the current optimization pass, before
the branch instruction is visited.
And so the optimizer can expand blocks of machine code in some cases.
This can confuse the optimizer logic, where it assumes that a fixed
point has been reached for all machine code blocks once the total
program size stops changing. And then the JIT compiler can output
abnormal machine code containing incorrect branch displacements.
To mitigate this issue, we assert that a fixed point is reached while
populating the output image. This rejects any problematic programs.
The issue affects both x86-32 and x86-64. We mitigate separately to
ease backporting.
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
A subsequent patch will add additional atomic operations. These new
operations will use the same opcode field as the existing XADD, with
the immediate discriminating different operations.
In preparation, rename the instruction mode BPF_ATOMIC and start
calling the zero immediate BPF_ADD.
This is possible (doesn't break existing valid BPF progs) because the
immediate field is currently reserved MBZ and BPF_ADD is zero.
All uses are removed from the tree but the BPF_XADD definition is
kept around to avoid breaking builds for people including kernel
headers.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-5-jackmanb@google.com
The '==' expression itself is bool, no need to convert it to bool again.
This fixes the following coccicheck warning:
arch/x86/net/bpf_jit_comp32.c:1478:50-55: WARNING: conversion to bool not needed here
arch/x86/net/bpf_jit_comp32.c:1479:50-55: WARNING: conversion to bool not needed here
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200506140352.37154-1-yanaijie@huawei.com
When verifier_zext is true, we don't need to emit code
for zero-extension.
Fixes: 836256bf5f ("x32: bpf: eliminate zero extension code-gen")
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200423050637.GA4029@udknight
The current JIT clobbers the destination register for BPF_JSET BPF_X
and BPF_K by using "and" and "or" instructions. This is fine when the
destination register is a temporary loaded from a register stored on
the stack but not otherwise.
This patch fixes the problem (for both BPF_K and BPF_X) by always loading
the destination register into temporaries since BPF_JSET should not
modify the destination register.
This bug may not be currently triggerable as BPF_REG_AX is the only
register not stored on the stack and the verifier uses it in a limited
way.
Fixes: 03f5781be2 ("bpf, x86_32: add eBPF JIT compiler for ia32")
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Wang YanQing <udknight@gmail.com>
Link: https://lore.kernel.org/bpf/20200422173630.8351-2-luke.r.nels@gmail.com
The current JIT uses the following sequence to zero-extend into the
upper 32 bits of the destination register for BPF_LDX BPF_{B,H,W},
when the destination register is not on the stack:
EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0);
The problem is that C7 /0 encodes a MOV instruction that requires a 4-byte
immediate; the current code emits only 1 byte of the immediate. This
means that the first 3 bytes of the next instruction will be treated as
the rest of the immediate, breaking the stream of instructions.
This patch fixes the problem by instead emitting "xor dst_hi,dst_hi"
to clear the upper 32 bits. This fixes the problem and is more efficient
than using MOV to load a zero immediate.
This bug may not be currently triggerable as BPF_REG_AX is the only
register not stored on the stack and the verifier uses it in a limited
way, and the verifier implements a zero-extension optimization. But the
JIT should avoid emitting incorrect encodings regardless.
Fixes: 03f5781be2 ("bpf, x86_32: add eBPF JIT compiler for ia32")
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Acked-by: Wang YanQing <udknight@gmail.com>
Link: https://lore.kernel.org/bpf/20200422173630.8351-1-luke.r.nels@gmail.com
The current x32 BPF JIT is incorrect for JMP32 JSET BPF_X when the upper
32 bits of operand registers are non-zero in certain situations.
The problem is in the following code:
case BPF_JMP | BPF_JSET | BPF_X:
case BPF_JMP32 | BPF_JSET | BPF_X:
...
/* and dreg_lo,sreg_lo */
EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo));
/* and dreg_hi,sreg_hi */
EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi));
/* or dreg_lo,dreg_hi */
EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi));
This code checks the upper bits of the operand registers regardless if
the BPF instruction is BPF_JMP32 or BPF_JMP64. Registers dreg_hi and
dreg_lo are not loaded from the stack for BPF_JMP32, however, they can
still be polluted with values from previous instructions.
The following BPF program demonstrates the bug. The jset64 instruction
loads the temporary registers and performs the jump, since ((u64)r7 &
(u64)r8) is non-zero. The jset32 should _not_ be taken, as the lower
32 bits are all zero, however, the current JIT will take the branch due
the pollution of temporary registers from the earlier jset64.
mov64 r0, 0
ld64 r7, 0x8000000000000000
ld64 r8, 0x8000000000000000
jset64 r7, r8, 1
exit
jset32 r7, r8, 1
mov64 r0, 2
exit
The expected return value of this program is 2; under the buggy x32 JIT
it returns 0. The fix is to skip using the upper 32 bits for jset32 and
compare the upper 32 bits for jset64 only.
All tests in test_bpf.ko and selftests/bpf/test_verifier continue to
pass with this change.
We found this bug using our automated verification tool, Serval.
Fixes: 69f827eb6e ("x32: bpf: implement jitting of JMP32")
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200305234416.31597-1-luke.r.nels@gmail.com
The current x32 BPF JIT does not correctly compile shift operations when
the immediate shift amount is 0. The expected behavior is for this to
be a no-op.
The following program demonstrates the bug. The expexceted result is 1,
but the current JITed code returns 2.
r0 = 1
r1 = 1
r1 <<= 0
if r1 == 1 goto end
r0 = 2
end:
exit
This patch simplifies the code and fixes the bug.
Fixes: 03f5781be2 ("bpf, x86_32: add eBPF JIT compiler for ia32")
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The current x32 BPF JIT for shift operations is not correct when the
shift amount in a register is 0. The expected behavior is a no-op, whereas
the current implementation changes bits in the destination register.
The following example demonstrates the bug. The expected result of this
program is 1, but the current JITed code returns 2.
r0 = 1
r1 = 1
r2 = 0
r1 <<= r2
if r1 == 1 goto end
r0 = 2
end:
exit
The bug is caused by an incorrect assumption by the JIT that a shift by
32 clear the register. On x32 however, shifts use the lower 5 bits of
the source, making a shift by 32 equivalent to a shift by 0.
This patch fixes the bug using double-precision shifts, which also
simplifies the code.
Fixes: 03f5781be2 ("bpf, x86_32: add eBPF JIT compiler for ia32")
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The current method to compare 64-bit numbers for conditional jump is:
1) Compare the high 32-bit first.
2) If the high 32-bit isn't the same, then goto step 4.
3) Compare the low 32-bit.
4) Check the desired condition.
This method is right for unsigned comparison, but it is buggy for signed
comparison, because it does signed comparison for low 32-bit too.
There is only one sign bit in 64-bit number, that is the MSB in the 64-bit
number, it is wrong to treat low 32-bit as signed number and do the signed
comparison for it.
This patch fixes the bug and adds a testcase in selftests/bpf for such bug.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch implements code-gen for new JMP32 instructions on x32.
Also fixed several reverse xmas tree coding style issues as I am there.
Cc: Wang YanQing <udknight@gmail.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 24dea04767 ("bpf, x32: remove ld_abs/ld_ind")
removed the 4 /* Extra space for skb_copy_bits buffer */
from _STACK_SIZE, but it didn't fix the concerned code
in emit_prologue and emit_epilogue, and this error will
bring very strange kernel runtime errors. This patch
fixes it.
Fixes: 24dea04767 ("bpf, x32: remove ld_abs/ld_ind")
Reported-by: Meelis Roos <mroos@linux.ee>
Bisected-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from x32 JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The JIT compiler emits ia32 bit instructions. Currently, It supports eBPF
only. Classic BPF is supported because of the conversion by BPF core.
Almost all instructions from eBPF ISA supported except the following:
BPF_ALU64 | BPF_DIV | BPF_K
BPF_ALU64 | BPF_DIV | BPF_X
BPF_ALU64 | BPF_MOD | BPF_K
BPF_ALU64 | BPF_MOD | BPF_X
BPF_STX | BPF_XADD | BPF_W
BPF_STX | BPF_XADD | BPF_DW
It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
IA32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI. I use
EAX|EDX|ECX|EBX as temporary registers to simulate instructions in eBPF
ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding, all others
eBPF registers, R0-R10, are simulated through scratch space on stack.
The reasons behind the hardware registers allocation policy are:
1:MUL need EAX:EDX, shift operation need ECX, so they aren't fit
for general eBPF 64bit register simulation.
2:We need at least 4 registers to simulate most eBPF ISA operations
on registers operands instead of on register&memory operands.
3:We need to put BPF_REG_AX on hardware registers, or constant blinding
will degrade jit performance heavily.
Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
Testing results on i5-5200U:
1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
2) test_progs: Summary: 83 PASSED, 0 FAILED.
3) test_lpm: OK
4) test_lru_map: OK
5) test_verifier: Summary: 828 PASSED, 0 FAILED.
Above tests are all done in following two conditions separately:
1:bpf_jit_enable=1 and bpf_jit_harden=0
2:bpf_jit_enable=1 and bpf_jit_harden=2
Below are some numbers for this jit implementation:
Note:
I run test_progs in kselftest 100 times continuously for every condition,
the numbers are in format: total/times=avg.
The numbers that test_bpf reports show almost the same relation.
a:jit_enable=0 and jit_harden=0 b:jit_enable=1 and jit_harden=0
test_pkt_access:PASS:ipv4:15622/100=156 test_pkt_access:PASS:ipv4:10674/100=106
test_pkt_access:PASS:ipv6:9130/100=91 test_pkt_access:PASS:ipv6:4855/100=48
test_xdp:PASS:ipv4:240198/100=2401 test_xdp:PASS:ipv4:138912/100=1389
test_xdp:PASS:ipv6:137326/100=1373 test_xdp:PASS:ipv6:68542/100=685
test_l4lb:PASS:ipv4:61100/100=611 test_l4lb:PASS:ipv4:37302/100=373
test_l4lb:PASS:ipv6:101000/100=1010 test_l4lb:PASS:ipv6:55030/100=550
c:jit_enable=1 and jit_harden=2
test_pkt_access:PASS:ipv4:10558/100=105
test_pkt_access:PASS:ipv6:5092/100=50
test_xdp:PASS:ipv4:131902/100=1319
test_xdp:PASS:ipv6:77932/100=779
test_l4lb:PASS:ipv4:38924/100=389
test_l4lb:PASS:ipv6:57520/100=575
The numbers show we get 30%~50% improvement.
See Documentation/networking/filter.txt for more information.
Changelog:
Changes v5-v6:
1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
consistence reason.
2:Clean up non-standard comments, reported by Daniel Borkmann.
3:Fix a memory leak issue, repoted by Daniel Borkmann.
Changes v4-v5:
1:Delete is_on_stack, BPF_REG_AX is the only one
on real hardware registers, so just check with
it.
2:Apply commit 1612a981b7 ("bpf, x64: fix JIT emission
for dead code"), suggested by Daniel Borkmann.
Changes v3-v4:
1:Fix changelog in commit.
I install llvm-6.0, then test_progs willn't report errors.
I submit another patch:
"bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
to fix another problem, after that patch, test_verifier willn't report errors too.
2:Fix clear r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
Changes v2-v3:
1:Move BPF_REG_AX to real hardware registers for performance reason.
3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
4:Delete partial codes in 1c2a088a66, suggested by Daniel Borkmann.
5:Some bug fixes and comments improvement.
Changes v1-v2:
1:Fix bug in emit_ia32_neg64.
2:Fix bug in emit_ia32_arsh_r64.
3:Delete filename in top level comment, suggested by Thomas Gleixner.
4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
5:Rewrite some words in changelog.
6:CodingSytle improvement and a little more comments.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>