This warning will be enabled later so avoid triggering it.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
It's documented as returning int which is also implemented by glibc and
musl, so adopt that return type.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
According to manual page [1], posix spec [2] and source code like
arch/mips/kernel/syscall.c, for historic reasons, the sys_pipe() syscall
on some architectures has an unusual calling convention. It returns
results in two registers which means there is no need for it to do
verify the validity of a userspace pointer argument. Historically that
used to be expensive in Linux. These days the performance advantage is
negligible.
Nolibc doesn't support the unusual calling convention above, luckily
Linux provides a generic sys_pipe2() with an additional flags argument
from 2.6.27. If flags is 0, then pipe2() is the same as pipe(). So here
we use sys_pipe2() to implement the pipe().
pipe2() is also provided to allow users to use flags argument on demand.
[1]: https://man7.org/linux/man-pages/man2/pipe.2.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pipe.html
Suggested-by: Zhangjin Wu <falcon@tinylab.org>
Link: https://lore.kernel.org/all/20230729100401.GA4577@1wt.eu/
Signed-off-by: Yuan Tan <tanyuan@tinylab.org>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Add a minimal implementation of setvbuf(), which error checks the mode
argument (as required by spec) and returns. Since nolibc never buffers
output, nothing needs to be done.
The kselftest framework recently added a call to setvbuf(). As a result,
any tests that use the kselftest framework and nolibc cause a compiler
error due to missing function. This provides an urgent fix for the
problem which is preventing arm64 testing on linux-next.
Example:
clang --target=aarch64-linux-gnu -fintegrated-as
-Werror=unknown-warning-option -Werror=ignored-optimization-argument
-Werror=option-ignored -Werror=unused-command-line-argument
--target=aarch64-linux-gnu -fintegrated-as
-fno-asynchronous-unwind-tables -fno-ident -s -Os -nostdlib \
-include ../../../../include/nolibc/nolibc.h -I../..\
-static -ffreestanding -Wall za-fork.c
build/kselftest/arm64/fp/za-fork-asm.o
-o build/kselftest/arm64/fp/za-fork
In file included from <built-in>:1:
In file included from ./../../../../include/nolibc/nolibc.h:97:
In file included from ./../../../../include/nolibc/arch.h:25:
./../../../../include/nolibc/arch-aarch64.h:178:35: warning: unknown
attribute 'optimize' ignored [-Wunknown-attributes]
void __attribute__((weak,noreturn,optimize("omit-frame-pointer")))
__no_stack_protector _start(void)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from za-fork.c:12:
../../kselftest.h:123:2: error: call to undeclared function 'setvbuf';
ISO C99 and later do not support implicit function declarations
[-Wimplicit-function-declaration]
setvbuf(stdout, NULL, _IOLBF, 0);
^
../../kselftest.h:123:24: error: use of undeclared identifier '_IOLBF'
setvbuf(stdout, NULL, _IOLBF, 0);
^
1 warning and 2 errors generated.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lore.kernel.org/linux-kselftest/CA+G9fYus3Z8r2cg3zLv8uH8MRrzLFVWdnor02SNr=rCz+_WGVg@mail.gmail.com/
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Also clean up the instructions in delayed slots.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
move most of the _start operations to _start_c(), include the
stackprotector initialization.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
As suggested by Thomas, It is able to move the stackprotector
initialization from the assembly _start to the beginning of the new
_start_c(). Let's call __stack_chk_init() in _start_c() as a
preparation.
Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/lkml/a00284a6-54b1-498c-92aa-44997fa78403@t-8ch.de/
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Let's define an empty __stack_chk_init for the !_NOLIBC_STACKPROTECTOR
branch.
This allows to remove #ifdef around every call of __stack_chk_init().
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
As the environ and _auxv support added for nolibc, the assembly _start
function becomes more and more complex and therefore makes the porting
of nolibc to new architectures harder and harder.
To simplify portability, this C version of _start_c() is added to do
most of the assembly start operations in C, which reduces the complexity
a lot and will eventually simplify the porting of nolibc to the new
architectures.
The new _start_c() only requires a stack pointer argument, it will find
argc, argv, envp/environ and _auxv for us, and then call main(),
finally, it exit() with main's return status. With this new _start_c(),
the future new architectures only require to add very few assembly
instructions.
As suggested by Thomas, users may use a different signature of main
(e.g. void main(void)), a _nolibc_main alias is added for main to
silence the warning about potential conflicting types.
As suggested by Willy, the code is carefully polished for both smaller
size and better readability with local variables and the right types.
Suggested-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/lkml/20230715095729.GC24086@1wt.eu/
Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/lkml/90fdd255-32f4-4caf-90ff-06456b53dac3@t-8ch.de/
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
The statx manpage [1] shows that it has been supported from Linux 4.11
and glibc 2.28, the Linux support can be checked for all of the
architectures with this command:
$ git grep -r statx v4.11 arch/ include/uapi/asm-generic/unistd.h \
| grep -E "aarch64|arm|mips|s390|x86|:include/uapi"
Besides riscv and loongarch, all of the nolibc supported architectures
have added sys_statx from Linux v4.11. riscv is mainlined to v4.15,
loongarch is mainlined to v5.19, both of them use the generic unistd.h,
so, they have added sys_statx from their first mainline versions.
The current oldest stable branch is v4.14, only reserving sys_statx
still preserves compatibility with all of the supported stable branches,
So, let's remove the old arch related and dependent sys_stat support
completely.
This is friendly to the future new architecture porting.
[1]: https://man7.org/linux/man-pages/man2/statx.2.html
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
As gcc doc [1] shows:
Most optimizations are completely disabled at -O0 or if an -O level is
not set on the command line, even if individual optimization flags are
specified.
Test result [2] shows, gcc>=11.1.0 deviates from the above description,
but before gcc 11.1.0, "-O0" still forcely uses frame pointer in the
_start function even if the individual optimize("omit-frame-pointer")
flag is specified.
The frame pointer related operations will change the stack pointer (e.g.
In x86_64, an extra "push %rbp" will be inserted at the beginning of
_start) and make it differs from the one we expected, as a result, break
the whole startup function.
To fix up this issue, as suggested by Thomas, the individual "Os" and
"omit-frame-pointer" optimize flags are used together on _start function
to disable frame pointer completely even if the -O0 is set on the
command line.
[1]: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
[2]: https://lore.kernel.org/lkml/20230714094723.140603-1-falcon@tinylab.org/
Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/lkml/34b21ba5-7b59-4b3b-9ed6-ef9a3a5e06f7@t-8ch.de/
Fixes: 7f85485896 ("tools/nolibc: make compiler and assembler agree on the section around _start")
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Fix up such errors reported by scripts/checkpatch.pl:
ERROR: space required after that ',' (ctx:VxV)
#148: FILE: tools/include/nolibc/arch-aarch64.h:148:
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
^
ERROR: space required after that ',' (ctx:VxV)
#148: FILE: tools/include/nolibc/arch-aarch64.h:148:
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
^
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
In commit 52e423f5b9 ("tools/nolibc: export environ as a weak symbol on i386")
and friends the asm startup logic was extended to directly populate the
"environ" array.
This makes it impossible for "environ" to be dropped by the linker.
Therefore also drop the other logic to handle non-present "environ".
Also add a testcase to validate the initialization of environ.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
a reverse operation of mkdir() is meaningful, add rmdir() here.
required by nolibc-test to remove /proc while CONFIG_PROC_FS is not
enabled.
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Both glibc and musl provide RB_ flags via <sys/reboot.h> for reboot(),
they don't need to include <linux/reboot.h>, let nolibc provide RB_
flags too.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Fix up the error reported by scripts/checkpatch.pl:
ERROR: do not use assignment in if condition
#95: FILE: tools/include/nolibc/sys.h:95:
+ if ((ret = sys_brk(0)) && (sys_brk(ret + inc) == ret + inc))
Apply the new generic __sysret() to merge the SET_ERRNO() and return
lines.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Do several cleanups together:
- Since all supported architectures have my_syscall6() now, remove the
#ifdef check.
- Move the mmap() related macros to tools/include/nolibc/types.h and
reuse most of them from <linux/mman.h>
- Apply the new generic __sysret() to convert the calling of sys_map()
to oneline code
Note, since MAP_FAILED is -1 on Linux, so we can use the generic
__sysret() which returns -1 upon error and still satisfy user land that
checks for MAP_FAILED.
Suggested-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/lkml/20230702192347.GJ16233@1wt.eu/
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
No official reference states the errno range, here aligns with musl and
glibc and uses [-MAX_ERRNO, -1] instead of all negative ones.
- musl: src/internal/syscall_ret.c
- glibc: sysdeps/unix/sysv/linux/sysdep.h
The MAX_ERRNO used by musl and glibc is 4095, just like the one nolibc
defined in tools/include/nolibc/errno.h.
Suggested-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/lkml/ZKKdD%2Fp4UkEavru6@1wt.eu/
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Link: https://lore.kernel.org/linux-riscv/94dd5170929f454fbc0a10a2eb3b108d@AcuMS.aculab.com/
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
It is able to pass the 6th argument like the 5th argument via the stack
for mips, let's add a new my_syscall6() now, see [1] for details:
The mips/o32 system call convention passes arguments 5 through 8 on
the user stack.
Both mmap() and pselect6() require my_syscall6().
[1]: https://man7.org/linux/man-pages/man2/syscall.2.html
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
my_syscall<N> share the same long clobber list, define a macro for them.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
my_syscall<N> share the same long clobber list, define a macro for them.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
replace "__asm__ volatile" with "__asm__ volatile" and insert necessary
whitespace before "\" to make sure the lines are aligned.
$ sed -i -e 's/__asm__ volatile ( /__asm__ volatile ( /g' tools/include/nolibc/*.h
Note, arch-s390.h uses post-tab instead of post-whitespaces, must avoid
insert whitespace just before the tabs:
$ sed -i -e 's/__asm__ volatile (\t/__asm__ volatile (\t/g' tools/include/nolibc/arch-*.h
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
More than 8 whitespaces of the code indent are replaced with "tab +
whitespaces" to fix up such errors reported by scripts/checkpatch.pl:
ERROR: code indent should use tabs where possible
#64: FILE: tools/include/nolibc/arch-mips.h:64:
+^I \$
ERROR: code indent should use tabs where possible
#72: FILE: tools/include/nolibc/arch-mips.h:72:
+^I "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9" \$
This command is used:
$ sed -i -e '/^\t* /{s/ /\t/g}' tools/include/nolibc/arch-*.h
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Adding support to specify pid for uprobe_multi link and the uprobes
are created only for task with given pid value.
Using the consumer.filter filter callback for that, so the task gets
filtered during the uprobe installation.
We still need to check the task during runtime in the uprobe handler,
because the handler could get executed if there's another system
wide consumer on the same uprobe (thanks Oleg for the insight).
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding support to specify cookies array for uprobe_multi link.
The cookies array share indexes and length with other uprobe_multi
arrays (offsets/ref_ctr_offsets).
The cookies[i] value defines cookie for i-the uprobe and will be
returned by bpf_get_attach_cookie helper when called from ebpf
program hooked to that specific uprobe.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding new multi uprobe link that allows to attach bpf program
to multiple uprobes.
Uprobes to attach are specified via new link_create uprobe_multi
union:
struct {
__aligned_u64 path;
__aligned_u64 offsets;
__aligned_u64 ref_ctr_offsets;
__u32 cnt;
__u32 flags;
} uprobe_multi;
Uprobes are defined for single binary specified in path and multiple
calling sites specified in offsets array with optional reference
counters specified in ref_ctr_offsets array. All specified arrays
have length of 'cnt'.
The 'flags' supports single bit for now that marks the uprobe as
return probe.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Switching BPF_F_KPROBE_MULTI_RETURN macro to anonymous enum,
so it'd show up in vmlinux.h. There's not functional change
compared to having this as macro.
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding support for bpf_get_func_ip helper for uprobe program to return
probed address for both uprobe and return uprobe.
We discussed this in [1] and agreed that uprobe can have special use
of bpf_get_func_ip helper that differs from kprobe.
The kprobe bpf_get_func_ip returns:
- address of the function if probe is attach on function entry
for both kprobe and return kprobe
- 0 if the probe is not attach on function entry
The uprobe bpf_get_func_ip returns:
- address of the probe for both uprobe and return uprobe
The reason for this semantic change is that kernel can't really tell
if the probe user space address is function entry.
The uprobe program is actually kprobe type program attached as uprobe.
One of the consequences of this design is that uprobes do not have its
own set of helpers, but share them with kprobes.
As we need different functionality for bpf_get_func_ip helper for uprobe,
I'm adding the bool value to the bpf_trace_run_ctx, so the helper can
detect that it's executed in uprobe context and call specific code.
The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which
is currently used only for executing bpf programs in uprobe.
Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe
to address that it's only used for uprobes and that it sets the
run_ctx.is_uprobe as suggested by Yafang Shao.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
[1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Use __sysret() to shrink most of the library routines to oneline code.
Removed 266 lines of duplicated code.
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Use __sysret() to shrink the whole _syscall() to oneline code.
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Most of the library routines share the same syscall return logic:
In general, a 0 return value indicates success. A -1 return value
indicates an error, and an error number is stored in errno. [1]
Let's add a __sysret() helper for the above logic to simplify the coding
and shrink the code lines too.
Thomas suggested to use inline function instead of macro for __sysret().
Willy suggested to make __sysret() be always inline.
[1]: https://man7.org/linux/man-pages/man2/syscall.2.html
Suggested-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/linux-riscv/ZH1+hkhiA2+ItSvX@1wt.eu/
Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/linux-riscv/ea4e7442-7223-4211-ba29-70821e907888@t-8ch.de/
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Compiling nolibc for rv32 got such errors:
nolibc/sysroot/riscv/include/sys.h: In function ‘sys_gettimeofday’:
nolibc/sysroot/riscv/include/sys.h:557:21: error: ‘__NR_gettimeofday’ undeclared (first use in this function); did you mean ‘sys_gettimeofday’?
557 | return my_syscall2(__NR_gettimeofday, tv, tz);
| ^~~~~~~~~~~~~~~~~
nolibc/sysroot/riscv/include/sys.h: In function ‘sys_lseek’:
nolibc/sysroot/riscv/include/sys.h:675:21: error: ‘__NR_lseek’ undeclared (first use in this function)
675 | return my_syscall3(__NR_lseek, fd, offset, whence);
| ^~~~~~~~~~
nolibc/sysroot/riscv/include/sys.h: In function ‘sys_wait4’:
nolibc/sysroot/riscv/include/sys.h:1341:21: error: ‘__NR_wait4’ undeclared (first use in this function)
1341 | return my_syscall4(__NR_wait4, pid, status, options, rusage);
If a syscall macro is not supported by a target platform, wrap it with
'#ifdef' and 'return -ENOSYS' for the '#else' branch, which lets the
other syscalls work as-is and allows developers to fix up the test
failures reported by nolibc-test one by one later.
This wraps all of the failed syscall macros with '#ifdef' and 'return
-ENOSYS' for the '#else' branch, so, all of the undeclared failures are
fixed.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-riscv/5e7d2adf-e96f-41ca-a4c6-5c87a25d4c9c@app.fastmail.com/
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Compiling nolibc for rv32 got such errors:
In file included from nolibc/sysroot/riscv/include/nolibc.h:99,
from nolibc/sysroot/riscv/include/errno.h:26,
from nolibc/sysroot/riscv/include/stdio.h:14,
from tools/testing/selftests/nolibc/nolibc-test.c:12:
nolibc/sysroot/riscv/include/sys.h:946:2: error: #error Neither __NR_ppoll nor __NR_poll defined, cannot implement sys_poll()
946 | #error Neither __NR_ppoll nor __NR_poll defined, cannot implement sys_poll()
| ^~~~~
nolibc/sysroot/riscv/include/sys.h:1062:2: error: #error None of __NR_select, __NR_pselect6, nor __NR__newselect defined, cannot implement sys_select()
1062 | #error None of __NR_select, __NR_pselect6, nor __NR__newselect defined, cannot implement sys_select()
If a syscall is not supported by a target platform, 'return -ENOSYS' is
better than '#error', which lets the other syscalls work as-is and
allows developers to fix up the test failures reported by nolibc-test
one by one later.
This converts all of the '#error' to 'return -ENOSYS', so, all of the
'#error' failures are fixed.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-riscv/5e7d2adf-e96f-41ca-a4c6-5c87a25d4c9c@app.fastmail.com/
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
=/kVE
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-03
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).
The main changes are:
1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
Daniel Borkmann
2) Support new insns from cpu v4 from Yonghong Song
3) Non-atomically allocate freelist during prefill from YiFei Zhu
4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu
5) Add tracepoint to xdp attaching failure from Leon Hwang
6) struct netdev_rx_queue and xdp.h reshuffling to reduce
rebuild time from Jakub Kicinski
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
net: invert the netdevice.h vs xdp.h dependency
net: move struct netdev_rx_queue out of netdevice.h
eth: add missing xdp.h includes in drivers
selftests/bpf: Add testcase for xdp attaching failure tracepoint
bpf, xdp: Add tracepoint to xdp attaching failure
selftests/bpf: fix static assert compilation issue for test_cls_*.c
bpf: fix bpf_probe_read_kernel prototype mismatch
riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
libbpf: fix typos in Makefile
tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
netfilter: bpf: Only define get_proto_defrag_hook() if necessary
bpf: Fix an array-index-out-of-bounds issue in disasm.c
net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
docs/bpf: Fix malformed documentation
bpf: selftests: Add defrag selftests
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Support not connecting client socket
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
...
====================
Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link the active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.
We also take care to avoid any issues w.r.t. module unloading -- while
defrag is active on a link, the module is prevented from unloading.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Also add support to pass topdir to ynl-regen.sh (Jakub) and call
it from the makefile to update the UAPI headers.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230727163001.3952878-4-sdf@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.
The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:
sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.
Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.
Fixes: 8e368dc72e ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Seeing the following:
Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'
...so sync tools version missing some list_node/rb_tree fields.
Fixes: c3c510ce43 ("bpf: Add 'owner' field to bpf_{list,rb}_node")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20230719162257.20818-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.
The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:
I cannot help but feel that priority logic copy-paste from old tc, netfilter
and friends is done because "that's how things were done in the past". [...]
Priority gets exposed everywhere in uapi all the way to bpftool when it's
right there for users to understand. And that's the main problem with it.
The user don't want to and don't need to be aware of it, but uapi forces them
to pick the priority. [...] Your cover letter [0] example proves that in
real life different service pick the same priority. They simply don't know
any better. Priority is an unnecessary magic that apps _have_ to pick, so
they just copy-paste and everyone ends up using the same.
The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.
Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:
- Support prog-based attach/detach and link API
- Dependency directives (can also be combined):
- BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
- BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
space application does not need CAP_SYS_ADMIN to retrieve foreign fds
via bpf_*_get_fd_by_id()
- BPF_F_LINK flag as {prog,link} toggle
- If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
BPF_F_AFTER will just append for attaching
- Enforced only at attach time
- BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
own infra for replacing their internal prog
- If no flags are set, then it's default append behavior for attaching
- Internal revision counter and optionally being able to pass expected_revision
- User space application can query current state with revision, and pass it
along for attachment to assert current state before doing updates
- Query also gets extension for link_ids array and link_attach_flags:
- prog_ids are always filled with program IDs
- link_ids are filled with link IDs when link was used, otherwise 0
- {prog,link}_attach_flags for holding {prog,link}-specific flags
- Must be easy to integrate/reuse for in-kernel users
The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.
The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.
The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.
An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.
Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.
[0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
[1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
[2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the first basic multi-buffer test that sends a stream of 9K
packets and validates that they are received at the other end. In
order to enable sending and receiving multi-buffer packets, code that
sets the MTU is introduced as well as modifications to the XDP
programs so that they signal that they are multi-buffer enabled.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-20-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the ability to send and receive packets that are larger than the
size of a umem frame, using the AF_XDP /XDP multi-buffer
support. There are three pieces of code that need to be changed to
achieve this: the Rx path, the Tx path, and the validation logic.
Both the Rx path and Tx could only deal with a single fragment per
packet. The Tx path is extended with a new function called
pkt_nb_frags() that can be used to retrieve the number of fragments a
packet will consume. We then create these many fragments in a loop and
fill the N-1 first ones to the max size limit to use the buffer space
efficiently, and the Nth one with whatever data that is left. This
goes on until we have filled in at the most BATCH_SIZE worth of
descriptors and fragments. If we detect that the next packet would
lead to BATCH_SIZE number of fragments sent being exceeded, we do not
send this packet and finish the batch. This packet is instead sent in
the next iteration of BATCH_SIZE fragments.
For Rx, we loop over all fragments we receive as usual, but for every
descriptor that we receive we call a new validation function called
is_frag_valid() to validate the consistency of this fragment. The code
then checks if the packet continues in the next frame. If so, it loops
over the next packet and performs the same validation. once we have
received the last fragment of the packet we also call the function
is_pkt_valid() to validate the packet as a whole. If we get to the end
of the batch and we are not at the end of the current packet, we back
out the partial packet and end the loop. Once we get into the receive
loop next time, we start over from the beginning of that packet. This
so the code becomes simpler at the cost of some performance.
The validation function is_frag_valid() checks that the sequence and
packet numbers are correct at the start and end of each fragment.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-19-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
carry maximum fragments that underlying ZC driver is able to handle on
TX side. It is going to be included in netlink response only when driver
supports ZC. Any value higher than 1 implies multi-buffer ZC support on
underlying device.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmSwqwoACgkQ6rmadz2v
bTqOHRAAn+fzTLqUqsveFQcxOkie5MPHxKoOTjG4+yFR7rzPkU6Mn5RX3w5yFzSn
RqutwykF9OgipAzC3QXv4pRJuq6Gia5nvwUSDP4CX273ljyeF54DK7HfopE1+YrK
HXyBWZvVvMZP6q7qQyQ3qtbHZSjs5XP/M6YBlJ5zo/BTLFCyvbSDP14YKEqcBkWG
ld72ElXFxlnr/zEfRjzBCfMlbmgeHLO0SiHS/9827zEmNP1AAH5/ETA7/rJ7yCJs
QNQUIoJWob8xm5FMJ6CU/+sOqXR1CY053meGJFFBX5pvVD/CLRhrwHn0IMCyQqmh
wKR5waeXhpl/CKNeFuxXVMNFiXbqBb/0LYJaJtrMysjMLTsQ9X7NkrDBa/+kYGyZ
+ghGlaMQvPqUGg0rLH2nl9JNB8Ne/8prLMsAKUWnPuOo+Q03j054gnqhGeNtDd5b
gpSk+7x93PlhGcegBV1Wk8dkiGC5V9nTVNxg40XQUCs4k9L/8Vjc35Tjqx7nBTNH
DiFD24DDKUZacw9L6nEqvLF/N2fiRjtUZnVPC0yn/annyBcfX1s+ZH2Tu1F6Qk38
QMfBCnt12exmsiDoxdzzGJtjHnS/k5fsaKjlR21mOyMrIH7ipltr5UHHrdr1hBP6
24uSeTImvQQKDi+9IuXN127jZDOupKqVS6csrA0ZXrlKWh2HR+U=
=GVUB
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-07-13
We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).
The main changes are:
1) Fix bpftool build in presence of stale vmlinux.h,
from Alexander Lobakin.
2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
from Alexei Starovoitov.
3) Teach verifier actual bounds of bpf_get_smp_processor_id()
and fix perf+libbpf issue related to custom section handling,
from Andrii Nakryiko.
4) Introduce bpf map element count, from Anton Protopopov.
5) Check skb ownership against full socket, from Kui-Feng Lee.
6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.
7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.
8) Fix BTF walking of unions, from Yafang Shao.
9) Extend link_info for kprobe_multi and perf_event links,
from Yafang Shao.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
selftests/bpf: Add selftest for PTR_UNTRUSTED
bpf: Fix an error in verifying a field in a union
selftests/bpf: Add selftests for nested_trust
bpf: Fix an error around PTR_UNTRUSTED
selftests/bpf: add testcase for TRACING with 6+ arguments
bpf, x86: allow function arguments up to 12 for TRACING
bpf, x86: save/restore regs with BPF_DW size
bpftool: Use "fallthrough;" keyword instead of comments
bpf: Add object leak check.
bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
selftests/bpf: Improve test coverage of bpf_mem_alloc.
rcu: Export rcu_request_urgent_qs_task()
bpf: Allow reuse from waiting_for_gp_ttrace list.
bpf: Add a hint to allocated objects.
bpf: Change bpf_mem_cache draining process.
bpf: Further refactor alloc_bulk().
bpf: Factor out inc/dec of active flag into helpers.
bpf: Refactor alloc_bulk().
bpf: Let free_all() return the number of freed elements.
...
====================
Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By introducing support for ->fill_link_info to the perf_event link, users
gain the ability to inspect it using `bpftool link show`. While the current
approach involves accessing this information via `bpftool perf show`,
consolidating link information for all link types in one place offers
greater convenience. Additionally, this patch extends support to the
generic perf event, which is not currently accommodated by
`bpftool perf show`. While only the perf type and config are exposed to
userspace, other attributes such as sample_period and sample_freq are
ignored. It's important to note that if kptr_restrict is not permitted, the
probed address will not be exposed, maintaining security measures.
A new enum bpf_perf_event_type is introduced to help the user understand
which struct is relevant.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With the addition of support for fill_link_info to the kprobe_multi link,
users will gain the ability to inspect it conveniently using the
`bpftool link show`. This enhancement provides valuable information to the
user, including the count of probed functions and their respective
addresses. It's important to note that if the kptr_restrict setting is not
permitted, the probed address will not be exposed, ensuring security.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
To pick the changes from:
6ac3928156 ("fs: allow to mount beneath top mount")
That, after a fix to the move_mount_flags.sh script, harvests the new
MOVE_MOUNT_BENEATH move_mount flag:
$ tools/perf/trace/beauty/move_mount_flags.sh > before
$ cp include/uapi/linux/mount.h tools/include/uapi/linux/mount.h
$ tools/perf/trace/beauty/move_mount_flags.sh > after
$
$ diff -u before after
--- before 2023-07-11 12:38:49.244886707 -0300
+++ after 2023-07-11 12:51:15.125255940 -0300
@@ -6,4 +6,5 @@
[ilog2(0x00000020) + 1] = "T_AUTOMOUNTS",
[ilog2(0x00000040) + 1] = "T_EMPTY_PATH",
[ilog2(0x00000100) + 1] = "SET_GROUP",
+ [ilog2(0x00000200) + 1] = "BENEATH",
};
$
That will then be properly decoded when used in tools like:
# perf trace -e move_mount
This addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h
Cc: Christian Brauner <brauner@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZK17kifP%2FiYl+Hcc@kernel.org/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
89d01306e3 ("RISC-V: KVM: Implement device interface for AIA irqchip")
22725266bd ("KVM: Fix comment for KVM_ENABLE_CAP")
2f440b72e8 ("KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE")
That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the the 'perf trace' ioctl syscall argument
beautifiers.
This addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Anup Patel <apatel@ventanamicro.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Ricardo Koller <ricarkol@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/lkml/ZK12+virXMIXMysy@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To get the changes in:
96b2b072ee ("exportfs: allow exporting non-decodeable file handles to userspace")
That don't add anything that is handled by existing hard coded tables or
table generation scripts.
This silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZK11P5AwRBUxxutI@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
81b1b599df ("drm/i915: Allow user to set cache at BO creation")
98d2722a85 ("drm/i915/huc: differentiate the 2 steps of the MTL HuC auth flow")
bc4be0a38b ("drm/i915/pmu: Prepare for multi-tile non-engine counters")
d1da138f24 ("drm/i915/uapi/pxp: Add a GET_PARAM for PXP")
That adds some ioctls but use the __I915_PMU_OTHER() macro, not
supported yet in the tools/perf/trace/beauty/drm_ioctl.sh conversion
script.
This silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Fei Yang <fei.yang@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Radhakrishna Sripada <radhakrishna.sripada@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://lore.kernel.org/lkml/ZK1R%2FIyWcUKYQbQV@kernel.org/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
These are cleanups for architecture specific header files:
- the comments in include/linux/syscalls.h have gone out of sync
and are really pointless, so these get removed
- The asm/bitsperlong.h header no longer needs to be architecture
specific on modern compilers, so use a generic version for newer
architectures that use new enough userspace compilers
- A cleanup for virt_to_pfn/virt_to_bus to have proper type
checking, forcing the use of pointers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmSl138ACgkQYKtH/8kJ
UieqWxAA2WjNVfyuieYckglOVE0PZPs2fzCwyzTY5iUTH3gE5cBFWJDWcg2EnouG
v3X3htEQcowYWaCF9+rypQXaGiSx4WXi2Bjxnz3D/BcreqWPI4eSQ0fpGG5SURTY
2zYF72GTt4JGR++l+7/R9MZwPbwYDT9BsD5tkel8PxnyVLM6/c5xFvbjzRSKFE8x
SMN1jGZ62ITLNf/8coAOEPNxBYtDT6yQyu7P2sx5cd65LAQq9yLKjFklnBBovgWT
OoCIZAdGkhcNwOh1LjyHcdNdpfNJGceKyqKPqty07IhCQuF2jxiyFYFzuBbeyQfE
S0itN8o/MIfUmxaQl3e8dPAVb1RlNVr1zfQ6y4tUtWNdkNL2WwSnSQSRHrBfHxCQ
QCF++PMeFcLhGwMYtqdNJ7XGLQ0PsjD74pRf0vo+vjmqDk2BJsJBP57VU+8MJn5r
SoxqnJ0WxLvm1TfrNKusV7zMNWquc2duJDW40zsOssP4itjYELSI6qa56qmzlqmX
zKmRx6mxAlx9RRK8FHXFYHbz3p93vv8z9vTOZV3AjIjjED960CLknUAwCC8FoJyz
9b5wyMXsLQHQjGt8luAvPc6OiU0EiU9a4SPK+feWcv27serFvnjJlRTS/yG2Z3zd
BYsUgsXHypsdoud+aE7MeCy7fE8n3mhoyMQQRBkOMFJ7RsG6wAE=
=S/he
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"These are cleanups for architecture specific header files:
- the comments in include/linux/syscalls.h have gone out of sync and
are really pointless, so these get removed
- The asm/bitsperlong.h header no longer needs to be architecture
specific on modern compilers, so use a generic version for newer
architectures that use new enough userspace compilers
- A cleanup for virt_to_pfn/virt_to_bus to have proper type checking,
forcing the use of pointers"
* tag 'asm-generic-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
syscalls: Remove file path comments from headers
tools arch: Remove uapi bitsperlong.h of hexagon and microblaze
asm-generic: Unify uapi bitsperlong.h for arm64, riscv and loongarch
m68k/mm: Make pfn accessors static inlines
arm64: memory: Make virt_to_pfn() a static inline
ARM: mm: Make virt_to_pfn() a static inline
asm-generic/page.h: Make pfn accessors static inlines
xen/netback: Pass (void *) to virt_to_page()
netfs: Pass a pointer to virt_to_page()
cifs: Pass a pointer to virt_to_page() in cifsglob
cifs: Pass a pointer to virt_to_page()
riscv: mm: init: Pass a pointer to virt_to_page()
ARC: init: Pass a pointer to virt_to_pfn() in init
m68k: Pass a pointer to virt_to_pfn() virt_to_page()
fs/proc/kcore.c: Pass a pointer to virt_addr_valid()
Core
----
- Rework the sendpage & splice implementations. Instead of feeding
data into sockets page by page extend sendmsg handlers to support
taking a reference on the data, controlled by a new flag called
MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
to invoke an additional callback instead of trying to predict what
the right combination of MORE/NOTLAST flags is.
Remove the MSG_SENDPAGE_NOTLAST flag completely.
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
- Enable socket busy polling with CONFIG_RT.
- Improve reliability and efficiency of reporting for ref_tracker.
- Auto-generate a user space C library for various Netlink families.
Protocols
---------
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2].
- Use per-VMA locking for "page-flipping" TCP receive zerocopy.
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags.
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative.
- Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
- Avoid waking up applications using TLS sockets until we have
a full record.
- Allow using kernel memory for protocol ioctl callbacks, paving
the way to issuing ioctls over io_uring.
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address.
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch.
- PPPoE: make number of hash bits configurable.
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig).
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge).
- Support matching on Connectivity Fault Management (CFM) packets.
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug.
- HSR: don't enable promiscuous mode if device offloads the proto.
- Support active scanning in IEEE 802.15.4.
- Continue work on Multi-Link Operation for WiFi 7.
BPF
---
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used,
or in fact passing the verifier at all for complex programs,
especially those using open-coded iterators.
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what
the output buffer *should* be, without writing anything.
- Accept dynptr memory as memory arguments passed to helpers.
- Add routing table ID to bpf_fib_lookup BPF helper.
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only).
- Show target_{obj,btf}_id in tracing link fdinfo.
- Addition of several new kfuncs (most of the names are self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter
---------
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value.
- Increase ip_vs_conn_tab_bits range for 64BIT builds.
- Allow updating size of a set.
- Improve NAT tuple selection when connection is closing.
Driver API
----------
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out).
- Support configuring rate selection pins of SFP modules.
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines.
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer.
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio).
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message.
- Split devlink instance and devlink port operations.
New hardware / drivers
----------------------
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers
-------
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is
a variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split
the different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and
Enhanced MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
=9i4I
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
o Add stackprotector support.
o Fix RISC-V load-store instruction syntax to support 32-bit
binaries, plus fixes for generic 32-bit support.
o Fix use of s390 sys_fork().
o Add my_syscall6() for ARM.
o Support different platforms having different errno definitions.
o Fix ppoll/ppoll_time64 arguments (add the fifth argument).
o Force use of little endian on MIPS.
o Improved testing, for example, better handling of different
compilers and compiler versions, comparing nolibc behavior
to that of libc, and additional test cases.
o Improve syntax and header ordering.
o Use existing <linux/reboot.h> instead of redefining constants.
o Add syscall().
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmSUv78THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jIUUD/9kEmF9BgeXerrsOx3omRgt7DbZR1Kj
UYEI0mzydqQc92ZuS87VsM56rOA5SUDUDCPgTkkPCpo1anqo22+9FfFLU6M7EoEJ
CNISlLtb7S1MdM9hND0RlxKoHxthcVcpUThVzAGMmuTNukJudwVBr085iiOS20VO
taR5oHbPEE6pMmhbEsurmUHwTaqeCnuSZqmUoHnzatOidRByZDL7/mPr8y+lhtwo
MP0wkS9ie6OTs7shH2/tt683ZY/v/JZnoOmokl7YxN6vsWeTxX7H3W4jdSGrPqW5
H+OaMVZV5QPG3EFN6MhvdMSAGWLXohMtMuSLc/BACwJ8u073LvJgJHoBahiVPXn7
y0bJbZbnXvkpp+Hqxh4argarwtQum3KAUrNLO/vIWSjJN1HbT0rhc1sRhAM+cta8
3a2nSsGf4xW8ToHgg2Q9PNzJSHxtIX1LxSEboS0IyRSYsdUS9E8gxugVIfyH9Rle
gyasoSjepqwLVz6JnWiFIizHLPpEc22a3wSoRm6MzRKFaY+f8+KW6si7GgmSNmdA
LJk5tid+2Unjz7BhXJ14XHRBpHYdQRQ4uA42EcUSc1CFc4/0rodGJ0hi03SXDGWY
dH11x/yKW54lWZqyYUA/KAcJm8jwCFIWfGRvY9DHrA0Sh5aEyeNH3Brx1iITqnht
svgWtwUsBJYIMQ==
=Is+H
-----END PGP SIGNATURE-----
Merge tag 'nolibc.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull nolibc updates from Paul McKenney:
- Add stackprotector support
- Fix RISC-V load-store instruction syntax to support 32-bit binaries,
plus fixes for generic 32-bit support
- Fix use of s390 sys_fork()
- Add my_syscall6() for ARM
- Support different platforms having different errno definitions
- Fix ppoll/ppoll_time64 arguments (add the fifth argument)
- Force use of little endian on MIPS
- Improved testing, for example, better handling of different compilers
and compiler versions, comparing nolibc behavior to that of libc, and
additional test cases
- Improve syntax and header ordering
- Use existing <linux/reboot.h> instead of redefining constants
- Add syscall()
* tag 'nolibc.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (53 commits)
selftests/nolibc: make sure gcc always use little endian on MIPS
selftests/nolibc: also count skipped and failed tests in output
selftests/nolibc: add new gettimeofday test cases
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
tools/nolibc: open: fix up compile warning for arm
tools/nolibc: arm: add missing my_syscall6
selftests/nolibc: use INT_MAX instead of __INT_MAX__
selftests/nolibc: not include limits.h for nolibc
selftests/nolibc: fix up compile warning with glibc on x86_64
selftests/nolibc: allow specify extra arguments for qemu
selftests/nolibc: remove test gettimeofday_null
tools/nolibc: ensure fast64 integer types have 64 bits
selftests/nolibc: test_fork: fix up duplicated print
tools/nolibc: ppoll/ppoll_time64: add a missing argument
selftests/nolibc: remove the duplicated gettimeofday_bad2
selftests/nolibc: print name instead of number for EOVERFLOW
tools/nolibc: support nanoseconds in stat()
selftests/nolibc: prevent coredumps during test execution
tools/nolibc: add support for prctl()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJX+ygAKCRDbK58LschI
g0/2AQDHg12smf9mPfK9wOFDNRIIX8r2iufB8LUFQMzCwltN6gEAkAdkAyfbof7P
TMaNUiHABijAFtChxoSI35j3OOSRrwE=
=GJgN
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-06-23
We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).
The main changes are:
1) Extend bpf_fib_lookup helper to allow passing the route table ID,
from Louis DeLosSantos.
2) Fix regsafe() in verifier to call check_ids() for scalar registers,
from Eduard Zingerman.
3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
and a rework of bpf_cpumask_any*() kfuncs. Additionally,
add selftests, from David Vernet.
4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
from Gilad Sever.
5) Change bpf_link_put() to use workqueue unconditionally to fix it
under PREEMPT_RT, from Sebastian Andrzej Siewior.
6) Follow-ups to address issues in the bpf_refcount shared ownership
implementation, from Dave Marchevsky.
7) A few general refactorings to BPF map and program creation permissions
checks which were part of the BPF token series, from Andrii Nakryiko.
8) Various fixes for benchmark framework and add a new benchmark
for BPF memory allocator to BPF selftests, from Hou Tao.
9) Documentation improvements around iterators and trusted pointers,
from Anton Protopopov.
10) Small cleanup in verifier to improve allocated object check,
from Daniel T. Lee.
11) Improve performance of bpf_xdp_pointer() by avoiding access
to shared_info when XDP packet does not have frags,
from Jesper Dangaard Brouer.
12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
from Yonghong Song.
13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
from Jarkko Sakkinen.
14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
bpf, docs: Document existing macros instead of deprecated
bpf, docs: BPF Iterator Document
selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
selftests/bpf: Add vrf_socket_lookup tests
bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
bpf: Factor out socket lookup functions for the TC hookpoint.
selftests/bpf: Set the default value of consumer_cnt as 0
selftests/bpf: Ensure that next_cpu() returns a valid CPU number
selftests/bpf: Output the correct error code for pthread APIs
selftests/bpf: Use producer_cnt to allocate local counter array
xsk: Remove unused inline function xsk_buff_discard()
bpf: Keep BPF_PROG_LOAD permission checks clear of validations
bpf: Centralize permissions checks for all BPF map types
bpf: Inline map creation logic in map_create() function
bpf: Move unprivileged checks into map_create() and bpf_prog_load()
bpf: Remove in_atomic() from bpf_link_put().
selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
bpf: Verify scalar ids mapping in regsafe() using check_ids()
selftests/bpf: Check if mark_chain_precision() follows scalar ids
...
====================
Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Source file locations for syscall definitions can change over a period
of time. File paths in comments get stale and are hard to maintain long
term. Also, their usefulness is questionable since it would be easier to
locate a syscall definition using the SYSCALL_DEFINEx() macro.
Remove all source file path comments from the syscall headers. Also,
equalize the uneven line spacing (some of which is introduced due to the
deletions).
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Now we specify the minimal version of GCC as 5.1 and Clang/LLVM as 11.0.0
in Documentation/process/changes.rst, __CHAR_BIT__ and __SIZEOF_LONG__ are
usable, it is probably fine to unify the definition of __BITS_PER_LONG as
(__CHAR_BIT__ * __SIZEOF_LONG__) in asm-generic uapi bitsperlong.h.
In order to keep safe and avoid regression, only unify uapi bitsperlong.h
for some archs such as arm64, riscv and loongarch which are using newer
toolchains that have the definitions of __CHAR_BIT__ and __SIZEOF_LONG__.
Suggested-by: Xi Ruoyao <xry111@xry111.site>
Link: https://lore.kernel.org/all/d3e255e4746de44c9903c4433616d44ffcf18d1b.camel@xry111.site/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-arch/a3a4f48a-07d4-4ed9-bc53-5d383428bdd2@app.fastmail.com/
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd.
This thing is direct analog of SO_PEERCRED which allows to get plain PID.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS,
but it contains pidfd instead of plain pid, which allows programmers not
to care about PID reuse problem.
We mask SO_PASSPIDFD feature if CONFIG_UNIX is not builtin because
it depends on a pidfd_prepare() API which is not exported to the kernel
modules.
Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/
Big thanks to Christian Brauner and Lennart Poettering for productive
discussions about this.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Tested-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function ‘open’:
nolibc/sysroot/arm/include/sys.h:919:23: warning: ‘mode_t’ {aka ‘short unsigned int’} is promoted to ‘int’ when passed through ‘...’
919 | mode = va_arg(args, mode_t);
| ^
nolibc/sysroot/arm/include/sys.h:919:23: note: (so you should pass ‘int’ not ‘mode_t’ {aka ‘short unsigned int’} to ‘va_arg’)
nolibc/sysroot/arm/include/sys.h:919:23: note: if this code is reached, the program will abort
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This is required by the coming removal of the oldselect and newselect
support.
pselect6/pselect6_time64 will be used unconditionally, they have 6
arguments.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-riscv/bf3e07c1-75f5-425b-9124-f3f2b230e63a@app.fastmail.com/
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When compile nolibc-test.c with 2.31 glibc, we got such error:
In file included from /usr/riscv64-linux-gnu/include/sys/cdefs.h:452,
from /usr/riscv64-linux-gnu/include/features.h:461,
from /usr/riscv64-linux-gnu/include/bits/libc-header-start.h:33,
from /usr/riscv64-linux-gnu/include/limits.h:26,
from /usr/lib/gcc-cross/riscv64-linux-gnu/9/include/limits.h:194,
from /usr/lib/gcc-cross/riscv64-linux-gnu/9/include/syslimits.h:7,
from /usr/lib/gcc-cross/riscv64-linux-gnu/9/include/limits.h:34,
from /labs/linux-lab/src/linux-stable/tools/testing/selftests/nolibc/nolibc-test.c:6:
/usr/riscv64-linux-gnu/include/bits/wordsize.h:28:3: error: #error "rv32i-based targets are not supported"
28 | # error "rv32i-based targets are not supported"
Glibc (>= 2.33) commit 5b6113d62efa ("RISC-V: Support the 32-bit ABI
implementation") fixed up above error.
As suggested by Thomas, defining INT_MIN/INT_MAX for nolibc can remove
the including of limits.h, and therefore no above error. of course, the
other libcs still require limits.h, move it to the right place.
The LONG_MIN/LONG_MAX are also defined too.
Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/linux-riscv/09d60dc2-e298-4c22-8e2f-8375861bd9be@t-8ch.de/
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On 32bit platforms size_t is not enough to represent [u]int_fast64_t.
Fixes: 3e9fd4e9a1 ("tools/nolibc: add integer types and integer limit macros")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The ppoll and ppoll_time64 syscalls have 5 arguments, but we only
provide 4, align with kernel and add the missing sigsetsize argument.
Because the sigmask is NULL, the last sigsetsize argument is ignored,
keep it as 0 here is safe enough.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Keep backwards compatibility through unions.
The compatibility macros like
#define st_atime st_atim.tv_sec
as documented in stat(3type) don't work for nolibc because it would
break with other stat-like structures that contain the field st_atime.
The stx_atime, stx_mtime, stx_ctime are in type of 'struct
statx_timestamp', which is incompatible with 'struct timespec', should
be converted explicitly.
/* include/uapi/linux/stat.h */
struct statx_timestamp {
__s64 tv_sec;
__u32 tv_nsec;
__s32 __reserved;
};
/* include/uapi/linux/time.h */
struct timespec {
__kernel_old_time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/linux-riscv/3a3edd48-1ace-4c89-89e8-9c594dd1b3c9@t-8ch.de/
Co-authored-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
[wt: squashed Zhangjin & Thomas' patches into one to preserve "bisectability"]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
It will be used to disable core dumps from the child spawned to validate
the stack protector functionality.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
s390 does not support the "global" stack protector mode that is
implemented in nolibc.
Now that nolibc detects if stack protectors are enabled at runtime it
could happen that a future compiler does indeed use global mode on
and nolibc would compile but segfault at runtime.
To avoid this hypothetic case and to align s390 with the other
architectures disable stack protectors when compiling _start().
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Not all compilers, notably GCC < 10, have support for
__attribute__((no_stack_protector)).
Fall back to a mechanism that also works there.
Tested with GCC 9.5.0 from kernel.org crosstools.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The stackprotector support in nolibc should be enabled iff it is also
enabled in the compiler.
Use the preprocessor defines added by gcc and clang if stackprotector
support is enable to automatically do so in nolibc.
This completely removes the need for any user-visible API.
To avoid inlining the lengthy preprocessor check into every user
introduce a new header compiler.h that abstracts the logic away.
As the define NOLIBC_STACKPROTECTOR is now not user-relevant anymore
prefix it with an underscore.
Suggested-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/lkml/20230520133237.GA27501@1wt.eu/
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This makes it easier to add and remove more entries in the future
without creating spurious diff hunks.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The all-zero pattern is one of the more probable out-of-bound writes so
add a special case to not accidentally accept it.
Also it enables the reliable detection of stack protector initialization
during testing.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This was forgotten in the original submission.
It is unknown why it worked for x86_64 on some compiler without this
attribute.
Reported-by: Willy Tarreau <w@1wt.eu>
Closes: https://lore.kernel.org/lkml/20230520133237.GA27501@1wt.eu/
Fixes: 0d8c461adb ("tools/nolibc: x86_64: add stackprotector support")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
syscall() is used by "normal" libcs to allow users to directly call
syscalls.
By having the same syntax inside nolibc users can more easily write code
that works with different libcs.
The macro logic is adapted from systemtaps STAP_PROBEV() macro that is
released in the public domain / CC0.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When compile nolibc application for rv32, we got such errors:
nolibc/sysroot/riscv/include/arch.h:190: Error: unrecognized opcode `ld a4,0(a3)'
nolibc/sysroot/riscv/include/arch.h:194: Error: unrecognized opcode `sd a3,%lo(_auxv)(a4)'
nolibc/sysroot/riscv/include/arch.h:196: Error: unrecognized opcode `sd a2,%lo(environ)(a3)'
Refer to arch/riscv/include/asm/asm.h and add REG_L/REG_S macros here to let
rv32 uses its own lw/sw instructions.
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The same constants and some more have been exposed to userspace via
linux/reboot.h for a long time.
To avoid conflicts and trim down nolibc a bit drop the custom
definitions.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On s390 the first two arguments to the clone() syscall are swapped,
as documented in clone(2).
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Most of nolibc is already using C89 comments.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When building in strict C89 mode the "inline" keyword is unknown.
While "__inline__" is non-standard it is used by the kernel headers
themselves.
So the used compilers would have to support it or the users shim it with
a #define.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Most of the code was migrated to C99-conformant __asm__ statements
before. It seems string.h was missed.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>