Alexei Starovoitov says:
====================
v5->v6:
- fixed issue found by bpf CI. The light skeleton generation was
doing a dry-run of loading the program where all actual sys_bpf syscalls
were replaced by calls into gen_loader. Turned out that search for valid
vmlinux_btf was not stubbed out which was causing light skeleton gen
to fail on older kernels.
- significantly reduced verbosity of gen_loader.c.
- an example trace_printk.lskel.h generated out of progs/trace_printk.c
https://gist.github.com/4ast/774ea58f8286abac6aa8e3bf3bf3b903
v4->v5:
- addressed a bunch of minor comments from Andrii.
- the main difference is that lskel is now more robust in case of errors
and a bit cleaner looking.
v3->v4:
- cleaned up closing of temporary FDs in case intermediate sys_bpf fails during
execution of loader program.
- added support for rodata in the skeleton.
- enforce bpf_prog_type_syscall to be sleepable, since it needs bpf_copy_from_user
to populate rodata map.
- converted test trace_printk to use lskel to test rodata access.
- various small bug fixes.
v2->v3: Addressed comments from Andrii and John.
- added support for setting max_entries after signature verification
and used it in ringbuf test, since ringbuf's max_entries has to be updated
after skeleton open() and before load(). See patch 20.
- bpf_btf_find_by_name_kind doesn't take btf_fd anymore.
Because of that removed attach_prog_fd from bpf_prog_desc in lskel.
Both features to be added later.
- cleaned up closing of fd==0 during loader gen by resetting fds back to -1.
- converted loader gen to use memset(&attr, cmd_specific_attr_size).
would love to see this optimization in the rest of libbpf.
- fixed memory leak during loader_gen in case of enomem.
- support for fd_array kernel feature is added in patch 9 to have
exhaustive testing across all selftests and then partially reverted
in patch 15 to keep old style map_fd patching tested as well.
- since fentry_test/fexit_tests were extended with re-attach had to add
support for per-program attach method in lskel and use it in the tests.
- cleanup closing of fds in lskel in case of partial failures.
- fixed numerous small nits.
v1->v2: Addressed comments from Al, Yonghong and Andrii.
- documented sys_close fdget/fdput requirement and non-recursion check.
- reduced internal api leaks between libbpf and bpftool.
Now bpf_object__gen_loader() is the only new libbf api with minimal fields.
- fixed light skeleton __destroy() method to munmap and close maps and progs.
- refactored bpf_btf_find_by_name_kind to return btf_id | (btf_obj_fd << 32).
- refactored use of bpf_btf_find_by_name_kind from loader prog.
- moved auto-gen like code into skel_internal.h that is used by *.lskel.h
It has minimal static inline bpf_load_and_run() method used by lskel.
- added lksel.h example in patch 15.
- replaced union bpf_map_prog_desc with struct bpf_map_desc and struct bpf_prog_desc.
- removed mark_feat_supported and added a patch to pass 'obj' into kernel_supports.
- added proper tracking of temporary FDs in loader prog and their cleanup via bpf_sys_close.
- rename gen_trace.c into gen_loader.c to better align the naming throughout.
- expanded number of available helpers in new prog type.
- added support for raw_tp attaching in lskel.
lskel supports tracing and raw_tp progs now.
It correctly loads all networking prog types too, but __attach() method is tbd.
- converted progs/test_ksyms_module.c to lskel.
- minor feedback fixes all over.
The description of V1 set is still valid:
This is a first step towards signed bpf programs and the third approach of that kind.
The first approach was to bring libbpf into the kernel as a user-mode-driver.
The second approach was to invent a new file format and let kernel execute
that format as a sequence of syscalls that create maps and load programs.
This third approach is using new type of bpf program instead of inventing file format.
1st and 2nd approaches had too many downsides comparing to this 3rd and were discarded
after months of work.
To make it work the following new concepts are introduced:
1. syscall bpf program type
A kind of bpf program that can do sys_bpf and sys_close syscalls.
It can only execute in user context.
2. FD array or FD index.
Traditionally BPF instructions are patched with FDs.
What it means that maps has to be created first and then instructions modified
which breaks signature verification if the program is signed.
Instead of patching each instruction with FD patch it with an index into array of FDs.
That makes the program signature stable if it uses maps.
3. loader program that is generated as "strace of libbpf".
When libbpf is loading bpf_file.o it does a bunch of sys_bpf() syscalls to
load BTF, create maps, populate maps and finally load programs.
Instead of actually doing the syscalls generate a trace of what libbpf
would have done and represent it as the "loader program".
The "loader program" consists of single map and single bpf program that
does those syscalls.
Executing such "loader program" via bpf_prog_test_run() command will
replay the sequence of syscalls that libbpf would have done which will result
the same maps created and programs loaded as specified in the elf file.
The "loader program" removes libelf and majority of libbpf dependency from
program loading process.
4. light skeleton
Instead of embedding the whole elf file into skeleton and using libbpf
to parse it later generate a loader program and embed it into "light skeleton".
Such skeleton can load the same set of elf files, but it doesn't need
libbpf and libelf to do that. It only needs few sys_bpf wrappers.
Future steps:
- support CO-RE in the kernel. This patch set is already too big,
so that critical feature is left for the next step.
- generate light skeleton in golang to allow such users use BTF and
all other features provided by libbpf
- generate light skeleton for kernel, so that bpf programs can be embeded
in the kernel module. The UMD usage in bpf_preload will be replaced with
such skeleton, so bpf_preload would become a standard kernel module
without user space dependency.
- finally do the signing of the loader program.
The patches are work in progress with few rough edges.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add -L flag to bpftool to use libbpf gen_trace facility and syscall/loader program
for skeleton generation and program loading.
"bpftool gen skeleton -L" command will generate a "light skeleton" or "loader skeleton"
that is similar to existing skeleton, but has one major difference:
$ bpftool gen skeleton lsm.o > lsm.skel.h
$ bpftool gen skeleton -L lsm.o > lsm.lskel.h
$ diff lsm.skel.h lsm.lskel.h
@@ -5,34 +4,34 @@
#define __LSM_SKEL_H__
#include <stdlib.h>
-#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
The light skeleton does not use majority of libbpf infrastructure.
It doesn't need libelf. It doesn't parse .o file.
It only needs few sys_bpf wrappers. All of them are in bpf/bpf.h file.
In future libbpf/bpf.c can be inlined into bpf.h, so not even libbpf.a would be
needed to work with light skeleton.
"bpftool prog load -L file.o" command is introduced for debugging of syscall/loader
program generation. Just like the same command without -L it will try to load
the programs from file.o into the kernel. It won't even try to pin them.
"bpftool prog load -L -d file.o" command will provide additional debug messages
on how syscall/loader program was generated.
Also the execution of syscall/loader program will use bpf_trace_printk() for
each step of loading BTF, creating maps, and loading programs.
The user can do "cat /.../trace_pipe" for further debug.
An example of fexit_sleep.lskel.h generated from progs/fexit_sleep.c:
struct fexit_sleep {
struct bpf_loader_ctx ctx;
struct {
struct bpf_map_desc bss;
} maps;
struct {
struct bpf_prog_desc nanosleep_fentry;
struct bpf_prog_desc nanosleep_fexit;
} progs;
struct {
int nanosleep_fentry_fd;
int nanosleep_fexit_fd;
} links;
struct fexit_sleep__bss {
int pid;
int fentry_cnt;
int fexit_cnt;
} *bss;
};
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-18-alexei.starovoitov@gmail.com
The BPF program loading process performed by libbpf is quite complex
and consists of the following steps:
"open" phase:
- parse elf file and remember relocations, sections
- collect externs and ksyms including their btf_ids in prog's BTF
- patch BTF datasec (since llvm couldn't do it)
- init maps (old style map_def, BTF based, global data map, kconfig map)
- collect relocations against progs and maps
"load" phase:
- probe kernel features
- load vmlinux BTF
- resolve externs (kconfig and ksym)
- load program BTF
- init struct_ops
- create maps
- apply CO-RE relocations
- patch ld_imm64 insns with src_reg=PSEUDO_MAP, PSEUDO_MAP_VALUE, PSEUDO_BTF_ID
- reposition subprograms and adjust call insns
- sanitize and load progs
During this process libbpf does sys_bpf() calls to load BTF, create maps,
populate maps and finally load programs.
Instead of actually doing the syscalls generate a trace of what libbpf
would have done and represent it as the "loader program".
The "loader program" consists of single map with:
- union bpf_attr(s)
- BTF bytes
- map value bytes
- insns bytes
and single bpf program that passes bpf_attr(s) and data into bpf_sys_bpf() helper.
Executing such "loader program" via bpf_prog_test_run() command will
replay the sequence of syscalls that libbpf would have done which will result
the same maps created and programs loaded as specified in the elf file.
The "loader program" removes libelf and majority of libbpf dependency from
program loading process.
kconfig, typeless ksym, struct_ops and CO-RE are not supported yet.
The order of relocate_data and relocate_calls had to change, so that
bpf_gen__prog_load() can see all relocations for a given program with
correct insn_idx-es.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-15-alexei.starovoitov@gmail.com
Add a pointer to 'struct bpf_object' to kernel_supports() helper.
It will be used in the next patch.
No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-13-alexei.starovoitov@gmail.com
In order to be able to generate loader program in the later
patches change the order of data and text relocations.
Also improve the test to include data relos.
If the kernel supports "FD array" the map_fd relocations can be processed
before text relos since generated loader program won't need to manually
patch ld_imm64 insns with map_fd.
But ksym and kfunc relocations can only be processed after all calls
are relocated, since loader program will consist of a sequence
of calls to bpf_btf_find_by_name_kind() followed by patching of btf_id
and btf_obj_fd into corresponding ld_imm64 insns. The locations of those
ld_imm64 insns are specified in relocations.
Hence process all data relocations (maps, ksym, kfunc) together after call relos.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-12-alexei.starovoitov@gmail.com
Add bpf_sys_close() helper to be used by the syscall/loader program to close
intermediate FDs and other cleanup.
Note this helper must never be allowed inside fdget/fdput bracketing.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-11-alexei.starovoitov@gmail.com
Add new helper:
long bpf_btf_find_by_name_kind(char *name, int name_sz, u32 kind, int flags)
Description
Find BTF type with given name and kind in vmlinux BTF or in module's BTFs.
Return
Returns btf_id and btf_obj_fd in lower and upper 32 bits.
It will be used by loader program to find btf_id to attach the program to
and to find btf_ids of ksyms.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-10-alexei.starovoitov@gmail.com
Typical program loading sequence involves creating bpf maps and applying
map FDs into bpf instructions in various places in the bpf program.
This job is done by libbpf that is using compiler generated ELF relocations
to patch certain instruction after maps are created and BTFs are loaded.
The goal of fd_idx is to allow bpf instructions to stay immutable
after compilation. At load time the libbpf would still create maps as usual,
but it wouldn't need to patch instructions. It would store map_fds into
__u32 fd_array[] and would pass that pointer to sys_bpf(BPF_PROG_LOAD).
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-9-alexei.starovoitov@gmail.com
Similar to prog_load make btf_load command to be availble to
bpf_prog_type_syscall program.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-7-alexei.starovoitov@gmail.com
bpf_prog_type_syscall is a program that creates a bpf map,
updates it, and loads another bpf program using bpf_sys_bpf() helper.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-6-alexei.starovoitov@gmail.com
With the help from bpfptr_t prepare relevant bpf syscall commands
to be used from kernel and user space.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-4-alexei.starovoitov@gmail.com
Similar to sockptr_t introduce bpfptr_t with few additions:
make_bpfptr() creates new user/kernel pointer in the same address space as
existing user/kernel pointer.
bpfptr_add() advances the user/kernel pointer.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-3-alexei.starovoitov@gmail.com
Add placeholders for bpf_sys_bpf() helper and new program type.
Make sure to check that expected_attach_type is zero for future extensibility.
Allow tracing helper functions to be used in this program type, since they will
only execute from user context via bpf_prog_test_run.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-2-alexei.starovoitov@gmail.com
Similar to the way in which of_mdiobus_register() has a fallback to the
non-DT based mdiobus_register() when CONFIG_OF is not set, we can create
a shim for the device-managed devm_of_mdiobus_register() which calls
devm_mdiobus_register() and discards the struct device_node *.
In particular, this solves a build issue with the qca8k DSA driver which
uses devm_of_mdiobus_register and can be compiled without CONFIG_OF.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The list_head dcb_app_list is initialized statically.
It is unnecessary to initialize by INIT_LIST_HEAD().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using list_del_init() instead of list_del() + INIT_LIST_HEAD()
to simpify the code.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang says:
====================
net: wan: clean up some code style issues
This patchset clean up some code style issues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the checkpatch error: "foo* bar" should be "foo *bar".
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Space prohibited before that close parenthesis ')',
so removes the redundant space.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Braces {} are not necessary for single statement blocks,
this patch removes redundant braces {}.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add space required before the open parenthesis '(',
and add spaces required around that '<', '>' and '!='.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes some redundant blank lines.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the missing unlock before return from function qca8k_vlan_add()
and qca8k_vlan_del() in the error handling case.
Fixes: 028f5f8ef4 ("net: dsa: qca8k: handle error with qca8k_read operation")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cipso_v4_cache_invalidate() has no return value, so drop
related descriptions in its comments.
Fixes: 446fda4f26 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel says:
====================
Add support for custom multipath hash
This patchset adds support for custom multipath hash policy for both
IPv4 and IPv6 traffic. The new policy allows user space to control the
outer and inner packet fields used for the hash computation.
Motivation
==========
Linux currently supports different multipath hash policies for IPv4 and
IPv6 traffic:
* Layer 3
* Layer 4
* Layer 3 or inner layer 3, if present
These policies hash on a fixed set of fields, which is inflexible and
against operators' requirements to control the hash input: "The ability
to control the inputs to the hash function should be a consideration in
any load-balancing RFP" [1].
An example of this inflexibility can be seen by the fact that none of
the current policies allows operators to use the standard 5-tuple and
the flow label for multipath hash computation. Such a policy is useful
in the following real-world example of a data center with the following
types of traffic:
* Anycast IPv6 TCP traffic towards layer 4 load balancers. Flow label is
constant (zero) to avoid breaking established connections
* Non-encapsulated IPv6 traffic. Flow label is used to re-route flows
around problematic (congested / failed) paths [2]
* IPv6 encapsulated traffic (IPv4-in-IPv6 or IPv6-in-IPv6). Outer flow
label is generated from encapsulated packet
* UDP encapsulated traffic. Outer source port is generated from
encapsulated packet
In the above example, using the inner flow information for hash
computation in addition to the outer flow information is useful during
failures of the BPF agent that selectively generates the flow label
based on the traffic type. In such cases, the self-healing properties of
the flow label are lost, but encapsulated flows are still load balanced.
Control over the inner fields is even more critical when encapsulation
is performed by hardware routers. For example, the Spectrum ASIC can
only encode 8 bits of entropy in the outer flow label / outer UDP source
port when performing IP / UDP encapsulation. In the case of IPv4 GRE
encapsulation there is no outer field to encode the inner hash in.
User interface
==============
In accordance with existing multipath hash configuration, the new custom
policy is added as a new option (3) to the
net.ipv{4,6}.fib_multipath_hash_policy sysctls. When the new policy is
used, the packet fields used for hash computation are determined by the
net.ipv{4,6}.fib_multipath_hash_fields sysctls. These sysctls accept a
bitmask according to the following table (from ip-sysctl.rst):
====== ============================
0x0001 Source IP address
0x0002 Destination IP address
0x0004 IP protocol
0x0008 Flow Label
0x0010 Source port
0x0020 Destination port
0x0040 Inner source IP address
0x0080 Inner destination IP address
0x0100 Inner IP protocol
0x0200 Inner Flow Label
0x0400 Inner source port
0x0800 Inner destination port
====== ============================
For example, to allow IPv6 traffic to be hashed based on standard
5-tuple and flow label:
# sysctl -wq net.ipv6.fib_multipath_hash_fields=0x0037
# sysctl -wq net.ipv6.fib_multipath_hash_policy=3
Implementation
==============
As with existing policies, the new policy relies on the flow dissector
to extract the packet fields for the hash computation. However, unlike
existing policies that either use the outer or inner flow, the new
policy might require both flows to be dissected.
To avoid unnecessary invocations of the flow dissector, the data path
skips dissection of the outer or inner flows if none of the outer or
inner fields are required.
In addition, inner flow dissection is not performed when no
encapsulation was encountered (i.e., 'FLOW_DIS_ENCAPSULATION' not set by
flow dissector) during dissection of the outer flow.
Testing
=======
Three new selftests are added with three different topologies that allow
testing of following traffic combinations:
* Non-encapsulated IPv4 / IPv6 traffic
* IPv4 / IPv6 overlay over IPv4 underlay
* IPv4 / IPv6 overlay over IPv6 underlay
All three tests follow the same pattern. Each time a different packet
field is used for hash computation. When the field changes in the packet
stream, traffic is expected to be balanced across the two paths. When
the field does not change, traffic is expected to be unbalanced across
the two paths.
Patchset overview
=================
Patches #1-#3 add custom multipath hash support for IPv4 traffic
Patches #4-#7 do the same for IPv6
Patches #8-#10 add selftests
Future work
===========
mlxsw support can be found here [3].
Changes since RFC v2 [4]:
* Patch #2: Document that 0x0008 is used for Flow Label
* Patch #2: Do not allow the bitmask to be zero
* Patch #6: Do not allow the bitmask to be zero
Changes since RFC v1 [5]:
* Use a bitmask instead of a bitmap
[1] https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3acf3ec3f4b0fd4263989f2e4227bbd1c42b5fe1
[3] https://github.com/idosch/linux/tree/submit/custom_hash_mlxsw_v2
[4] https://lore.kernel.org/netdev/20210509151615.200608-1-idosch@idosch.org/
[5] https://lore.kernel.org/netdev/20210502162257.3472453-1-idosch@idosch.org/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Test that when the hash policy is set to custom, traffic is distributed
only according to the inner fields set in the fib_multipath_hash_fields
sysctl.
Each time set a different field and make sure traffic is only
distributed when the field is changed in the packet stream.
The test only verifies the behavior of IPv4/IPv6 overlays on top of an
IPv6 underlay network. The previous patch verified the same with an IPv4
underlay network.
Example output:
# ./ip6gre_custom_multipath_hash.sh
TEST: ping [ OK ]
TEST: ping6 [ OK ]
INFO: Running IPv4 overlay custom multipath hash tests
TEST: Multipath hash field: Inner source IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6602 / 6002
TEST: Multipath hash field: Inner source IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 1 / 12601
TEST: Multipath hash field: Inner destination IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6802 / 5801
TEST: Multipath hash field: Inner destination IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12602 / 3
TEST: Multipath hash field: Inner source port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16431 / 16344
TEST: Multipath hash field: Inner source port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 32773
TEST: Multipath hash field: Inner destination port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16431 / 16344
TEST: Multipath hash field: Inner destination port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 2 / 32772
INFO: Running IPv6 overlay custom multipath hash tests
TEST: Multipath hash field: Inner source IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6704 / 5902
TEST: Multipath hash field: Inner source IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 1 / 12600
TEST: Multipath hash field: Inner destination IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 5751 / 6852
TEST: Multipath hash field: Inner destination IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12602 / 0
TEST: Multipath hash field: Inner flowlabel (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 8272 / 8181
TEST: Multipath hash field: Inner flowlabel (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 3 / 12602
TEST: Multipath hash field: Inner source port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16424 / 16351
TEST: Multipath hash field: Inner source port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 3 / 32774
TEST: Multipath hash field: Inner destination port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16425 / 16350
TEST: Multipath hash field: Inner destination port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 2 / 32773
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Test that when the hash policy is set to custom, traffic is distributed
only according to the inner fields set in the fib_multipath_hash_fields
sysctl.
Each time set a different field and make sure traffic is only
distributed when the field is changed in the packet stream.
The test only verifies the behavior of IPv4/IPv6 overlays on top of an
IPv4 underlay network. A subsequent patch will do the same with an IPv6
underlay network.
Example output:
# ./gre_custom_multipath_hash.sh
TEST: ping [ OK ]
TEST: ping6 [ OK ]
INFO: Running IPv4 overlay custom multipath hash tests
TEST: Multipath hash field: Inner source IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6601 / 6001
TEST: Multipath hash field: Inner source IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 12600
TEST: Multipath hash field: Inner destination IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6802 / 5802
TEST: Multipath hash field: Inner destination IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12601 / 1
TEST: Multipath hash field: Inner source port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16430 / 16344
TEST: Multipath hash field: Inner source port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 32772
TEST: Multipath hash field: Inner destination port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16430 / 16343
TEST: Multipath hash field: Inner destination port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 32772
INFO: Running IPv6 overlay custom multipath hash tests
TEST: Multipath hash field: Inner source IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6702 / 5900
TEST: Multipath hash field: Inner source IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 12601
TEST: Multipath hash field: Inner destination IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 5751 / 6851
TEST: Multipath hash field: Inner destination IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12602 / 1
TEST: Multipath hash field: Inner flowlabel (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 8364 / 8065
TEST: Multipath hash field: Inner flowlabel (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12601 / 0
TEST: Multipath hash field: Inner source port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16425 / 16349
TEST: Multipath hash field: Inner source port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 1 / 32770
TEST: Multipath hash field: Inner destination port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16425 / 16349
TEST: Multipath hash field: Inner destination port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 2 / 32770
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Test that when the hash policy is set to custom, traffic is distributed
only according to the outer fields set in the fib_multipath_hash_fields
sysctl.
Each time set a different field and make sure traffic is only
distributed when the field is changed in the packet stream.
The test only verifies the behavior with non-encapsulated IPv4 and IPv6
packets. Subsequent patches will add tests for IPv4/IPv6 overlays on top
of IPv4/IPv6 underlay networks.
Example output:
# ./custom_multipath_hash.sh
TEST: ping [ OK ]
TEST: ping6 [ OK ]
INFO: Running IPv4 custom multipath hash tests
TEST: Multipath hash field: Source IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6353 / 6254
TEST: Multipath hash field: Source IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 12600
TEST: Multipath hash field: Destination IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6102 / 6502
TEST: Multipath hash field: Destination IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 1 / 12601
TEST: Multipath hash field: Source port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16428 / 16345
TEST: Multipath hash field: Source port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 32770 / 2
TEST: Multipath hash field: Destination port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16428 / 16345
TEST: Multipath hash field: Destination port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 32770 / 2
INFO: Running IPv6 custom multipath hash tests
TEST: Multipath hash field: Source IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 6704 / 5903
TEST: Multipath hash field: Source IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12600 / 0
TEST: Multipath hash field: Destination IP (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 5551 / 7052
TEST: Multipath hash field: Destination IP (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 12603 / 0
TEST: Multipath hash field: Flowlabel (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 8378 / 8080
TEST: Multipath hash field: Flowlabel (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 2 / 12603
TEST: Multipath hash field: Source port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16385 / 16388
TEST: Multipath hash field: Source port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 0 / 32774
TEST: Multipath hash field: Destination port (balanced) [ OK ]
INFO: Packets sent on path1 / path2: 16386 / 16390
TEST: Multipath hash field: Destination port (unbalanced) [ OK ]
INFO: Packets sent on path1 / path2: 32771 / 2
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.
The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if none of the outer or inner fields are required.
In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In which case,
only outer fields are considered.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add a new multipath hash policy where the packet
fields used for multipath hash calculation are determined by user space.
This patch adds a sysctl that allows user space to set these fields.
The packet fields are represented using a bitmask and are common between
IPv4 and IPv6 to allow user space to use the same numbering across both
protocols. For example, to hash based on standard 5-tuple:
# sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
net.ipv6.fib_multipath_hash_fields = 0x0037
To avoid introducing holes in 'struct netns_sysctl_ipv6', move the
'bindv6only' field after the multipath hash fields.
The kernel rejects unknown fields, for example:
# sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000
sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add another multipath hash policy where the
multipath hash is calculated directly by the policy specific code and
not outside of the switch statement.
Prepare for this change by moving the multipath hash calculation inside
the switch statement.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'out_timer' label was added in commit 63152fc0de ("[NETNS][IPV6]
ip6_fib - gc timer per namespace") when the timer was allocated on the
heap.
Commit 417f28bb34 ("netns: dont alloc ipv6 fib timer list") removed
the allocation, but kept the label name.
Rename it to a more suitable name.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.
The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if none of the outer or inner fields are required.
In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In which case,
only outer fields are considered.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add a new multipath hash policy where the packet
fields used for multipath hash calculation are determined by user space.
This patch adds a sysctl that allows user space to set these fields.
The packet fields are represented using a bitmask and are common between
IPv4 and IPv6 to allow user space to use the same numbering across both
protocols. For example, to hash based on standard 5-tuple:
# sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
net.ipv4.fib_multipath_hash_fields = 0x0037
The kernel rejects unknown fields, for example:
# sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
More fields can be added in the future, if needed.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add another multipath hash policy where the
multipath hash is calculated directly by the policy specific code and
not outside of the switch statement.
Prepare for this change by moving the multipath hash calculation inside
the switch statement.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the
coalesced packet payload can be > MSS, but < MSS + 20.
bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload
length. After then tcp_gso_segment checks for the payload length if it
is <= MSS. The condition is causing the packet to be dropped.
tcp_gso_segment():
[...]
mss = skb_shinfo(skb)->gso_size;
if (unlikely(skb->len <= mss))
goto out;
[...]
Allow to upgrade/downgrade MSS only when BPF_F_ADJ_ROOM_FIXED_GSO is
not set.
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/bpf/1620804453-57566-1-git-send-email-dseok.yi@samsung.com
'err' and 'flags' are not used, we can just get rid of them.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210517022348.50555-1-xiyou.wangcong@gmail.com
After commit 96a71005bd ("bpf, arm64: remove obsolete exception handling
from div/mod"), there is no need to check twice about BPF_DIV and BPF_MOD,
remove the redundant switch case.
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1621328170-17583-1-git-send-email-yangtiezhu@loongson.cn
Remove leading spaces before tabs in Kconfig file(s) by running the
following command:
$ find drivers/net -name 'Kconfig*' | xargs sed -r -i 's/^[ ]+\t/\t/'
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variable ret is set to '0' or '-EBUSY', but this value is never read
as it is not used later on, hence it is a redundant assignment and
can be removed.
Clean up the following clang-analyzer warning:
net/packet/af_packet.c:3936:4: warning: Value stored to 'ret' is never
read [clang-analyzer-deadcode.DeadStores].
net/packet/af_packet.c:3933:4: warning: Value stored to 'ret' is never
read [clang-analyzer-deadcode.DeadStores].
No functional change.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Sit Wei Hong says:
====================
Introducing support for DWC xpcs Energy Efficient Ethernet
The goal of this patch set is to enable EEE in the xpcs so that when
EEE is enabled, the MAC-->xpcs-->PHY have all the EEE related
configurations enabled.
Patch 1 adds the functions to enable EEE in the xpcs and sets it to
transparent mode.
Patch 2 adds the callbacks to configure the xpcs EEE mode.
The results are tested by checking the lpi counters of the tx and rx
path of the interface. When EEE is enabled, the lpi counters should
increament as it enters and exits lpi states.
host@EHL$ ethtool --show-eee enp0s30f4
EEE Settings for enp0s30f4:
EEE status: disabled
Tx LPI: disabled
Supported EEE link modes: 100baseT/Full
1000baseT/Full
Advertised EEE link modes: Not reported
Link partner advertised EEE link modes: 100baseT/Full
1000baseT/Full
host@EHL$ ethtool -S enp0s30f4 | grep lpi
irq_tx_path_in_lpi_mode_n: 0
irq_tx_path_exit_lpi_mode_n: 0
irq_rx_path_in_lpi_mode_n: 0
irq_rx_path_exit_lpi_mode_n: 0
host@EHL$ ethtool --set-eee enp0s30f4 eee on
host@EHL$ [ 110.265154] intel-eth-pci 0000:00:1e.4 enp0s30f4: Link is Down
[ 112.315155] intel-eth-pci 0000:00:1e.4 enp0s30f4: Link is Up - 1Gbps/Full - flow control off
[ 112.324612] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s30f4: link becomes ready
host@EHL$ ethtool --show-eee enp0s30f4
EEE Settings for enp0s30f4:
EEE status: enabled - active
Tx LPI: 1000000 (us)
Supported EEE link modes: 100baseT/Full
1000baseT/Full
Advertised EEE link modes: 100baseT/Full
1000baseT/Full
Link partner advertised EEE link modes: 100baseT/Full
1000baseT/Full
host@EHL$ ethtool -S enp0s30f4 | grep lpi
irq_tx_path_in_lpi_mode_n: 6
irq_tx_path_exit_lpi_mode_n: 5
irq_rx_path_in_lpi_mode_n: 7
irq_rx_path_exit_lpi_mode_n: 6
host@EHL$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=1.02 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.510 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.489 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.484 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=64 time=0.504 ms
64 bytes from 192.168.1.1: icmp_seq=6 ttl=64 time=0.466 ms
64 bytes from 192.168.1.1: icmp_seq=7 ttl=64 time=0.529 ms
64 bytes from 192.168.1.1: icmp_seq=8 ttl=64 time=0.519 ms
64 bytes from 192.168.1.1: icmp_seq=9 ttl=64 time=0.518 ms
64 bytes from 192.168.1.1: icmp_seq=10 ttl=64 time=0.501 ms
--- 192.168.1.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9216ms
rtt min/avg/max/mdev = 0.466/0.553/1.018/0.155 ms
host@EHL$ ethtool -S enp0s30f4 | grep lpi
irq_tx_path_in_lpi_mode_n: 22
irq_tx_path_exit_lpi_mode_n: 21
irq_rx_path_in_lpi_mode_n: 21
irq_rx_path_exit_lpi_mode_n: 20
====================
Signed-off-by: David S. Miller <davem@davemloft.net>