OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jay Jayatheerthan	cd9e72b6f2	samples/bpf: xdpsock: Add option to specify batch size New option to specify batch size for tx, rx and l2fwd has been added. This allows fine tuning to maximize performance. It is specified using '-b' or '--batch_size' options. When not specified default is 64. Signed-off-by: Jay Jayatheerthan <jay.jayatheerthan@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191220085530.4980-4-jay.jayatheerthan@intel.com	2019-12-20 16:10:39 -08:00
Jay Jayatheerthan	695255882b	samples/bpf: xdpsock: Use common code to handle signal and main exit Add code to do cleanup for signals and application completion in a unified fashion. The signal handler sets benckmark_done flag terminating the threads. The cleanup is called before returning from main() function. Signed-off-by: Jay Jayatheerthan <jay.jayatheerthan@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191220085530.4980-3-jay.jayatheerthan@intel.com	2019-12-20 16:10:39 -08:00
Jay Jayatheerthan	d3f11b018f	samples/bpf: xdpsock: Add duration option to specify how long to run The application now supports '-d' or '--duration' option to specify number of seconds to run. This is used in tx, rx and l2fwd features. If this option is not provided, the application runs forever. Signed-off-by: Jay Jayatheerthan <jay.jayatheerthan@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191220085530.4980-2-jay.jayatheerthan@intel.com	2019-12-20 16:10:39 -08:00
Andrey Ignatov	478bee0df0	selftests/bpf: Preserve errno in test_progs CHECK macros It's follow-up for discussion [1] CHECK and CHECK_FAIL macros in test_progs.h can affect errno in some circumstances, e.g. if some code accidentally closes stdout. It makes checking errno in patterns like this unreliable: if (CHECK(!bpf_prog_attach_xattr(...), "tag", "msg")) goto err; CHECK_FAIL(errno != ENOENT); , since by CHECK_FAIL time errno could be affected not only by bpf_prog_attach_xattr but by CHECK as well. Fix it by saving and restoring errno in the macros. There is no "Fixes" tag since no problems were discovered yet and it's rather precaution. test_progs was run with this change and no difference was identified. [1] https://lore.kernel.org/bpf/20191219210907.GD16266@rdna-mbp.dhcp.thefacebook.com/ Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191220000511.1684853-1-rdna@fb.com	2019-12-20 16:06:02 -08:00
Alexei Starovoitov	ce3cec2793	Merge branch 'xsk-cleanup' Magnus Karlsson says: ==================== This patch set cleans up the ring access functions of AF_XDP in hope that it will now be easier to understand and maintain. I used to get a headache every time I looked at this code in order to really understand it, but now I do think it is a lot less painful. The code has been simplified a lot and as a bonus we get better performance in nearly all cases. On my new 2.1 GHz Cascade Lake machine with a standard default config plus AF_XDP support and CONFIG_PREEMPT on I get the following results in percent performance increases with this patch set compared to without it: Zero-copy (-N): rxdrop txpush l2fwd 1 core: -2% 0% 3% 2 cores: 4% 0% 3% Zero-copy with poll() (-N -p): rxdrop txpush l2fwd 1 core: 3% 0% 1% 2 cores: 21% 0% 9% Skb mode (-S): Shows a 0% to 5% performance improvement over the same benchmarks as above. Here 1 core means that we are running the driver processing and the application on the same core, while 2 cores means that they execute on separate cores. The applications are from the xdpsock sample app. On my older 2.0 Ghz Broadwell machine that I used for the v1, I get the following results: Zero-copy (-N): rxdrop txpush l2fwd 1 core: 4% 5% 4% 2 cores: 1% 0% 2% Zero-copy with poll() (-N -p): rxdrop txpush l2fwd 1 core: 1% 3% 3% 2 cores: 22% 0% 5% Skb mode (-S): Shows a 0% to 1% performance improvement over the same benchmarks as above. When a results says 21 or 22% better, as in the case of poll mode with 2 cores and rxdrop, my first reaction is that it must be a bug. Everything else shows between 0% and 5% performance improvement. What is giving rise to 22%? A quick bisect indicates that it is patches 2, 3, 4, 5, and 6 that are giving rise to most of this improvement. So not one patch in particular, but something around 4% improvement from each one of them. Note that exactly this benchmark has previously had an extraordinary slow down compared to when running without poll syscalls. For all the other poll tests above, the slowdown has always been around 4% for using poll syscalls. But with the bad performing test in question, it was above 25%. Interestingly, after this clean up, the slow down is 4%, just like all the other poll tests. Please take an extra peek at this so I have not messed up something. The 0% for several txpush results are due to the test bottlenecking on a non-CPU HW resource. If I eliminated that bottleneck on my system, I would expect to see an increase there too. Changes v1 -> v2: * Corrected textual errors in the commit logs (Sergei and Martin) * Fixed the functions that detect empty and full rings so that they now operate on the global ring state (Maxim) This patch has been applied against commit `a352a82496` ("Merge branch 'libbpf-extern-followups'") Structure of the patch set: Patch 1: Eliminate the lazy update threshold used when preallocating entries in the completion ring Patch 2: Simplify the detection of empty and full rings Patch 3: Consolidate the two local producer pointers into one Patch 4: Standardize the naming of the producer ring access functions Patch 5: Eliminate the Rx batch size used for the fill ring Patch 6: Simplify the functions xskq_nb_avail and xskq_nb_free Patch 7: Simplify and standardize the naming of the consumer ring access functions Patch 8: Change the names of the validation functions to improve readability and also the return value of these functions Patch 9: Change the name of xsk_umem_discard_addr() to xsk_umem_release_addr() to better reflect the new names. Requires a name change in the drivers that support AF_XDP zero-copy. Patch 10: Remove unnecessary READ_ONCE of data in the ring Patch 11: Add overall function naming comment and reorder the functions for easier reference Patch 12: Use the struct_size helper function when allocating rings ==================== Reviewed-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-12-20 16:00:20 -08:00
Magnus Karlsson	1d9cb1f381	xsk: Use struct_size() helper Improve readability and maintainability by using the struct_size() helper when allocating the AF_XDP rings. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-13-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	15d8c9162c	xsk: Add function naming comments and reorder functions Add comments on how the ring access functions are named and how they are supposed to be used for producers and consumers. The functions are also reordered so that the consumer functions are in the beginning and the producer functions in the end, for easier reference. Put this in a separate patch as the diff might look a little odd, but no functionality has changed in this patch. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-12-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	c34787fcc9	xsk: Remove unnecessary READ_ONCE of data There are two unnecessary READ_ONCE of descriptor data. These are not needed since the data is written by the producer before it signals that the data is available by incrementing the producer pointer. As the access to this producer pointer is serialized and the consumer always reads the descriptor after it has read and synchronized with the producer counter, the write of the descriptor will have fully completed and it does not matter if the consumer has any read tearing. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-11-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	f8509aa078	xsk: ixgbe: i40e: ice: mlx5: Xsk_umem_discard_addr to xsk_umem_release_addr Change the name of xsk_umem_discard_addr to xsk_umem_release_addr to better reflect the new naming of the AF_XDP queue manipulation functions. As this functions is used by drivers implementing support for AF_XDP zero-copy, it requires a name change to these drivers. The function xsk_umem_release_addr_rq has also changed name in the same fashion. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-10-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	03896ef1f0	xsk: Change names of validation functions Change the names of the validation functions to better reflect what they are doing. The uppermost ones are reading entries from the rings and only the bottom ones validate entries. So xskq_cons_read_ is a better prefix name. Also change the xskq_cons_read_ functions to return a bool as the the descriptor or address is already returned by reference in the parameters. Everyone is using the return value as a bool anyway. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-9-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	c5ed924b54	xsk: Simplify the consumer ring access functions Simplify and refactor consumer ring functions. The consumer first "peeks" to find descriptors or addresses that are available to read from the ring, then reads them and finally "releases" these descriptors once it is done. The two local variables cons_tail and cons_head are turned into one single variable called cached_cons. cached_tail referred to the cached value of the global consumer pointer and will be stored in cached_cons. For cached_head, we just use cached_prod instead as it was not used for a consumer queue before. It also better reflects what it really is now: a cached copy of the producer pointer. The names of the functions are also renamed in the same manner as the producer functions. The new functions are called xskq_cons_ followed by what it does. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-8-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	df0ae6f78a	xsk: Simplify xskq_nb_avail and xskq_nb_free At this point, there are no users of the functions xskq_nb_avail and xskq_nb_free that take any other number of entries argument than 1, so let us get rid of the second argument that takes the number of entries. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-7-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	4b638f13ba	xsk: Eliminate the RX batch size In the xsk consumer ring code there is a variable called RX_BATCH_SIZE that dictates the minimum number of entries that we try to grab from the fill and Tx rings. In fact, the code always try to grab the maximum amount of entries from these rings. The only thing this variable does is to throw an error if there is less than 16 (as it is defined) entries on the ring. There is no reason to do this and it will just lead to weird behavior from user space's point of view. So eliminate this variable. With this change, we will be able to simplify the xskq_nb_free and xskq_nb_avail code in the next commit. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-6-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	59e35e5525	xsk: Standardize naming of producer ring access functions Adopt the naming of the producer ring access functions to have a similar naming convention as the functions in libbpf, but adapted to the kernel. You first reserve a number of entries that you later submit to the global state of the ring. This is much clearer, IMO, than the one that was in the kernel part. Once renamed, we also discover that two functions are actually the same, so remove one of them. Some of the primitive ring submission operations are also the same so break these out into __xskq_prod_submit that the upper level ring access functions can use. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-5-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	d7012f05e3	xsk: Consolidate to one single cached producer pointer Currently, the xsk ring code has two cached producer pointers: prod_head and prod_tail. This patch consolidates these two into a single one called cached_prod to make the code simpler and easier to maintain. This will be in line with the user space part of the the code found in libbpf, that only uses a single cached pointer. The Rx path only uses the two top level functions xskq_produce_batch_desc and xskq_produce_flush_desc and they both use prod_head and never prod_tail. So just move them over to cached_prod. The Tx XDP_DRV path uses xskq_produce_addr_lazy and xskq_produce_flush_addr_n and unnecessarily operates on both prod_tail and prod_head, so move them over to just use cached_prod by skipping the intermediate step of updating prod_tail. The Tx path in XDP_SKB mode uses xskq_reserve_addr and xskq_produce_addr. They currently use both cached pointers, but we can operate on the global producer pointer in xskq_produce_addr since it has to be updated anyway, thus eliminating the use of both cached pointers. We can also remove the xskq_nb_free in xskq_produce_addr since it is already called in xskq_reserve_addr. No need to do it twice. When there is only one cached producer pointer, we can also simplify xskq_nb_free by removing one argument. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-4-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:09 -08:00
Magnus Karlsson	11cc2d2149	xsk: Simplify detection of empty and full rings In order to set the correct return flags for poll, the xsk code has to check if the Rx queue is empty and if the Tx queue is full. This code was unnecessarily large and complex as it used the functions that are used to update the local state from the global state (xskq_nb_free and xskq_nb_avail). Since we are not doing this nor updating any data dependent on this state, we can simplify the functions. Another benefit from this is that we can also simplify the xskq_nb_free and xskq_nb_avail functions in a later commit. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-3-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:08 -08:00
Magnus Karlsson	484b165306	xsk: Eliminate the lazy update threshold The lazy update threshold was introduced to keep the producer and consumer some distance apart in the completion ring. This was important in the beginning of the development of AF_XDP as the ring format as that point in time was very sensitive to the producer and consumer being on the same cache line. This is not the case anymore as the current ring format does not degrade in any noticeable way when this happens. Moreover, this threshold makes it impossible to run the system with rings that have less than 128 entries. So let us remove this threshold and just get one entry from the ring as in all other functions. This will enable us to remove this function in a later commit. Note that xskq_produce_addr_lazy followed by xskq_produce_flush_addr_n are still not the same function as xskq_produce_addr() as it operates on another cached pointer. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-2-git-send-email-magnus.karlsson@intel.com	2019-12-20 16:00:08 -08:00
Alexei Starovoitov	99cacdc6f6	Merge branch 'replace-cg_bpf-prog' Andrey Ignatov says: ==================== v3->v4: - use OPTS_VALID and OPTS_GET to handle bpf_prog_attach_opts. v2->v3: - rely on DECLARE_LIBBPF_OPTS from libbpf_common.h; - separate "required" and "optional" arguments in bpf_prog_attach_xattr; - convert test_cgroup_attach to prog_tests; - move the new selftest to prog_tests/cgroup_attach_multi. v1->v2: - move DECLARE_LIBBPF_OPTS from libbpf.h to bpf.h (patch 4); - switch new libbpf API to OPTS framework; - switch selftest to libbpf OPTS framework. This patch set adds support for replacing cgroup-bpf programs attached with BPF_F_ALLOW_MULTI flag so that any program in a list can be updated to a new version without service interruption and order of programs can be preserved. Please see patch 3 for details on the use-case and API changes. Other patches: Patch 1 is preliminary refactoring of __cgroup_bpf_attach to simplify it. Patch 2 is minor cleanup of hierarchy_allows_attach. Patch 4 extends libbpf API to support new set of attach attributes. Patch 5 converts test_cgroup_attach to prog_tests. Patch 6 adds selftest coverage for the new API. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-12-19 21:23:40 -08:00
Andrey Ignatov	06ac0186bd	selftests/bpf: Test BPF_F_REPLACE in cgroup_attach_multi Test replacing a cgroup-bpf program attached with BPF_F_ALLOW_MULTI and possible failure modes: invalid combination of flags, invalid replace_bpf_fd, replacing a non-attachd to specified cgroup program. Example of program replacing: # gdb -q --args ./test_progs --name=cgroup_attach_multi ... Breakpoint 1, test_cgroup_attach_multi () at cgroup_attach_multi.c:227 (gdb) [1]+ Stopped gdb -q --args ./test_progs --name=cgroup_attach_multi # bpftool c s /mnt/cgroup2/cgroup-test-work-dir/cg1 ID AttachType AttachFlags Name 2133 egress multi 2134 egress multi # fg gdb -q --args ./test_progs --name=cgroup_attach_multi (gdb) c Continuing. Breakpoint 2, test_cgroup_attach_multi () at cgroup_attach_multi.c:233 (gdb) [1]+ Stopped gdb -q --args ./test_progs --name=cgroup_attach_multi # bpftool c s /mnt/cgroup2/cgroup-test-work-dir/cg1 ID AttachType AttachFlags Name 2139 egress multi 2134 egress multi Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/7b9b83e8d5fb82e15b034341bd40b6fb2431eeba.1576741281.git.rdna@fb.com	2019-12-19 21:22:26 -08:00
Andrey Ignatov	257c88559f	selftests/bpf: Convert test_cgroup_attach to prog_tests Convert test_cgroup_attach to prog_tests. This change does a lot of things but in many cases it's pretty expensive to separate them, so they go in one commit. Nevertheless the logic is ketp as is and changes made are just moving things around, simplifying them (w/o changing the meaning of the tests) and making prog_tests compatible: * split the 3 tests in the file into 3 separate files in prog_tests/; * rename the test functions to test_<file_base_name>; * remove unused includes, constants, variables and functions from every test; * replace `if`-s with or `if (CHECK())` where additional context should be logged and with `if (CHECK_FAIL())` where line number is enough; * switch from `log_err()` to logging via `CHECK()`; * replace `assert`-s with `CHECK_FAIL()` to avoid crashing the whole test_progs if one assertion fails; * replace cgroup_helpers with test__join_cgroup() in cgroup_attach_override only, other tests need more fine-grained control for cgroup creation/deletion so cgroup_helpers are still used there; * simplify cgroup_attach_autodetach by switching to easiest possible program since this test doesn't really need such a complicated program as cgroup_attach_multi does; * remove test_cgroup_attach.c itself. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/0ff19cc64d2dc5cf404349f07131119480e10e32.1576741281.git.rdna@fb.com	2019-12-19 21:22:26 -08:00
Andrey Ignatov	cdbee3839c	libbpf: Introduce bpf_prog_attach_xattr Introduce a new bpf_prog_attach_xattr function that, in addition to program fd, target fd and attach type, accepts an extendable struct bpf_prog_attach_opts. bpf_prog_attach_opts relies on DECLARE_LIBBPF_OPTS macro to maintain backward and forward compatibility and has the following "optional" attach attributes: * existing attach_flags, since it's not required when attaching in NONE mode. Even though it's quite often used in MULTI and OVERRIDE mode it seems to be a good idea to reduce number of arguments to bpf_prog_attach_xattr; * newly introduced attribute of BPF_PROG_ATTACH command: replace_prog_fd that is fd of previously attached cgroup-bpf program to replace if BPF_F_REPLACE flag is used. The new function is named to be consistent with other xattr-functions (bpf_prog_test_run_xattr, bpf_create_map_xattr, bpf_load_program_xattr). The struct bpf_prog_attach_opts is supposed to be used with DECLARE_LIBBPF_OPTS macro. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/bd6e0732303eb14e4b79cb128268d9e9ad6db208.1576741281.git.rdna@fb.com	2019-12-19 21:22:25 -08:00
Andrey Ignatov	7dd68b3279	bpf: Support replacing cgroup-bpf program in MULTI mode The common use-case in production is to have multiple cgroup-bpf programs per attach type that cover multiple use-cases. Such programs are attached with BPF_F_ALLOW_MULTI and can be maintained by different people. Order of programs usually matters, for example imagine two egress programs: the first one drops packets and the second one counts packets. If they're swapped the result of counting program will be different. It brings operational challenges with updating cgroup-bpf program(s) attached with BPF_F_ALLOW_MULTI since there is no way to replace a program: * One way to update is to detach all programs first and then attach the new version(s) again in the right order. This introduces an interruption in the work a program is doing and may not be acceptable (e.g. if it's egress firewall); * Another way is attach the new version of a program first and only then detach the old version. This introduces the time interval when two versions of same program are working, what may not be acceptable if a program is not idempotent. It also imposes additional burden on program developers to make sure that two versions of their program can co-exist. Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI flag. This mode is enabled by newly introduced BPF_F_REPLACE attach flag and bpf_attr.replace_bpf_fd attribute to pass fd of the old program to replace That way user can replace any program among those attached with BPF_F_ALLOW_MULTI flag without the problems described above. Details of the new API: * If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid descriptor of BPF program, BPF_PROG_ATTACH will return corresponding error (EINVAL or EBADF). * If replace_bpf_fd has valid descriptor of BPF program but such a program is not attached to specified cgroup, BPF_PROG_ATTACH will return ENOENT. BPF_F_REPLACE is introduced to make the user intent clear, since replace_bpf_fd alone can't be used for this (its default value, 0, is a valid fd). BPF_F_REPLACE also makes it possible to extend the API in the future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed). Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Narkyiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com	2019-12-19 21:22:25 -08:00
Andrey Ignatov	9fab329d6a	bpf: Remove unused new_flags in hierarchy_allows_attach() new_flags is unused, remove it. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/2c49b30ab750f93cfef04a1e40b097d70c3a39a1.1576741281.git.rdna@fb.com	2019-12-19 21:22:25 -08:00
Andrey Ignatov	1020c1f24a	bpf: Simplify __cgroup_bpf_attach __cgroup_bpf_attach has a lot of identical code to handle two scenarios: BPF_F_ALLOW_MULTI is set and unset. Simplify it by splitting the two main steps: * First, the decision is made whether a new bpf_prog_list entry should be allocated or existing entry should be reused for the new program. This decision is saved in replace_pl pointer; * Next, replace_pl pointer is used to handle both possible states of BPF_F_ALLOW_MULTI flag (set / unset) instead of doing similar work for them separately. This splitting, in turn, allows to make further simplifications: * The check for attaching same program twice in BPF_F_ALLOW_MULTI mode can be done before allocating cgroup storage, so that if user tries to attach same program twice no alloc/free happens as it was before; * pl_was_allocated becomes redundant so it's removed. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/c6193db6fe630797110b0d3ff06c125d093b834c.1576741281.git.rdna@fb.com	2019-12-19 21:22:25 -08:00
Alexei Starovoitov	c92bbaa0fd	Merge branch 'simplify-do_redirect' Björn Töpel says: ==================== This series aims to simplify the XDP maps and xdp_do_redirect_map()/xdp_do_flush_map(), and to crank out some more performance from XDP_REDIRECT scenarios. The first part of the series simplifies all XDP_REDIRECT capable maps, so that __XXX_flush_map() does not require the map parameter, by moving the flush list from the map to global scope. This results in that the map_to_flush member can be removed from struct bpf_redirect_info, and its corresponding logic. Simpler code, and more performance due to that checks/code per-packet is moved to flush. Pre-series performance: $ sudo taskset -c 22 ./xdpsock -i enp134s0f0 -q 20 -n 1 -r -z sock0@enp134s0f0:20 rxdrop xdp-drv pps pkts 1.00 rx 20,797,350 230,942,399 tx 0 0 $ sudo ./xdp_redirect_cpu --dev enp134s0f0 --cpu 22 xdp_cpu_map0 Running XDP/eBPF prog_name:xdp_cpu_map5_lb_hash_ip_pairs XDP-cpumap CPU:to pps drop-pps extra-info XDP-RX 20 7723038 0 0 XDP-RX total 7723038 0 cpumap_kthread total 0 0 0 redirect_err total 0 0 xdp_exception total 0 0 Post-series performance: $ sudo taskset -c 22 ./xdpsock -i enp134s0f0 -q 20 -n 1 -r -z sock0@enp134s0f0:20 rxdrop xdp-drv pps pkts 1.00 rx 21,524,979 86,835,327 tx 0 0 $ sudo ./xdp_redirect_cpu --dev enp134s0f0 --cpu 22 xdp_cpu_map0 Running XDP/eBPF prog_name:xdp_cpu_map5_lb_hash_ip_pairs XDP-cpumap CPU:to pps drop-pps extra-info XDP-RX 20 7840124 0 0 XDP-RX total 7840124 0 cpumap_kthread total 0 0 0 redirect_err total 0 0 xdp_exception total 0 0 Results: +3.5% and +1.5% for the ubenchmarks. v1->v2 [1]: * Removed 'unused-variable' compiler warning (Jakub) [1] https://lore.kernel.org/bpf/20191218105400.2895-1-bjorn.topel@gmail.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-12-19 21:10:27 -08:00
Björn Töpel	1170beaa3f	xdp: Simplify __bpf_tx_xdp_map() The explicit error checking is not needed. Simply return the error instead. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-9-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	332f22a60e	xdp: Remove map_to_flush and map swap detection Now that all XDP maps that can be used with bpf_redirect_map() tracks entries to be flushed in a global fashion, there is not need to track that the map has changed and flush from xdp_do_generic_map() anymore. All entries will be flushed in xdp_do_flush_map(). This means that the map_to_flush can be removed, and the corresponding checks. Moving the flush logic to one place, xdp_do_flush_map(), give a bulking behavior and performance boost. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-8-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	cdfafe98ca	xdp: Make cpumap flush_list common for all map instances The cpumap flush list is used to track entries that need to flushed from via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all devmaps, which simplifies __cpu_map_flush() and cpu_map_alloc(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-7-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	96360004b8	xdp: Make devmap flush_list common for all map instances The devmap flush list is used to track entries that need to flushed from via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all devmaps, which simplifies __dev_map_flush() and dev_map_init_map(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-6-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	e312b9e706	xsk: Make xskmap flush_list common for all map instances The xskmap flush list is used to track entries that need to flushed from via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all xskmaps, which simplifies __xsk_map_flush() and xsk_map_alloc(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	fb5aacdf36	xdp: Fix graze->grace type-o in cpumap comments Simple spelling fix. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-4-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	4bc188c7f2	xdp: Simplify cpumap cleanup After the RCU flavor consolidation [1], call_rcu() and synchronize_rcu() waits for preempt-disable regions (NAPI) in addition to the read-side critical sections. As a result of this, the cleanup code in cpumap can be simplified * There is no longer a need to flush in __cpu_map_entry_free, since we know that this has been done when the call_rcu() callback is triggered. * When freeing the map, there is no need to explicitly wait for a flush. It's guaranteed to be done after the synchronize_rcu() call in cpu_map_free(). [1] https://lwn.net/Articles/777036/ Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-3-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Björn Töpel	0536b85239	xdp: Simplify devmap cleanup After the RCU flavor consolidation [1], call_rcu() and synchronize_rcu() waits for preempt-disable regions (NAPI) in addition to the read-side critical sections. As a result of this, the cleanup code in devmap can be simplified * There is no longer a need to flush in __dev_map_entry_free, since we know that this has been done when the call_rcu() callback is triggered. * When freeing the map, there is no need to explicitly wait for a flush. It's guaranteed to be done after the synchronize_rcu() call in dev_map_free(). The rcu_barrier() is still needed, so that the map is not freed prior the elements. [1] https://lwn.net/Articles/777036/ Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-2-bjorn.topel@gmail.com	2019-12-19 21:09:43 -08:00
Aditya Pakki	5bf2fc1f9c	bpf: Remove unnecessary assertion on fp_old The two callers of bpf_prog_realloc - bpf_patch_insn_single and bpf_migrate_filter dereference the struct fp_old, before passing it to the function. Thus assertion to check fp_old is unnecessary and can be removed. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191219175735.19231-1-pakki001@umn.edu	2019-12-19 22:24:15 +01:00
Andrii Nakryiko	7745ff9842	libbpf: Fix another __u64 printf warning Fix yet another printf warning for %llu specifier on ppc64le. This time size_t casting won't work, so cast to verbose `unsigned long long`. Fixes: `166750bc1d` ("libbpf: Support libbpf-provided extern variables") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191219052103.3515-1-andriin@fb.com	2019-12-19 16:47:56 +01:00
Toke Høiland-Jørgensen	b5c7d0d0f7	libbpf: Fix printing of ulimit value Naresh pointed out that libbpf builds fail on 32-bit architectures because rlimit.rlim_cur is defined as 'unsigned long long' on those architectures. Fix this by using %zu in printf and casting to size_t. Fixes: `dc3a2d2547` ("libbpf: Print hint about ulimit when getting permission denied error") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191219090236.905059-1-toke@redhat.com	2019-12-19 16:25:52 +01:00
Alexei Starovoitov	580205dd4f	selftests/bpf: Fix test_attach_probe Fix two issues in test_attach_probe: 1. it was not able to parse /proc/self/maps beyond the first line, since %s means parse string until white space. 2. offset has to be accounted for otherwise uprobed address is incorrect. Fixes: `1e8611bbdf` ("selftests/bpf: add kprobe/uprobe selftests") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191219020442.1922617-1-ast@kernel.org	2019-12-19 16:14:08 +01:00
Toke Høiland-Jørgensen	12dd14b230	libbpf: Add missing newline in opts validation macro The error log output in the opts validation macro was missing a newline. Fixes: `2ce8450ef5` ("libbpf: add bpf_object__open_{file, mem} w/ extensible opts") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191219120714.928380-1-toke@redhat.com	2019-12-19 16:08:46 +01:00
Daniel Borkmann	7800a3d54a	Merge branch 'bpf-riscv-jit-improvements' Björn Töpel says: ==================== This series contain one non-critical fix, support for far jumps, and some optimizations for the BPF JIT. Previously, the JIT only supported 12b branch targets for conditional branches, and 21b for unconditional branches. Starting with this series, 32b branching is supported. As part of supporting far jumps, branch relaxation was introduced. The idea is to start with a pessimistic jump (e.g. auipc/jalr) and for each pass the JIT will have an opportunity to pick a better instruction (e.g. jal) and shrink the image. Instead of two passes, the JIT requires more passes. It typically converges after 3 passes. The optimizations mentioned in the subject are for calls and tail calls. In the tail call generation we can save one instruction by using the offset in jalr. Calls are optimized by doing (auipc)/jal(r) relative jumps instead of loading the entire absolute address and doing jalr. This required that the JIT image allocator was made RISC-V specific, so we can ensure that the JIT image and the kernel text are in range (32b). The last two patches of the series is not critical to the series, but are two UAPI build issues for BPF events. A closer look from the RV-folks would be much appreciated. The test_bpf.ko module, selftests/bpf/test_verifier and selftests/seccomp/seccomp_bpf pass all tests. RISC-V is still missing proper kprobe and tracepoint support, so a lot of BPF selftests cannot be run. v1->v2: [1] * Removed unused function parameter from emit_branch() * Added patch to support far branch in tail call emit [1] https://lore.kernel.org/bpf/20191209173136.29615-1-bjorn.topel@gmail.com/ ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-12-19 16:04:18 +01:00
Björn Töpel	34bfc10a6e	riscv, perf: Add arch specific perf_arch_bpf_user_pt_regs RISC-V was missing a proper perf_arch_bpf_user_pt_regs macro for CONFIG_PERF_EVENT builds. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-10-bjorn.topel@gmail.com	2019-12-19 16:03:31 +01:00
Björn Töpel	eb9928bed0	riscv, bpf: Add missing uapi header for BPF_PROG_TYPE_PERF_EVENT programs Add missing uapi header the BPF_PROG_TYPE_PERF_EVENT programs by exporting struct user_regs_struct instead of struct pt_regs which is in-kernel only. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-9-bjorn.topel@gmail.com	2019-12-19 16:03:31 +01:00
Björn Töpel	e368b64f8b	riscv, bpf: Optimize calls Instead of using emit_imm() and emit_jalr() which can expand to six instructions, start using jal or auipc+jalr. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-8-bjorn.topel@gmail.com	2019-12-19 16:03:31 +01:00
Björn Töpel	7f3631e88e	riscv, bpf: Provide RISC-V specific JIT image alloc/free This commit makes sure that the JIT images is kept close to the kernel text, so BPF calls can use relative calling with auipc/jalr or jal instead of loading the full 64-bit address and jalr. The BPF JIT image region is 128 MB before the kernel text. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-7-bjorn.topel@gmail.com	2019-12-19 16:03:31 +01:00
Björn Töpel	fe8322b866	riscv, bpf: Optimize BPF tail calls Remove one addi, and instead use the offset part of jalr. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-6-bjorn.topel@gmail.com	2019-12-19 16:03:31 +01:00
Björn Töpel	33203c02f2	riscv, bpf: Add support for far jumps and exits This commit add support for far (offset > 21b) jumps and exits. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Luke Nelson <lukenels@cs.washington.edu> Link: https://lore.kernel.org/bpf/20191216091343.23260-5-bjorn.topel@gmail.com	2019-12-19 16:03:30 +01:00
Björn Töpel	29d92edd9e	riscv, bpf: Add support for far branching when emitting tail call Start use the emit_branch() function in the tail call emitter in order to support far branching. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-4-bjorn.topel@gmail.com	2019-12-19 16:03:30 +01:00
Björn Töpel	7d1ef13fea	riscv, bpf: Add support for far branching This commit adds branch relaxation to the BPF JIT, and with that support for far (offset greater than 12b) branching. The branch relaxation requires more than two passes to converge. For most programs it is three passes, but for larger programs it can be more. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Luke Nelson <lukenels@cs.washington.edu> Link: https://lore.kernel.org/bpf/20191216091343.23260-3-bjorn.topel@gmail.com	2019-12-19 16:03:30 +01:00
Björn Töpel	f1003b787c	riscv, bpf: Fix broken BPF tail calls The BPF JIT incorrectly clobbered the a0 register, and did not flag usage of s5 register when BPF stack was being used. Fixes: `2353ecc6f9` ("bpf, riscv: add BPF JIT for RV64G") Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191216091343.23260-2-bjorn.topel@gmail.com	2019-12-19 16:03:30 +01:00
Alexei Starovoitov	a352a82496	Merge branch 'libbpf-extern-followups' Andrii Nakryiko says: ==================== Based on latest feedback and discussions, this patch set implements the following changes: - Kconfig-provided externs have to be in .kconfig section, for which bpf_helpers.h provides convenient __kconfig macro (Daniel); - instead of allowing to override Kconfig file path, switch this to ability to extend and override system Kconfig with user-provided custom values (Alexei); - BTF is required when externs are used. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-12-18 17:33:52 -08:00
Andrii Nakryiko	630628cb7d	libbpf: BTF is required when externs are present BTF is required to get type information about extern variables. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191219002837.3074619-4-andriin@fb.com	2019-12-18 17:33:36 -08:00

1 2 3 4 5 ...

887017 Commits All Branches Search

887017 Commits

All Branches