The writer acquires dev_base_lock with disabled bottom halves.
The reader can acquire dev_base_lock without disabling bottom halves
because there is no writer in softirq context.
On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves (as in write_lock_bh()) and somewhere else with
enabled bottom halves (as by read_lock() in netstat_show()) followed by
disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
-> spin_lock_bh()). This is the reverse locking order.
All read_lock() invocation are from sysfs callback which are not invoked
from softirq context. Therefore there is no need to disable bottom
halves while acquiring a write lock.
Acquire the write lock of dev_base_lock without disabling bottom halves.
Reported-by: Pei Zhang <pezhang@redhat.com>
Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
A CAP_NET_RAW capable process can already spoof (on transmit) anything
it desires via raw packet sockets... There is no good reason to not
allow it to also be able to play routing tricks on packets from its
own normal sockets.
There is a desire to be able to use SO_MARK for routing table selection
(via ip rule fwmark) from within a user process without having to run
it as root. Granting it CAP_NET_RAW is much less dangerous than
CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change,
while CAP_NET_ADMIN does - by for example allowing the reconfiguration
of the routing tables and/or bringing up/down devices).
Let's keep CAP_NET_ADMIN for persistent state changes,
while using CAP_NET_RAW for non-configuration related stuff.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CAP_NET_ADMIN is and should continue to be about configuring the
system as a whole, not about configuring per-socket or per-packet
parameters.
Sending and receiving raw packets is what CAP_NET_RAW is all about.
It can already send packets with any VLAN tag, and any IPv4 TOS
mark, and any IPv6 TCLASS mark, simply by virtue of building
such a raw packet. Not to mention using any protocol and source/
/destination ip address/port tuple.
These are the fields that networking gear uses to prioritize packets.
Hence, a CAP_NET_RAW process is already capable of affecting traffic
prioritization after it hits the wire. This change makes it capable
of affecting traffic prioritization even in the host at the nic and
before that in the queueing disciplines (provided skb->priority is
actually being used for prioritization, and not the TOS/TCLASS field)
Hence it makes sense to allow a CAP_NET_RAW process to set the
priority of sockets and thus packets it sends.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
.ndo_change_proto_down was added seemingly to enable out-of-tree
implementations. Over 2.5yrs later we still have no real users
upstream. Hardwire the generic implementation for now, we can
revert once real users materialize. (rocker is a test vehicle,
not a user.)
We need to drop the optimization on the sysfs side, because
unlike ndos priv_flags will be changed at runtime, so we'd
need READ_ONCE/WRITE_ONCE everywhere..
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new device generic parameter to enable and disable
iWARP functionality on a multi-protocol RDMA device.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Leszek Kaliszczuk <leszek.kaliszczuk@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
Replace the existing empty member position markers "headers_start" and
"headers_end" with a struct_group(). This will allow memcpy() and sizeof()
to more easily reason about sizes, and improve readability.
"pahole" shows no size nor member offset changes to struct sk_buff.
"objdump -d" shows no object code changes (outside of WARNs affected by
source line number changes).
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/*
Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for using the struct_group() macro in struct sk_buff,
move the conditional preprocessor directives out of the region of struct
sk_buff that will be enclosed by struct_group(). While GCC and Clang are
happy with conditional preprocessor directives here, sparse is not, even
under -Wno-directive-within-macro[1], as would be seen under a C=1 build:
net/core/filter.c: note: in included file (through include/linux/netlink.h, include/linux/sock_diag.h):
./include/linux/skbuff.h:820:1: warning: directive in macro's argument list
./include/linux/skbuff.h:822:1: warning: directive in macro's argument list
./include/linux/skbuff.h:846:1: warning: directive in macro's argument list
./include/linux/skbuff.h:848:1: warning: directive in macro's argument list
Additionally remove empty macro argument definitions and usage.
"objdump -d" shows no object code differences.
[1] https://www.spinics.net/lists/linux-sparse/msg10857.html
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IPv6 module gets initialized, but it's hitting an error in inet6_init()
where it then needs to undo all the prior initialization work, it also might
do a call to ndisc_cleanup() which then calls neigh_table_clear(). In there
is a missing timer cancellation of the table's managed_work item.
The kernel test robot explicitly triggered this error path and caused a UAF
crash similar to the below:
[...]
[ 28.833183][ C0] BUG: unable to handle page fault for address: f7a43288
[ 28.833973][ C0] #PF: supervisor write access in kernel mode
[ 28.834660][ C0] #PF: error_code(0x0002) - not-present page
[ 28.835319][ C0] *pde = 06b2c067 *pte = 00000000
[ 28.835853][ C0] Oops: 0002 [#1] PREEMPT
[ 28.836367][ C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7
[ 28.837293][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[ 28.838338][ C0] EIP: __run_timers.constprop.0+0x82/0x440
[...]
[ 28.845607][ C0] Call Trace:
[ 28.845942][ C0] <SOFTIRQ>
[ 28.846333][ C0] ? check_preemption_disabled.isra.0+0x2a/0x80
[ 28.846975][ C0] ? __this_cpu_preempt_check+0x8/0xa
[ 28.847570][ C0] run_timer_softirq+0xd/0x40
[ 28.848050][ C0] __do_softirq+0xf5/0x576
[ 28.848547][ C0] ? __softirqentry_text_start+0x10/0x10
[ 28.849127][ C0] do_softirq_own_stack+0x2b/0x40
[ 28.849749][ C0] </SOFTIRQ>
[ 28.850087][ C0] irq_exit_rcu+0x7d/0xc0
[ 28.850587][ C0] common_interrupt+0x2a/0x40
[ 28.851068][ C0] asm_common_interrupt+0x119/0x120
[...]
Note that IPv6 module cannot be unloaded as per 8ce4406103 ("ipv6: do not
allow ipv6 module to be removed") hence this can only be seen during module
initialization error. Tested with kernel test robot's reproducer.
Fixes: 7482e3841d ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'mask' bitmap is local to this function. So the non-atomic
'__set_bit()' can be used to save a few cycles.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add netif_set_gso_max_segs() helper.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since all netdev->dev_addr modifications go via dev_addr_mod()
we can put it on the list. When address is change remove it
and add it back.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev->dev_addr should only be modified via helpers,
but someone may be casting off the const. Add a runtime
check to catch abuses.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no module callers in-tree and it's hard to justify
why anyone would init or flush addresses of a netdev (note
the flush is more of a destructor, it frees netdev->dev_addr).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. We converted all users to make modifications via appropriate
helpers, make netdev->dev_addr const.
The update helpers need to upcast from the buffer to
struct netdev_hw_addr.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a sock is added to a sock map we evaluate what proto op hooks need to
be used. However, when the program is removed from the sock map we have not
been evaluating if that changes the required program layout.
Before the patch listed in the 'fixes' tag this was not causing failures
because the base program set handles all cases. Specifically, the case with
a stream parser and the case with out a stream parser are both handled. With
the fix below we identified a race when running with a proto op that attempts
to read skbs off both the stream parser and the skb->receive_queue. Namely,
that a race existed where when the stream parser is empty checking the
skb->receive_queue from recvmsg at the precies moment when the parser is
paused and the receive_queue is not empty could result in skipping the stream
parser. This may break a RX policy depending on the parser to run.
The fix tag then loads a specific proto ops that resolved this race. But, we
missed removing that proto ops recv hook when the sock is removed from the
sockmap. The result is the stream parser is stopped so no more skbs will be
aggregated there, but the hook and BPF program continues to be attached on
the psock. User space will then get an EBUSY when trying to read the socket
because the recvmsg() handler is now waiting on a stopped stream parser.
To fix we rerun the proto ops init() function which will look at the new set
of progs attached to the psock and rest the proto ops hook to the correct
handlers. And in the above case where we remove the sock from the sock map
the RX prog will no longer be listed so the proto ops is removed.
Fixes: c5d2177a72 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
When a TCP socket is added to a sock map we look at the programs attached
to the map to determine what proto op hooks need to be changed. Before
the patch in the 'fixes' tag there were only two categories -- the empty
set of programs or a TX policy. In any case the base set handled the
receive case.
After the fix we have an optimized program for receive that closes a small,
but possible, race on receive. This program is loaded only when the map the
psock is being added to includes a RX policy. Otherwise, the race is not
possible so we don't need to handle the race condition.
In order for the call to sk_psock_init() to correctly evaluate the above
conditions all progs need to be set in the psock before the call. However,
in the current code this is not the case. We end up evaluating the
requirements on the old prog state. If your psock is attached to multiple
maps -- for example a tx map and rx map -- then the second update would pull
in the correct maps. But, the other pattern with a single rx enabled map
the correct receive hooks are not used. The result is the race fixed by the
patch in the fixes tag below may still be seen in this case.
To fix we simply set all psock->progs before doing the call into
sock_map_init(). With this the init() call gets the full list of programs
and chooses the correct proto ops on the first iteration instead of
requiring the second update to pull them in. This fixes the race case when
only a single map is used.
Fixes: c5d2177a72 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com
The mlxsw driver calls to various devlink flash routines even before
users can get any access to the devlink instance itself. For example,
mlxsw_core_fw_rev_validate() one of such functions.
__mlxsw_core_bus_device_register
-> mlxsw_core_fw_rev_validate
-> mlxsw_core_fw_flash
-> mlxfw_firmware_flash
-> mlxfw_status_notify
-> devlink_flash_update_status_notify
-> __devlink_flash_update_notify
-> WARN_ON(...)
It causes to the WARN_ON to trigger warning about devlink not registered.
Fixes: cf53021740 ("devlink: Notify users when objects are accessible")
Reported-by: Danielle Ratson <danieller@nvidia.com>
Tested-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d00e60ee54.
As reported by Guillaume in [1]:
Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT
in 32-bit systems, which breaks the bootup proceess when a
ethernet driver is using page pool with PP_FLAG_DMA_MAP flag.
As we were hoping we had no active consumers for such system
when we removed the dma mapping support, and LPAE seems like
a common feature for 32 bits system, so revert it.
1. https://www.spinics.net/lists/netdev/msg779890.html
Fixes: d00e60ee54 ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Tested-by: "kernelci.org bot" <bot@kernelci.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tx_timeout_show() assumed dev_watchdog() would stop all
the queues, to fetch queue->trans_timeout under protection
of the queue->_xmit_lock.
As we want to no longer disrupt transmits, we use an
atomic_long_t instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: david decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
siphash keys use 16 bytes.
Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2021-11-16
We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).
The main changes are:
1) Fix pruning regression where verifier went overly conservative rejecting
previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.
2) Fix verifier TOCTOU bug when using read-only map's values as constant
scalars during verification, from Daniel Borkmann.
3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.
4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.
5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
programs due to deadlock possibilities, from Dmitrii Banshchikov.
6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.
7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.
8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
udp: Validate checksum in udp_read_sock()
bpf: Fix toctou on read-only map's constant scalar tracking
samples/bpf: Fix build error due to -isystem removal
selftests/bpf: Add tests for restricted helpers
bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
libbpf: Perform map fd cleanup for gen_loader in case of error
samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
tools/runqslower: Fix cross-build
samples/bpf: Fix summary per-sec stats in xdp_sample_user
selftests/bpf: Check map in map pruning
bpf: Fix inner map state pruning regression.
xsk: Fix crash on double free in buffer pool
====================
Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net->core.sock_inuse is a per cpu variable (int),
while net->core.prot_inuse is another per cpu variable
of 64 integers.
per cpu allocator tend to place them in very different places.
Grouping them together makes sense, since it makes
updates potentially faster, if hitting the same
cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MPTCP hard codes it, let us instead provide this helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_prot_inuse_add() is very small, we can inline it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.
gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper is used once, no need to keep it in fat net/core/skbuff.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We were only using one bit, and we can replace it by sk_is_tcp()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move sk_is_tcp() to include/net/sock.h and use it where we can.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-11-15
We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).
The main changes are:
1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.
2) Big batch of libbpf improvements including various fixes, future proofing APIs,
and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.
3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.
4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
ensure exception handling still works, from Russell King and Alan Maguire.
5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
file such as shared library, from Song Liu.
6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
updating documentation and bash-completion among others, from Quentin Monnet.
7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.
8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.
9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpftool: Use libbpf_get_error() to check error
bpftool: Fix mixed indentation in documentation
bpftool: Update the lists of names for maps and prog-attach types
bpftool: Fix indent in option lists in the documentation
bpftool: Remove inclusion of utilities.mak from Makefiles
bpftool: Fix memory leak in prog_dump()
selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
selftests/bpf: Fix an unused-but-set-variable compiler warning
bpf: Introduce btf_tracing_ids
bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
bpftool: Enable libbpf's strict mode by default
docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
selftests/bpf: Clarify llvm dependency with btf_tag selftest
selftests/bpf: Add a C test for btf_type_tag
selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
selftests/bpf: Test libbpf API function btf__add_type_tag()
bpftool: Support BTF_KIND_TYPE_TAG
libbpf: Support BTF_KIND_TYPE_TAG
...
====================
Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_clone_lock() needs to call sock_inuse_add(1) before entering the
sk_free_unlock_clone() error path, for __sk_free() from sk_free() from
sk_free_unlock_clone() calls sock_inuse_add(-1).
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 648845ab7e ("sock: Move the socket inuse to namespace.")
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 719c571970 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.
Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an uninitialized variable, rather than straight to
another round of the state check.
Error path looks like:
napi_disable():
unsigned long val, new; /* new is uninitialized */
do {
val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
NAPIF_STATE_SCHED is set */
if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
usleep_range(20, 200);
continue; /* go straight to the condition check */
}
new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
writes garbage */
napi_enable():
do {
val = READ_ONCE(n->state);
BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>
while the typical BUG splat is like:
[ 172.652461] ------------[ cut here ]------------
[ 172.652462] kernel BUG at net/core/dev.c:6937!
[ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42
[ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[ 172.782403] Call Trace:
[ 172.784857] <TASK>
[ 172.786963] ice_up_complete+0x6f/0x210 [ice]
[ 172.791349] ice_xdp+0x136/0x320 [ice]
[ 172.795108] ? ice_change_mtu+0x180/0x180 [ice]
[ 172.799648] dev_xdp_install+0x61/0xe0
[ 172.803401] dev_xdp_attach+0x1e0/0x550
[ 172.807240] dev_change_xdp_fd+0x1e6/0x220
[ 172.811338] do_setlink+0xee8/0x1010
[ 172.814917] rtnl_setlink+0xe5/0x170
[ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[ 172.823732] ? security_capable+0x36/0x50
< snip >
Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.
From v1 [0]:
- just use a for-loop to simplify both the fix and the existing
code (Eric).
[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
Fixes: 719c571970 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It may be helpful to have access to the ifindex during bpf socket
lookup. An example may be to scope certain socket lookup logic to
specific interfaces, i.e. an interface may be made exempt from custom
lookup code.
Add the ifindex of the arriving connection to the bpf_sk_lookup API.
Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
The current conversion of skb->data_end reads like this:
; data_end = (void*)(long)skb->data_end;
559: (79) r1 = *(u64 *)(r2 +200) ; r1 = skb->data
560: (61) r11 = *(u32 *)(r2 +112) ; r11 = skb->len
561: (0f) r1 += r11
562: (61) r11 = *(u32 *)(r2 +116)
563: (1f) r1 -= r11
But similar to the case in 84f44df664 ("bpf: sock_ops sk access may stomp
registers when dst_reg = src_reg"), the code will read an incorrect skb->len
when src == dst. In this case we end up generating this xlated code:
; data_end = (void*)(long)skb->data_end;
559: (79) r1 = *(u64 *)(r1 +200) ; r1 = skb->data
560: (61) r11 = *(u32 *)(r1 +112) ; r11 = (skb->data)->len
561: (0f) r1 += r11
562: (61) r11 = *(u32 *)(r1 +116)
563: (1f) r1 -= r11
... where line 560 is the reading 4B of (skb->data + 112) instead of the
intended skb->len Here the skb pointer in r1 gets set to skb->data and the
later deref for skb->len ends up following skb->data instead of skb.
This fixes the issue similarly to the patch mentioned above by creating an
additional temporary variable and using to store the register when dst_reg =
src_reg. We name the variable bpf_temp_reg and place it in the cb context for
sk_skb. Then we restore from the temp to ensure nothing is lost.
Fixes: 16137b09a6 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com
Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
progress, e.g. offset and length of the skb. First this is poorly named and
inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
this layer.
But, more importantly strparser is using the following to access its metadata.
(struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))
Where _strp_msg is defined as:
struct _strp_msg {
struct strp_msg strp; /* 0 8 */
int accum_len; /* 8 4 */
/* size: 12, cachelines: 1, members: 2 */
/* last cacheline: 12 bytes */
};
So we use 12 bytes of ->data[] in struct. However in BPF code running parser
and verdict the user has read capabilities into the data[] array as well. Its
not too problematic, but we should not be exposing internal state to BPF
program. If its really needed then we can use the probe_read() APIs which allow
reading kernel memory. And I don't believe cb[] layer poses any API breakage by
moving this around because programs can't depend on cb[] across layers.
In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb struct
for sk_skb types called sk_skb_cb struct. Then we can use this consistently
in the strparser, sockmap space. Additionally we can start allowing ->cb[]
write access after this.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
In order to fix an issue with sockets in TCP sockmap redirect cases we plan
to allow CLOSE state sockets to exist in the sockmap. However, the check in
bpf_sk_lookup_assign() currently only invalidates sockets in the
TCP_ESTABLISHED case relying on the checks on sockmap insert to ensure we
never SOCK_CLOSE state sockets in the map.
To prepare for this change we flip the logic in bpf_sk_lookup_assign() to
explicitly test for the accepted cases. Namely, a tcp socket in TCP_LISTEN
or a udp socket in TCP_CLOSE state. This also makes the code more resilent
to future changes.
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-2-john.fastabend@gmail.com
Fix following coccicheck warning:
./net/core/devlink.c:69:6-10: WARNING use flexible-array member instead
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Link: https://lore.kernel.org/r/20211103121607.27490-1-guozhengkui@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Track skbs containing only zerocopy data and avoid charging them to
kernel memory to correctly account the memory utilization for
msg_zerocopy. All of the data in such skbs is held in user pages which
are already accounted to user. Before this change, they are charged
again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.
Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.
A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.
In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.
Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.
With this commit we don't see the warning we saw in previous commit
which resulted in commit 84882cf72c.
Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While commit 097b9146c0 ("net: fix up truesize of cloned
skb in skb_prepare_for_shift()") fixed immediate issues found
when KFENCE was enabled/tested, there are still similar issues,
when tcp_trim_head() hits KFENCE while the master skb
is cloned.
This happens under heavy networking TX workloads,
when the TX completion might be delayed after incoming ACK.
This patch fixes the WARNING in sk_stream_kill_queues
when sk->sk_mem_queued/sk->sk_forward_alloc are not zero.
Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-11-01
We've added 181 non-merge commits during the last 28 day(s) which contain
a total of 280 files changed, 11791 insertions(+), 5879 deletions(-).
The main changes are:
1) Fix bpf verifier propagation of 64-bit bounds, from Alexei.
2) Parallelize bpf test_progs, from Yucong and Andrii.
3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus.
4) Improve bpf selftests on s390, from Ilya.
5) bloomfilter bpf map type, from Joanne.
6) Big improvements to JIT tests especially on Mips, from Johan.
7) Support kernel module function calls from bpf, from Kumar.
8) Support typeless and weak ksym in light skeleton, from Kumar.
9) Disallow unprivileged bpf by default, from Pawan.
10) BTF_KIND_DECL_TAG support, from Yonghong.
11) Various bpftool cleanups, from Quentin.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits)
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
bpf: Factor out helpers for ctx access checking
bpf: Factor out a helper to prepare trampoline for struct_ops prog
selftests, bpf: Fix broken riscv build
riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h
tools, build: Add RISC-V to HOSTARCH parsing
riscv, bpf: Increase the maximum number of iterations
selftests, bpf: Add one test for sockmap with strparser
selftests, bpf: Fix test_txmsg_ingress_parser error
...
====================
Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Track skbs with only zerocopy data and avoid charging them to kernel
memory to correctly account the memory utilization for msg_zerocopy.
All of the data in such skbs is held in user pages which are already
accounted to user. Before this change, they are charged again in
kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.
Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.
A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.
In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.
Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.
Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If sockmap enable strparser, there are lose offset info in
sk_psock_skb_ingress(). If the length determined by parse_msg function is not
skb->len, the skb will be converted to sk_msg multiple times, and userspace
app will get the data multiple times.
Fix this by get the offset and length from strp_msg. And as Cong suggested,
add one bit in skb->_sk_redir to distinguish enable or disable strparser.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com
devlink compat code needs to drop rtnl_lock to take
devlink->lock to ensure correct lock ordering.
This is problematic because we're not strictly guaranteed
that the netdev will not disappear after we re-lock.
It may open a possibility of nested ->begin / ->complete
calls.
Instead of calling into devlink under rtnl_lock take
a ref on the devlink instance and make the call after
we've dropped rtnl_lock.
We (continue to) assume that netdevs have an implicit
reference on the devlink returned from ndo_get_devlink_port
Note that ndo_get_devlink_port will now get called
under rtnl_lock. That should be fine since none of
the drivers seem to be taking serious locks inside
ndo_get_devlink_port.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow those who hold implicit reference on a devlink instance
to try to take a full ref on it. This will be used from netdev
code which has an implicit ref because of driver call ordering.
Note that after recent changes devlink_unregister() may happen
before netdev unregister, but devlink_free() should still happen
after, so we are safe to try, but we can't just refcount_inc()
and assume it's not zero.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't take the lock in net/core/dev_ioctl.c,
we'll have things to do outside rtnl_lock soon.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a packet of a new flow arrives in openvswitch kernel module, it dissects
the packet and passes the extracted flow key to ovs-vswtichd daemon. If hw-
offload configuration is enabled, the daemon creates a new TC flower entry to
bypass openvswitch kernel module for the flow (TC flower can also offload flows
to NICs but this time that does not matter).
In this processing flow, I found the following issue in cases of GRE/IPIP
packets.
When ovs_flow_key_extract() in openvswitch module parses a packet of a new
GRE (or IPIP) flow received on non-tunneling vports, it extracts information
of the outer IP header for ip_proto/src_ip/dst_ip match keys.
This means ovs-vswitchd creates a TC flower entry with IP protocol/addresses
match keys whose values are those of the outer IP header. OTOH, TC flower,
which uses flow_dissector (different parser from openvswitch module), extracts
information of the inner IP header.
The following flow is an example to describe the issue in more detail.
<----------- Outer IP -----------------> <---------- Inner IP ---------->
+----------+--------------+--------------+----------+----------+----------+
| ip_proto | src_ip | dst_ip | ip_proto | src_ip | dst_ip |
| 47 (GRE) | 192.168.10.1 | 192.168.10.2 | 6 (TCP) | 10.0.0.1 | 10.0.0.2 |
+----------+--------------+--------------+----------+----------+----------+
In this case, TC flower entry and extracted information are shown as below:
- ovs-vswitchd creates TC flower entry with:
- ip_proto: 47
- src_ip: 192.168.10.1
- dst_ip: 192.168.10.2
- TC flower extracts below for IP header matches:
- ip_proto: 6
- src_ip: 10.0.0.1
- dst_ip: 10.0.0.2
Thus, GRE or IPIP packets never match the TC flower entry, as each
dissector behaves differently.
IMHO, the behavior of TC flower (flow dissector) does not look correct,
as ip_proto/src_ip/dst_ip in TC flower match means the outermost IP
header information except for GRE/IPIP cases. This patch adds a new
flow_dissector flag FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP which skips
dissection of the encapsulated inner GRE/IPIP header in TC flower
classifier.
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink_alloc() and devlink_register() are both GPL.
A non-GPL module won't get far, so for consistency
we can make all symbols GPL without risking any real
life breakage.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds a new skb extension for MCTP, to represent a
request/response flow.
The intention is to use this in a later change to allow i2c controllers
to correctly configure a multiplexer over a flow.
Since we have a cleanup function in the core path (if an extension is
present), we'll need to make CONFIG_MCTP a bool, rather than a tristate.
Includes a fix for a build warning with clang:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce extra indirection from devlink_params_*() API. Such change
makes it clear that we can drop devlink->lock from these flows, because
everything is executed when the devlink is not registered yet.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't
equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is
redundant.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20211027013856.1866-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 22849b5ea5 as it
revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters
devlink objects during another devlink user triggered command.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 8bbeed4858 as it
revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters
devlink objects during another devlink user triggered command.
Fixes: 22849b5ea5 ("devlink: Remove not-executed trap policer notifications")
Reported-by: syzbot+93d5accfaefceedf43c1@syzkaller.appspotmail.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
LRO and HW-GRO are mutually exclusive, this commit adds this restriction
in netdev_fix_feature. HW-GRO is preferred, that means in case both
HW-GRO and LRO features are requested, LRO is cleared.
Signed-off-by: Ben Ben-ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Daniel Borkmann says:
====================
pull-request: bpf 2021-10-26
We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).
The main changes are:
1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.
2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.
3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.
4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.
5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.
6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.
7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix potential race in tail call compatibility check
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
selftests/bpf: Use recv_timeout() instead of retries
net: Implement ->sock_is_readable() for UDP and AF_UNIX
skmsg: Extract and reuse sk_msg_is_readable()
net: Rename ->stream_memory_read to ->sock_is_readable
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
bpf: Prevent increasing bpf_jit_limit above max
bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================
Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During a testing of an user-space application which transmits UDP
multicast datagrams and utilizes multicast routing to send the UDP
datagrams out of defined network interfaces, I've found a multicast
router does not fill-in UDP checksum into locally produced, looped-back
and forwarded UDP datagrams, if an original output NIC the datagrams
are sent to has UDP TX checksum offload enabled.
The datagrams are sent malformed out of the NIC the datagrams have been
forwarded to.
It is because:
1. If TX checksum offload is enabled on the output NIC, UDP checksum
is not calculated by kernel and is not filled into skb data.
2. dev_loopback_xmit(), which is called solely by
ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
unconditionally.
3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
CHECKSUM_COMPLETE"), the ip_summed value is preserved during
forwarding.
4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
a packet egress.
The minimum fix in dev_loopback_xmit():
1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
case when the original output NIC has TX checksum offload enabled.
The effects are:
a) If the forwarding destination interface supports TX checksum
offloading, the NIC driver is responsible to fill-in the
checksum.
b) If the forwarding destination interface does NOT support TX
checksum offloading, checksums are filled-in by kernel before
skb is submitted to the NIC driver.
c) For local delivery, checksum validation is skipped as in the
case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
means, for CHECKSUM_NONE, the behavior is unmodified and is there
to skip a looped-back packet local delivery checksum validation.
Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in net_ns_get_ownership() it may not be able to set uid or gid
if make_kuid or make_kgid returns an invalid value, and an uninit-value
issue can be triggered by this.
This patch is to fix it by initializing the uid and gid before calling
net_ns_get_ownership(), as it does in kobject_get_ownership()
Fixes: e6dee9f389 ("net-sysfs: add netdev_change_owner()")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
to set the queue count and offset for each TC. So the queue count
and offset for the TCs may be zero for a short period after dev->num_tc
has been set. If a TX packet is being transmitted at this time in the
code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
nonzero dev->num_tc but zero qcount for the TC. The while loop that
keeps looping while hash >= qcount will not end.
Fix it by checking the TC's qcount to be nonzero before using it.
Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameters are registered before devlink_register() and all the
notifications are delayed. This patch removes not-possible parameters
notifications along with addition of code annotation logic.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The trap logic is registered before devlink_register() and all the
notifications are delayed. This patch removes not-possible trap group
notifications along with addition of code annotation logic.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The trap policer logic is registered before devlink_register() and all the
notifications are delayed. This patch removes not-possible notifications
along with addition of code annotation logic.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The change of devlink_register() to be last devlink command together
with delayed notification logic made the publish API to be obsolete.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Christoph Paasch reports [1] about incorrect skb->truesize
after skb_expand_head() call in ip6_xmit.
This may happen because of two reasons:
- skb_set_owner_w() for newly cloned skb is called too early,
before pskb_expand_head() where truesize is adjusted for (!skb-sk) case.
- pskb_expand_head() does not adjust truesize in (skb->sk) case.
In this case sk->sk_wmem_alloc should be adjusted too.
[1] https://lkml.org/lkml/2021/8/20/1082
Fixes: f1260ff15a ("skbuff: introduce skb_expand_head()")
Fixes: 2d85a1b31d ("ipv6: ip6_finish_output2: set sk into newly allocated nskb")
Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/644330dd-477e-0462-83bf-9f514c41edd1@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The helper is used in tracing programs to cast a socket
pointer to a unix_sock pointer.
The return value could be NULL if the casting is illegal.
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211021134752.1223426-2-hengqi.chen@gmail.com
Although if_info_size is assigned, it has not been used. And the variable
should also be deleted.
The clang_analyzer complains as follows:
net/core/rtnetlink.c:3806: warning:
Although the value stored to 'if_info_size' is used in the enclosing
expression, the value is never actually read from 'if_info_size'.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the rework, the statistics code always adds up the byte and packet
value(s). On 32bit architectures a seqcount_t is used in
gnet_stats_basic_sync to ensure that the 64bit values are not modified
during the read since two 32bit loads are required. The usage of a
seqcount_t requires a lock to ensure that only one writer is active at a
time. This lock leads to disabled preemption during the update.
The lack of disabling preemption is now creating a warning as reported
by Naresh since the query done by gnet_stats_copy_basic() is in
preemptible context.
For ___gnet_stats_copy_basic() there is no need to disable preemption
since the update is performed on stack and can't be modified by another
writer. Instead of disabling preemption, to avoid the warning,
simply create a read function to just read the values and return as u64.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: 67c9e6270f ("net: sched: Protect Qdisc::bstats with u64_stats")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
While loading a driver and changing the number of queues, I noticed this
message in the kernel log:
"[253489.070080] Number of in use tx queues changed invalidating tc
mappings. Priority traffic classification disabled!"
But I had no idea what interface was being talked about because this
message used pr_warn().
After investigating, it appears we can use the netdev_* helpers already
defined to create predictably formatted messages, and that already handle
<unknown netdev> cases, in more of the messages in dev.c.
After this change, this message (and others) will look like this:
"[ 170.181093] ice 0000:3b:00.0 ens785f0: Number of in use tx queues
changed invalidating tc mappings. Priority traffic classification
disabled!"
One goal here was not to change the message significantly from the
original format so as to not break user's expectations, so I just
changed messages that used pr_* and generally started with %s ==
dev->name.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PCI core code in the pci_call_probe() has a path that doesn't hold
device_lock. It happens because the ->probe() is called through the
workqueue mechanism.
349 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
350 const struct pci_device_id *id)
351 {
352
....
377 if (cpu < nr_cpu_ids)
378 error = work_on_cpu(cpu, local_pci_probe, &ddi);
Luckily enough, the core still ensures that only single flow is executed,
so it safe to remove the assert checks that anyway were added for annotations
purposes.
Fixes: b88f7b1203 ("devlink: Annotate devlink API calls")
Reported-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric reported that the rate estimator reads statics from the softirq
which in turn triggers a warning introduced in the statistics rework.
The warning is too cautious. The updates happen in the softirq context
so reads from softirq are fine since the writes can not be preempted.
The updates/writes happen during qdisc_run() which ensures one writer
and the softirq context.
The remaining bad context for reading statistics remains in hard-IRQ
because it may preempt a writer.
Fixes: 29cbcd8582 ("net: sched: Remove Qdisc::running sequence counter")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS for net-next:
1) Add new run_estimation toggle to IPVS to stop the estimation_timer
logic, from Dust Li.
2) Relax superfluous dynset check on NFT_SET_TIMEOUT.
3) Add egress hook, from Lukas Wunner.
4) Nowadays, almost all hook functions in x_table land just call the hook
evaluation loop. Remove remaining hook wrappers from iptables and IPVS.
From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The Qdisc::running sequence counter has two uses:
1. Reliably reading qdisc's tc statistics while the qdisc is running
(a seqcount read/retry loop at gnet_stats_add_basic()).
2. As a flag, indicating whether the qdisc in question is running
(without any retry loops).
For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.
Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().
For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.
Now that the qdisc->running sequence counter is no longer used, remove
it.
Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only factor differentiating per-CPU bstats data type (struct
gnet_stats_basic_cpu) from the packed non-per-CPU one (struct
gnet_stats_basic_packed) was a u64_stats sync point inside the former.
The two data types are now equivalent: earlier commits added a u64_stats
sync point to the latter.
Combine both data types into "struct gnet_stats_basic_sync". This
eliminates redundancy and simplifies the bstats read/write APIs.
Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit
architectures, u64_stats sync points do not use sequence counter
protection.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Qdisc::running sequence counter, used to protect Qdisc::bstats reads
from parallel writes, is in the process of being removed. Qdisc::bstats
read/writes will synchronize using an internal u64_stats sync point
instead.
Modify all bstats writes to use _bstats_update(). This ensures that
the internal u64_stats sync point is always acquired and released as
appropriate.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The not-per-CPU variant of qdisc tc (traffic control) statistics,
Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running
sequence counter.
This sequence counter is used for reliably protecting bstats reads from
parallel writes. Meanwhile, the seqcount's write section covers a much
wider area than bstats update: qdisc_run_begin() => qdisc_run_end().
That read/write section asymmetry can lead to needless retries of the
read section. To prepare for removing the Qdisc::running sequence
counter altogether, introduce a u64_stats sync point inside bstats
instead.
Modify _bstats_update() to start/end the bstats u64_stats write
section.
For bisectability, and finer commits granularity, the bstats read
section is still protected with a Qdisc::running read/retry loop and
qdisc_run_begin/end() still starts/ends that seqcount write section.
Once all call sites are modified to use _bstats_update(), the
Qdisc::running seqcount will be removed and bstats read/retry loop will
be modified to utilize the internal u64_stats sync point.
Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the statistics "packets" vs. "bytes"
values getting out of sync on rare occasions. The individual values will
still be valid.
[bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.]
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gnet_stats_queue::qlen member is only used in the SMP-case.
qdisc_qstats_qlen_backlog() needs to add qdisc_qlen() to qstats.qlen to
have the same value as that provided by qdisc_qlen_sum().
gnet_stats_copy_queue() needs to overwritte the resulting qstats.qlen
field whith the caller submitted qlen value. It might be differ from the
submitted value.
Let both functions use gnet_stats_add_queue() and remove unused
__gnet_stats_copy_queue().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will replace __gnet_stats_copy_queue(). It reads all
arguments and adds them into the passed gnet_stats_queue argument.
In contrast to __gnet_stats_copy_queue() it also copies the qlen member.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
__gnet_stats_copy_basic() always assigns the value to the bstats
argument overwriting the previous value. The later added per-CPU version
always accumulated the values in the returning gnet_stats_basic_packed
argument.
Based on review there are five users of that function as of today:
- est_fetch_counters(), ___gnet_stats_copy_basic()
memsets() bstats to zero, single invocation.
- mq_dump(), mqprio_dump(), mqprio_dump_class_stats()
memsets() bstats to zero, multiple invocation but does not use the
function due to !qdisc_is_percpu_stats().
Add the values in __gnet_stats_copy_basic() instead overwriting. Rename
the function to gnet_stats_add_basic() to make it more obvious.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of netdev helper functions to improve code readability.
Replace 'dev->priv_flags & IFF_EBRIDGE' with netif_is_bridge_master(dev).
Signed-off-by: Kyungrok Chung <acadx0@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_stream_kill_queues() can be called on close when there are
still outstanding skbs to transmit. Those skbs may try to queue
notifications to the error queue (e.g. timestamps).
If sk_stream_kill_queues() purges the queue without taking
its lock the queue may get corrupted, and skbs leaked.
This shows up as a warning about an rmem leak:
WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x...
The leak is always a multiple of 0x300 bytes (the value is in
%rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of
an empty sk_buff. Indeed if we dump the socket state at the time
of the warning the sk_error_queue is often (but not always)
corrupted. The ->next pointer points back at the list head,
but not the ->prev pointer. Indeed we can find the leaked skb
by scanning the kernel memory for something that looks like
an skb with ->sk = socket in question, and ->truesize = 0x300.
The contents of ->cb[] of the skb confirms the suspicion that
it is indeed a timestamp notification (as generated in
__skb_complete_tx_timestamp()).
Removing purging of sk_error_queue should be okay, since
inet_sock_destruct() does it again once all socket refs
are gone. Eric suggests this may cause sockets that go
thru disconnect() to maintain notifications from the
previous incarnations of the socket, but that should be
okay since the race was there anyway, and disconnect()
is not exactly dependable.
Thanks to Jonathan Lemon and Omar Sandoval for help at various
stages of tracing the issue.
Fixes: cb9eff0978 ("net: new user space API for time stamping of incoming and outgoing packets")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the 32-bit arch with 64-bit DMA seems to rare those days,
and page pool might carry a lot of code and complexity for
systems that possibly.
So disable dma mapping support for such systems, if drivers
really want to work on such systems, they have to implement
their own DMA-mapping fallback tracking outside page_pool.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The combination of NUD_PERMANENT + NTF_MANAGED is not supported and does
not make sense either given the former indicates a static/fixed neighbor
entry whereas the latter a dynamically resolved one. While it is possible
to transition from one over to the other, we should however reject such
creation attempts.
Fixes: 7482e3841d ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of open-coding a check for invalid bits in NTF_EXT_MASK, we can just
use the NLA_POLICY_MASK() helper instead, and simplify NDA_FLAGS_EXT sanity
check this way.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, NDA_FLAGS_EXT flags allow a maximum of 24 bits to be used for
extended neighbor flags. These are eventually fed into neigh->flags by
shifting with NTF_EXT_SHIFT as per commit 2c611ad97a ("net, neigh:
Extend neigh->flags to 32 bit to allow for extensions").
If really ever needed in future, the full 32 bits from NDA_FLAGS_EXT can
be used, it would only require to move neigh->flags from u32 to u64 inside
the kernel.
Add a build-time assertion such that when extending the NTF_EXT_MASK with
new bits, we'll trigger an error once we surpass the 24th bit. This assumes
that no bit holes in new NTF_EXT_* flags will slip in from UAPI, but I
think this is reasonable to assume.
Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tools/testing/selftests/net/ioam6.sh
7b1700e009 ("selftests: net: modify IOAM tests for undef bits")
bf77b1400a ("selftests: net: Test for the IOAM encapsulation with IPv6")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>