We have 100+ syzbot reports about netns being dismantled too soon,
still unresolved as of today.
We think a missing get_net() or an extra put_net() is the root cause.
In order to find the bug(s), and be able to spot future ones,
this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers
to precisely pair all put_net() with corresponding get_net().
To use these helpers, each data structure owning a refcount
should also use a "netns_tracker" to pair the get and put.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Return status directly from function called.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
'more' is checked first. When !more is checked immediately after that,
it is always true. We should drop this check.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20211208024732.142541-5-sakiwit@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xfrm4_fill_dst() and xfrm6_fill_dst() build dst,
getting a device reference that will likely be released
by standard dst_release() code.
We have to track these references or risk a warning if
CONFIG_NET_DEV_REFCNT_TRACKER=y
Note to XFRM maintainers :
Error path in xfrm6_fill_dst() releases the reference,
but does not clear xdst->u.dst.dev, so I wonder
if this could lead to double dev_put() in some cases,
where a dst_release() _is_ called by the callers in their
error path.
This extra dev_put() was added in commit 84c4a9dfbf ("xfrm6:
release dev before returning error")
Fixes: 9038c32000 ("net: dst: add net device refcount tracking to dst_entry")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <amwang@redhat.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/20211207193203.2706158-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The majority of DSA drivers do not make use of the PCS support, and
thus operate in legacy mode. In order to preserve this behaviour in
future, we need to set the legacy_pre_march2020 flag so phylink knows
this may require the legacy calls.
There are some DSA drivers that do make use of PCS support, and these
will continue operating as before - legacy_pre_march2020 will not
prevent split-PCS support enabling the newer phylink behaviour.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When an IPv4 packet is received, the ip_rcv_core(...) sets the receiving
interface index into the IPv4 socket control block (v5.16-rc4,
net/ipv4/ip_input.c line 510):
IPCB(skb)->iif = skb->skb_iif;
If that IPv4 packet is meant to be encapsulated in an outer IPv6+SRH
header, the seg6_do_srh_encap(...) performs the required encapsulation.
In this case, the seg6_do_srh_encap function clears the IPv6 socket control
block (v5.16-rc4 net/ipv6/seg6_iptunnel.c line 163):
memset(IP6CB(skb), 0, sizeof(*IP6CB(skb)));
The memset(...) was introduced in commit ef489749aa ("ipv6: sr: clear
IP6CB(skb) on SRH ip4ip6 encapsulation") a long time ago (2019-01-29).
Since the IPv6 socket control block and the IPv4 socket control block share
the same memory area (skb->cb), the receiving interface index info is lost
(IP6CB(skb)->iif is set to zero).
As a side effect, that condition triggers a NULL pointer dereference if
commit 0857d6f8c7 ("ipv6: When forwarding count rx stats on the orig
netdev") is applied.
To fix that issue, we set the IP6CB(skb)->iif with the index of the
receiving interface once again.
Fixes: ef489749aa ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211208195409.12169-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The done() netlink callback nfc_genl_dump_ses_done() should check if
received argument is non-NULL, because its allocation could fail earlier
in dumpit() (nfc_genl_dump_ses()).
Fixes: ac22ac466a ("NFC: Add a GET_SE netlink API")
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211209081307.57337-1-krzysztof.kozlowski@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The max number of UDP gso segments is intended to cap to UDP_MAX_SEGMENTS,
this is checked in udp_send_skb():
if (skb->len > cork->gso_size * UDP_MAX_SEGMENTS) {
kfree_skb(skb);
return -EINVAL;
}
skb->len contains network and transport header len here, we should use
only data len instead.
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/900742e5-81fb-30dc-6e0b-375c6cdd7982@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
copy_user_offload() will actually push a struct struct xfrm_user_offload,
which is different than (struct xfrm_state *)->xso
(struct xfrm_state_offload)
Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Calling netdev_queue_update_kobjects is allowed during device
unregistration since commit 5c56580b74 ("net: Adjust TX queue kobjects
if number of queues changes during unregister"). But this is solely to
allow queue unregistrations. Any path attempting to add new queues after
a device started its unregistration should be fixed.
This patch adds a warning to detect such illegal use.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When updating Rx and Tx queue kobjects, the queue count should always be
updated to match the queue kobjects count. This was not done in the net
device unregistration path, fix it. Tracking all queue count updates
will allow in a following up patch to detect illegal updates.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix bogus compilter warning in nfnetlink_queue, from Florian Westphal.
2) Don't run conntrack on vrf with !dflt qdisc, from Nicolas Dichtel.
3) Fix nft_pipapo bucket load in AVX2 lookup routine for six 8-bit
groups, from Stefano Brivio.
4) Break rule evaluation on malformed TCP options.
5) Use socat instead of nc in selftests/netfilter/nft_zones_many.sh,
also from Florian
6) Fix KCSAN data-race in conntrack timeout updates, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: conntrack: annotate data-races around ct->timeout
selftests: netfilter: switch zone stress to socat
netfilter: nft_exthdr: break evaluation if setting TCP option fails
selftests: netfilter: Add correctness test for mac,net set type
nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups
vrf: don't run conntrack on vrf with !dflt qdisc
netfilter: nfnetlink_queue: silence bogus compiler warning
====================
Link: https://lore.kernel.org/r/20211209000847.102598-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf 2021-12-08
We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 29 files changed, 659 insertions(+), 80 deletions(-).
The main changes are:
1) Fix an off-by-two error in packet range markings and also add a batch of
new tests for coverage of these corner cases, from Maxim Mikityanskiy.
2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh.
3) Fix two functional regressions and a build warning related to BTF kfunc
for modules, from Kumar Kartikeya Dwivedi.
4) Fix outdated code and docs regarding BPF's migrate_disable() use on non-
PREEMPT_RT kernels, from Sebastian Andrzej Siewior.
5) Add missing includes in order to be able to detangle cgroup vs bpf header
dependencies, from Jakub Kicinski.
6) Fix regression in BPF sockmap tests caused by missing detachment of progs
from sockets when they are removed from the map, from John Fastabend.
7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF
dispatcher, from Björn Töpel.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add selftests to cover packet access corner cases
bpf: Fix the off-by-two error in range markings
treewide: Add missing includes masked by cgroup -> bpf dependency
tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets
bpf: Fix bpf_check_mod_kfunc_call for built-in modules
bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
mips, bpf: Fix reference to non-existing Kconfig symbol
bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.
Documentation/locking/locktypes: Update migrate_disable() bits.
bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
bpf, sockmap: Attach map progs to psock early for feature probes
bpf, x86: Fix "no previous prototype" warning
====================
Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We don't really need new switch API for these, and with new switches
which intend to add support for this feature, it will become cumbersome
to maintain.
The change consists in restructuring the two drivers that implement this
offload (sja1105 and mv88e6xxx) such that the offload is enabled and
disabled from the ->port_bridge_{join,leave} methods instead of the old
->port_bridge_tx_fwd_{,un}offload.
The only non-trivial change is that mv88e6xxx_map_virtual_bridge_to_pvt()
has been moved to avoid a forward declaration, and the
mv88e6xxx_reg_lock() calls from inside it have been removed, since
locking is now done from mv88e6xxx_port_bridge_{join,leave}.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a preparation patch for the removal of the DSA switch methods
->port_bridge_tx_fwd_offload() and ->port_bridge_tx_fwd_unoffload().
The plan is for the switch to report whether it offloads TX forwarding
directly as a response to the ->port_bridge_join() method.
This change deals with the noisy portion of converting all existing
function prototypes to take this new boolean pointer argument.
The bool is placed in the cross-chip notifier structure for bridge join,
and a reference to it is provided to drivers. In the next change, DSA
will then actually look at this value instead of calling
->port_bridge_tx_fwd_offload().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the static inline helpers from net/dsa/dsa_priv.h to
include/net/dsa.h, so that drivers can call functions such as
dsa_port_offloads_bridge_dev(), which will be necessary after the
transition to a more complex bridge structure.
More functions than are needed right now are being moved, but this is
done for uniformity.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the majority of dsa_port_bridge_dev_get() calls in drivers is
just to check whether a port is under the bridge device provided as
argument by the DSA API.
We'd like to change that DSA API so that a more complex structure is
provided as argument. To keep things more generic, and considering that
the new complex structure will be provided by value and not by
reference, direct comparisons between dp->bridge and the provided bridge
will be broken. The generic way to do the checking would simply be to
do something like dsa_port_offloads_bridge(dp, &bridge).
But there's a problem, we already have a function named that way, which
actually takes a bridge_dev net_device as argument. Rename it so that we
can use dsa_port_offloads_bridge for something else.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The location of the bridge device pointer and number is going to change.
It is not going to be kept individually per port, but in a common
structure allocated dynamically and which will have lockdep validation.
Create helpers to access these elements so that we have a migration path
to the new organization.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The service where DSA assigns a unique bridge number for each forwarding
domain is useful even for drivers which do not implement the TX
forwarding offload feature.
For example, drivers might use the dp->bridge_num for FDB isolation.
So rename ds->num_fwd_offloading_bridges to ds->max_num_bridges, and
calculate a unique bridge_num for all drivers that set this value.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I have seen too many bugs already due to the fact that we must encode an
invalid dp->bridge_num as a negative value, because the natural tendency
is to check that invalid value using (!dp->bridge_num). Latest example
can be seen in commit 1bec0f0506 ("net: dsa: fix bridge_num not
getting cleared after ports leaving the bridge").
Convert the existing users to assume that dp->bridge_num == 0 is the
encoding for invalid, and valid bridge numbers start from 1.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The VMADDR_CID_ANY flag used by a socket means that the socket isn't bound
to any specific CID. For example, a host vsock server may want to be bound
with VMADDR_CID_ANY, so that a guest vsock client can connect to the host
server with CID=VMADDR_CID_HOST (i.e. 2), and meanwhile, a host vsock
client can connect to the same local server with CID=VMADDR_CID_LOCAL
(i.e. 1).
The current implementation sets the destination socket's svm_cid to a
fixed CID value after the first client's connection, which isn't an
expected operation. For example, if the guest client first connects to the
host server, the server's svm_cid gets set to VMADDR_CID_HOST, then other
host clients won't be able to connect to the server anymore.
Reproduce steps:
1. Run the host server:
socat VSOCK-LISTEN:1234,fork -
2. Run a guest client to connect to the host server:
socat - VSOCK-CONNECT:2:1234
3. Run a host client to connect to the host server:
socat - VSOCK-CONNECT:1:1234
Without this patch, step 3. above fails to connect, and socat complains
"socat[1720] E connect(5, AF=40 cid:1 port:1234, 16): Connection
reset by peer".
With this patch, the above works well.
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20211126011823.1760-1-wei.w.wang@intel.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
When building under -Warray-bounds, the compiler is especially
conservative when faced with casts from a smaller object to a larger
object. While this has found many real bugs, there are some cases that
are currently false positives (like here). With this as one of the last
few instances of the warning in the kernel before -Warray-bounds can be
enabled globally, rearrange the functions so that there is a header-only
version of hvs_send_data(). Silences this warning:
net/vmw_vsock/hyperv_transport.c: In function 'hvs_shutdown_lock_held.constprop':
net/vmw_vsock/hyperv_transport.c:231:32: warning: array subscript 'struct hvs_send_buf[0]' is partly outside array bounds of 'struct vmpipe_proto_header[1]' [-Warray-bounds]
231 | send_buf->hdr.pkt_type = 1;
| ~~~~~~~~~~~~~~~~~~~~~~~^~~
net/vmw_vsock/hyperv_transport.c:465:36: note: while referencing 'hdr'
465 | struct vmpipe_proto_header hdr;
| ^~~
This change results in no executable instruction differences.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20211207063217.2591451-1-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device has an active watchdog timer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sixth byte of packet data has to be looked up in the sixth group,
not in the seventh one, even if we load the bucket data into ymm6
(and not ymm5, for convenience of tracking stalls).
Without this fix, matching on a MAC address as first field of a set,
if 8-bit groups are selected (due to a small set size) would fail,
that is, the given MAC address would never match.
Reported-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Cc: <stable@vger.kernel.org> # 5.6.x
Fixes: 7400b06396 ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-By: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
First, add cork and nodelay fields to the mptcp_sock structure
so they can be used in sync_socket_options(), and fill them on setsockopt
while holding the msk socket lock.
Then, on setsockopt set proper tcp_sk(ssk)->nonagle values for subflows
by calling __tcp_sock_set_cork() or __tcp_sock_set_nodelay() on the ssk
while holding the ssk socket lock.
tcp_push_pending_frames() will be invoked on the ssk if a cork was cleared
or nodelay was set. Also set MPTCP_PUSH_PENDING bit by calling
mptcp_check_and_set_pending(). This will lead to __mptcp_push_pending()
being called inside mptcp_release_cb() with new tcp_sk(ssk)->nonagle.
Also add getsockopt support for TCP_CORK and TCP_NODELAY.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Maxim Galaganov <max@internet.ru>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expose the mptcp_check_and_set_pending() function for use inside MPTCP
sockopt code. The next patch will call it when TCP_CORK is cleared or
TCP_NODELAY is set on the MPTCP socket in order to push pending data
from mptcp_release_cb().
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Maxim Galaganov <max@internet.ru>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expose __tcp_sock_set_cork() and __tcp_sock_set_nodelay() for use in
MPTCP setsockopt code -- namely for syncing MPTCP socket options with
subflows inside sync_socket_options() while already holding the subflow
socket lock.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Maxim Galaganov <max@internet.ru>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
earlier patch added IP_TOS setsockopt support, this allows to get
the value set by earlier setsockopt.
Extends mptcp_put_int_option to handle u8 input/output by
adding required cast.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
a non-zero 'id' is sufficient to identify MPTCP endpoints: allow changing
the value of 'backup' bit by simply specifying the endpoint id.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/158
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allows to query in-sequence data ready for read(), total bytes in
write queue and total bytes in write queue that have not yet been sent.
v2: remove unneeded READ_ONCE() (Paolo Abeni)
v3: check for new data unconditionally in SIOCINQ ioctl (Mat Martineau)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support the TCP_INQ setsockopt.
This is a boolean that tells recvmsg path to include the remaining
in-sequence bytes in the cmsg data.
v2: do not use CB(skb)->offset, increment map_seq instead (Paolo Abeni)
v3: adjust CB(skb)->map_seq when taking skb from ofo queue (Paolo Abeni)
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/224
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This introduces mgmt_alloc_skb and mgmt_send_event_skb which are
convenient when building MGMT events that have variable length as the
likes of skb_put_data can be used to insert portion directly on the skb
instead of having to first build an intermediate buffer just to be
copied over the skb.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This fixes compilation when CONFIG_BT_MSFTEXT is not set.
Fixes: 6b3d4c8fcf3f2 ("Bluetooth: hci_event: Use of a function table to handle HCI events")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This adds support for Set Privacy Mode when updating the resolving list
when HCI_CONN_FLAG_DEVICE_PRIVACY so the controller shall use Device
Mode for devices programmed in the resolving list, Device Mode is
actually required when the remote device are not able to use RPA as
otherwise the default mode is Network Privacy Mode in which only
allows RPAs thus the controller would filter out advertisement using
identity addresses for which there is an IRK.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This introduces HCI_CONN_FLAG_DEVICE_PRIVACY which can be used by
userspace to indicate to the controller to use Device Privacy Mode to a
specific device.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This reworks hci_conn_params flags to use bitmap_* helpers and add
support for setting the supported flags in hdev->conn_flags so it can
easily be accessed.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This make use of hci_dev_test_and_{set,clear}_flag instead of doing 2
operations in a row.
Fixes: cbbdfa6f33 ("Bluetooth: Enable controller RPA resolution using Experimental feature")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Some devices have a bug causing them to not work if they query
LE tx power on startup. Thus we add a quirk in order to not query it
and default min/max tx power values to HCI_TX_POWER_INVALID.
Signed-off-by: Aditya Garg <gargaditya08@live.com>
Reported-by: Orlando Chamberlain <redecorating@protonmail.com>
Tested-by: Orlando Chamberlain <redecorating@protonmail.com>
Link:
https://lore.kernel.org/r/4970a940-211b-25d6-edab-21a815313954@protonmail.com
Fixes: 7c395ea521 ("Bluetooth: Query LE tx power on startup")
Cc: stable@vger.kernel.org
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This change the use of switch statement to a function table which is
easier to extend.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This change the use of switch statement to a function table which is
easier to extend and can include min/max length of each command.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This change the use of switch statement to a function table which is
easier to extend and can include min/max length of each subevent.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This change the use of switch statement to a function table
which is easier to extend and can include min/max length of each HCI
event.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the LE Direct Advertising Report
events received have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the LE Extended Advertising Report
events received have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the LE Advertising Report events
received have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the LE Metaevents received have the
minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the Extended Inquiry Result events
received have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the Inquiry Result with RSSI events
received have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the Inquiry Result events received
have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the Number of Complete Packets events
received have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the Command Complete events received
have the minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This uses skb_pull_data to check the BR/EDR events received have the
minimum required length.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Like skb_pull but returns the original data pointer before pulling the
data after performing a check against sbk->len.
This allows to change code that does "struct foo *p = (void *)skb->data;"
which is hard to audit and error prone, to:
p = skb_pull_data(skb, sizeof(*p));
if (!p)
return;
Which is both safer and cleaner.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently, buffers are cleared when smc connections are created and
buffers are reused. This slows down the speed of establishing new
connections. In most cases, the applications want to establish
connections as quickly as possible.
This patch moves memset() from connection creation path to release and
buffer unuse path, this trades off between speed of establishing and
release.
Test environments:
- CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4
- socket sndbuf / rcvbuf: 16384 / 131072 bytes
- w/o first round, 5 rounds, avg, 100 conns batch per round
- smc_buf_create() use bpftrace kprobe, introduces extra latency
Latency benchmarks for smc_buf_create():
w/o patch : 19040.0 ns
w/ patch : 1932.6 ns
ratio : 10.2% (-89.8%)
Latency benchmarks for socket create and connect:
w/o patch : 143.3 us
w/ patch : 102.2 us
ratio : 71.3% (-28.7%)
The latency of establishing connections is reduced by 28.7%.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20211203113331.2818873-1-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While preparing my patch series adding netns refcount tracking,
I spotted bugs in devlink_nl_cmd_reload()
Some error paths forgot to release a refcount on a netns.
To fix this, we can reduce the scope of get_net()/put_net()
section around the call to devlink_reload().
Fixes: ccdf07219d ("devlink: Add reload action option to devlink reload command")
Fixes: dc64cc7c63 ("devlink: Add devlink reload limit option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Moshe Shemesh <moshe@mellanox.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20211205192822.1741045-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a short period between a net device starts to be unregistered
and when it is actually gone. In that time frame ethtool operations
could still be performed, which might end up in unwanted or undefined
behaviours[1].
Do not allow ethtool operations after a net device starts its
unregistration. This patch targets the netlink part as the ioctl one
isn't affected: the reference to the net device is taken and the
operation is executed within an rtnl lock section and the net device
won't be found after unregister.
[1] For example adding Tx queues after unregister ends up in NULL
pointer exceptions and UaFs, such as:
BUG: KASAN: use-after-free in kobject_get+0x14/0x90
Read of size 1 at addr ffff88801961248c by task ethtool/755
CPU: 0 PID: 755 Comm: ethtool Not tainted 5.15.0-rc6+ #778
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/014
Call Trace:
dump_stack_lvl+0x57/0x72
print_address_description.constprop.0+0x1f/0x140
kasan_report.cold+0x7f/0x11b
kobject_get+0x14/0x90
kobject_add_internal+0x3d1/0x450
kobject_init_and_add+0xba/0xf0
netdev_queue_update_kobjects+0xcf/0x200
netif_set_real_num_tx_queues+0xb4/0x310
veth_set_channels+0x1c3/0x550
ethnl_set_channels+0x524/0x610
Fixes: 041b1c5d4a ("ethtool: helper functions for netlink interface")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20211203101318.435618-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device is in lweventlist.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Note that other ip_tunnel users do not seem to hold a reference
on tunnel->dev. Probably needs some investigations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to track all dev_hold()/dev_put() to ease leak hunting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to track all dev_hold()/dev_put() to ease leak hunting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This helper might hold a netdev reference for a long time,
lets add reference tracking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will help debugging pesky netdev reference leaks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net device are refcounted. Over the years we had numerous bugs
caused by imbalanced dev_hold() and dev_put() calls.
The general idea is to be able to precisely pair each decrement with
a corresponding prior increment. Both share a cookie, basically
a pointer to private data storing stack traces.
This patch adds dev_hold_track() and dev_put_track().
To use these helpers, each data structure owning a refcount
should also use a "netdevice_tracker" to pair the hold and put.
netdevice_tracker dev_tracker;
...
dev_hold_track(dev, &dev_tracker, GFP_ATOMIC);
...
dev_put_track(dev, &dev_tracker);
Whenever a leak happens, we will get precise stack traces
of the point dev_hold_track() happened, at device dismantle phase.
We will also get a stack trace if too many dev_put_track() for the same
netdevice_tracker are attempted.
This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If sending a frame failed any sync command associated with it will never
be completed. As such, cancel any such command immediately to avoid
timing out.
Signed-off-by: Benjamin Berg <bberg@redhat.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
After transfer errors it makes sense to cancel an ongoing synchronous
command that cannot complete anymore. To permit this, export the old
hci_req_sync_cancel function as hci_cmd_sync_cancel in the API.
Signed-off-by: Benjamin Berg <bberg@redhat.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Resetting the timers and cmd_cnt means that we assume the device will be
in a good state after the sync command finishes. Without this a chain of
synchronous commands might get stuck if one of them is cancelled.
Signed-off-by: Benjamin Berg <bberg@redhat.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Short-lived connections increase the insertion rate requirements,
fill the offload table and provide very limited offload value since
they process a very small amount of packets. The ct ASSURED flag is
designed to filter short-lived connections for early expiration.
Offload connections when they are ESTABLISHED and ASSURED.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
and wireguard.
Current release - regressions:
- smc: keep smc_close_final()'s error code during active close
Current release - new code bugs:
- iwlwifi: various static checker fixes (int overflow, leaks, missing
error codes)
- rtw89: fix size of firmware header before transfer, avoid crash
- mt76: fix timestamp check in tx_status; fix pktid leak;
- mscc: ocelot: fix missing unlock on error in ocelot_hwstamp_set()
Previous releases - regressions:
- smc: fix list corruption in smc_lgr_cleanup_early
- ipv4: convert fib_num_tclassid_users to atomic_t
Previous releases - always broken:
- tls: fix authentication failure in CCM mode
- vrf: reset IPCB/IP6CB when processing outbound pkts, prevent
incorrect processing
- dsa: mv88e6xxx: fixes for various device errata
- rds: correct socket tunable error in rds_tcp_tune()
- ipv6: fix memory leak in fib6_rule_suppress
- wireguard: reset peer src endpoint when netns exits
- wireguard: improve resilience to DoS around incoming handshakes
- tcp: fix page frag corruption on page fault which involves TCP
- mpls: fix missing attributes in delete notifications
- mt7915: fix NULL pointer dereference with ad-hoc mode
Misc:
- rt2x00: be more lenient about EPROTO errors during start
- mlx4_en: update reported link modes for 1/10G
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGo6KMACgkQMUZtbf5S
Iruq+BAAhRMTcL+X4eRIL9lIEvWEKHMKLCA/pUaQWNlSxsbEeydJWRNSc37Cs3pv
z0rYIEhfieOz8+QXS1Kq+yZwJVXjA8Jvgld2qw9V9Y5w+N15Mj8RUtG8NaUw+o4E
U8PCAbaamnbzyPdlCYcVHschd8MD0BCXm5+jAGeIyCP+KQCnhEpFZv+bvHaWzQR8
FZLYrhXTR9W0DFsrKG9+haqFwFBR3+VDqTGILhaHPE+r2o6wKQQ5yJMhd8fq0SaC
nne8zDkGuFEeW3cxj0VbhdRMyrV97eMK+P4dZ2P0Z7xcrsed9/2XJkNQNJGtuRnj
GGJV6utupJRAY+lnJNUkifqS4Wt7KirfZsSsyaKKa4plyoVgtGhiqEYFTQVLagC0
CF4Qe+3qks6rESbRu6PEFN4oWSkMEhRzdcDpg7vBDURUKcrRs9fgtNUJUCi8nKFA
A/F/K+7IHBoBZyQYZbYmnGdNsNauKbF3rUY3hwMGBfQZIr/wsql9+jhtLsmZX77m
V/L7KzT2jhhNc5gDzuLps25K3P7snKuV19qQSsY2LeuGj1x3gmWZ+ibN6ynhB+Gt
KBnfHDMTI/4aciZBIbwJmwfeRhCF8tOfw0WZdUP7FRIXukbfVuDBoznWLz4BKKgf
GSYSTNDs/PHZQo5vCQ/onvTwUK5aN6zoPNy5ih7lp9YZBYtN2TI=
=r0Jh
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, and wireguard.
Mostly scattered driver changes this week, with one big clump in
mv88e6xxx. Nothing of note, really.
Current release - regressions:
- smc: keep smc_close_final()'s error code during active close
Current release - new code bugs:
- iwlwifi: various static checker fixes (int overflow, leaks, missing
error codes)
- rtw89: fix size of firmware header before transfer, avoid crash
- mt76: fix timestamp check in tx_status; fix pktid leak;
- mscc: ocelot: fix missing unlock on error in ocelot_hwstamp_set()
Previous releases - regressions:
- smc: fix list corruption in smc_lgr_cleanup_early
- ipv4: convert fib_num_tclassid_users to atomic_t
Previous releases - always broken:
- tls: fix authentication failure in CCM mode
- vrf: reset IPCB/IP6CB when processing outbound pkts, prevent
incorrect processing
- dsa: mv88e6xxx: fixes for various device errata
- rds: correct socket tunable error in rds_tcp_tune()
- ipv6: fix memory leak in fib6_rule_suppress
- wireguard: reset peer src endpoint when netns exits
- wireguard: improve resilience to DoS around incoming handshakes
- tcp: fix page frag corruption on page fault which involves TCP
- mpls: fix missing attributes in delete notifications
- mt7915: fix NULL pointer dereference with ad-hoc mode
Misc:
- rt2x00: be more lenient about EPROTO errors during start
- mlx4_en: update reported link modes for 1/10G"
* tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
net: dsa: b53: Add SPI ID table
gro: Fix inconsistent indenting
selftests: net: Correct case name
net/rds: correct socket tunable error in rds_tcp_tune()
mctp: Don't let RTM_DELROUTE delete local routes
net/smc: Keep smc_close_final rc during active close
ibmvnic: drop bad optimization in reuse_tx_pools()
ibmvnic: drop bad optimization in reuse_rx_pools()
net/smc: fix wrong list_del in smc_lgr_cleanup_early
Fix Comment of ETH_P_802_3_MIN
ethernet: aquantia: Try MAC address from device tree
ipv4: convert fib_num_tclassid_users to atomic_t
net: avoid uninit-value from tcp_conn_request
net: annotate data-races on txq->xmit_lock_owner
octeontx2-af: Fix a memleak bug in rvu_mbox_init()
net/mlx4_en: Fix an use-after-free bug in mlx4_en_try_alloc_resources()
vrf: Reset IPCB/IP6CB when processing outbound pkts in vrf dev xmit
net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings()
net: dsa: mv88e6xxx: Link in pcs_get_state() if AN is bypassed
net: dsa: mv88e6xxx: Fix inband AN for 2500base-x on 88E6393X family
...
Rename btf_member_bit_offset() and btf_member_bitfield_size() to
avoid conflicts with similarly named helpers in libbpf's btf.h.
Rename the kernel helpers, since libbpf helpers are part of uapi.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-3-alexei.starovoitov@gmail.com
The 'if (dev)' statement already move into dev_{put , hold}, so remove
redundant if statements.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'if (dev)' statement already move into dev_{put , hold}, so remove
redundant if statements.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct an error where setting /proc/sys/net/rds/tcp/rds_tcp_rcvbuf would
instead modify the socket's sk_sndbuf and would leave sk_rcvbuf untouched.
Fixes: c6a58ffed5 ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to test against the existing route type, not
the rtm_type in the netlink request.
Fixes: 83f0a0b728 ("mctp: Specify route types, require rtm_type in RTM_*ROUTE messages")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When smc_close_final() returns error, the return code overwrites by
kernel_sock_shutdown() in smc_close_active(). The return code of
smc_close_final() is more important than kernel_sock_shutdown(), and it
will pass to userspace directly.
Fix it by keeping both return codes, if smc_close_final() raises an
error, return it or kernel_sock_shutdown()'s.
Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/
Fixes: 606a63c978 ("net/smc: Ensure the active closing peer first closes clcsock")
Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before commit faa041a40b ("ipv4: Create cleanup helper for fib_nh")
changes to net->ipv4.fib_num_tclassid_users were protected by RTNL.
After the change, this is no longer the case, as free_fib_info_rcu()
runs after rcu grace period, without rtnl being held.
Fixes: faa041a40b ("ipv4: Create cleanup helper for fib_nh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit aeeecb8891.
The new SNMP variable (TCPSmallQueueFailure) can be incremented
for good reasons, even on a 100Gbit single TCP_STREAM flow.
If we really wanted to ease driver debugging [1], this would
require something more sophisticated.
[1] Usually, if a driver is delaying TX completions too much,
this can lead to stalls in TCP output. Various work arounds
have been used in the past, like skb_orphan() in ndo_start_xmit().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Menglong Dong <imagedong@tencent.com>
Link: https://lore.kernel.org/r/20211201033246.2826224-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support the use of phylink_generic_validate() when there is no
phylink_validate method given in the DSA switch operations and
mac_capabilities have been set in the phylink_config structure by the
DSA switch driver.
This gives DSA switch drivers the option to use this if they provide
the supported_interfaces and mac_capabilities, while still giving them
an option to override the default implementation if necessary.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Phylink needs slightly more information than phylink_get_interfaces()
allows us to get from the DSA drivers - we need the MAC capabilities.
Replace the phylink_get_interfaces() method with phylink_get_caps() to
allow DSA drivers to fill in the phylink_config MAC capabilities field
as well.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The code in port.c and slave.c creating the phylink instance is very
similar - let's consolidate this into a single function.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
!pols[0] is checked earlier. If we don't return, pols[0] is always
true. We should drop the check of pols[0] for the second time and the
binary is also smaller.
Before:
text data bss dec hex filename
48395 957 240 49592 c1b8 net/xfrm/xfrm_policy.o
After:
text data bss dec hex filename
48379 957 240 49576 c1a8 net/xfrm/xfrm_policy.o
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This makes 'bridge-nf-filter-pppoe-tagged' sysctl work for
bridged traffic.
Looking at the original commit it doesn't appear this ever worked:
static unsigned int br_nf_post_routing(unsigned int hook, struct sk_buff **pskb,
[..]
if (skb->protocol == htons(ETH_P_8021Q)) {
skb_pull(skb, VLAN_HLEN);
skb->network_header += VLAN_HLEN;
+ } else if (skb->protocol == htons(ETH_P_PPP_SES)) {
+ skb_pull(skb, PPPOE_SES_HLEN);
+ skb->network_header += PPPOE_SES_HLEN;
}
[..]
NF_HOOK(... POST_ROUTING, ...)
... but the adjusted offsets are never restored.
The alternative would be to rip this code out for good,
but otoh we'd have to keep this anyway for the vlan handling
(which works because vlan tag info is in the skb, not the packet
payload).
Reported-and-tested-by: Amish Chana <amish@3g.co.za>
Fixes: 516299d2f5 ("[NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow packet redirection to another interface upon egress.
[lukas: set skb_iif, add commit message, original patch from Pablo. ]
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
net/netfilter/nfnetlink_queue.c:601:36: warning: variable 'ctinfo' is
uninitialized when used here [-Wuninitialized]
if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0)
ctinfo is only uninitialized if ct == NULL. Init it to 0 to silence this.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
dying is bool, the type conversion to true/false value is not
needed.
Signed-off-by: Bernard Zhao <bernard@vivo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its no longer needed after commit 8702997074
("netfilter: nf_queue: move hookfn registration out of struct net").
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() to avoid confusing memset() about writing beyond
the target struct member.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dest variable is not used after ip_vs_new_dest anymore in
ip_vs_add_dest, do not need pass it to ip_vs_new_dest, remove it.
Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The eBPF name has completely taken over from eBPF in general usage for
the actual eBPF representation, or BPF for any general in-kernel use.
Prune all remaining references to "internal BPF".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de
The devlink_resources_unregister() used second parameter as an
entry point for the recursive removal of devlink resources. None
of the callers outside of devlink core needed to use this field,
so let's remove it.
As part of this removal, the "struct devlink_resource" was moved
from .h to .c file as it is not possible to use in any place in
the code except devlink.c.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can remove a bit of code duplication by reusing the new
fib6_nh_release_dsts helper in fib6_nh_release. Their only difference is
that fib6_nh_release's version doesn't use atomic operation to swap the
pointers because it assumes the fib6_nh is no longer visible, while
fib6_nh_release_dsts can be used anywhere.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can optimize resilient nexthop group replaces by reducing the number of
synchronize_net calls. After commit 1005f19b93 ("net: nexthop: release
IPv6 per-cpu dsts when replacing a nexthop group") we always do a
synchronize_net because we must ensure no new dsts can be created for the
replaced group's removed nexthops, but we already did that when replacing
resilient groups, so if we always call synchronize_net after any group
type replacement we'll take care of both cases and reduce synchronize_net
calls for resilient groups.
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Assigning crypto_info variables in advance can simplify the logic
of accessing value and move related local variables to a smaller
scope.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'dest' bitmap is fully initialized by the 'for' loop, so there is no
need to explicitly reset it.
This also makes this function in line with 'ethnl_features_to_bitmap32()'
which does not clear the destination before writing it.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/17fca158231c6f03689bd891254f0dd1f4e84cb8.1638091829.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmGk9MIACgkQ+7dXa6fL
C2sFUw/9EulPVJSzUywa6Jf13BqWCqciuMoo+NrNlTFY7P0CnurloqOl+6hJ+hYI
cKr4JaClzOR/LT8DaJHYp+uyB0cn5U/udEDr2iekxvh16B/GDDRPWDDIvm9mzczw
jVqzZHH54Jl6s+fnfkIhkLZA6rTuH4kFtsEb1S0VBXIJC1zv4O79Ba15nxkGPQoQ
GfJzZqLTUOMmWPoiTuKTJmPhvCPHEzIGduKzU2hRgjrDJdk695BGdwFK1cs6/BcA
n0aEVnYmndnaJ3MyLjxzROi1D6MFfHs8+a4vKTD13kcWQRT8kPE51/KKhMvUELHn
HJsVXqFeip2FyQFqRlmNfRgfvhC+SWV9kceP0G6pZIoLAXlEMGkmhmB+BWZtnY6U
yXaOJlwchNEP83f2KWUBkX4dV9Zx8mdy5UUQBTSQbMwFxSgZ5FfOyCkf98xx03DS
LTqV9ZciJ+Z6QbDscYIrnJV2WvWTJeyUB5lsI8W3lAXXugu28WHL+lorhEKIIJWG
jdmxO4sEa29tdE7FVngQ3RLcNJETcJCKpMypNaAhWOCPlfnW7t0FpkcAoza7t1yy
DGQaGkR7zC1sZ5hr31YvWBhaZT8IjlAePOg/1GfpUnyL9122jsAqR9dz0xgPGnf6
fK0BLMcfnuYObtJdSEOAMqnx2qkRKTU9qrT2vXNWd/bApFTY4E0=
=q6GF
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20211129' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Leak fixes
Here are a couple of fixes for leaks in AF_RXRPC:
(1) Fix a leak of rxrpc_peer structs in rxrpc_look_up_bundle().
(2) Fix a leak of rxrpc_local structs in rxrpc_lookup_peer().
* tag 'rxrpc-fixes-20211129' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
rxrpc: Fix rxrpc_local leak in rxrpc_lookup_peer()
rxrpc: Fix rxrpc_peer leak in rxrpc_look_up_bundle()
====================
Link: https://lore.kernel.org/r/163820097905.226370.17234085194655347888.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each peer's endpoint contains a dst_cache entry that takes a reference
to another netdev. When the containing namespace exits, we take down the
socket and prevent future sockets from being created (by setting
creating_net to NULL), which removes that potential reference on the
netns. However, it doesn't release references to the netns that a netdev
cached in dst_cache might be taking, so the netns still might fail to
exit. Since the socket is gimped anyway, we can simply clear all the
dst_caches (by way of clearing the endpoint src), which will release all
references.
However, the current dst_cache_reset function only releases those
references lazily. But it turns out that all of our usages of
wg_socket_clear_peer_endpoint_src are called from contexts that are not
exactly high-speed or bottle-necked. For example, when there's
connection difficulty, or when userspace is reconfiguring the interface.
And in particular for this patch, when the netns is exiting. So for
those cases, it makes more sense to call dst_release immediately. For
that, we add a small helper function to dst_cache.
This patch also adds a test to netns.sh from Hangbin Liu to ensure this
doesn't regress.
Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Fixes: 900575aa33 ("wireguard: device: avoid circular netns references")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Need to call rxrpc_put_local() for peer candidate before kfree() as it
holds a ref to rxrpc_local.
[DH: v2: Changed to abstract the peer freeing code out into a function]
Fixes: 9ebeddef58 ("rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record")
Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/all/20211121041608.133740-2-eiichi.tsukata@nutanix.com/ # v1
Need to call rxrpc_put_peer() for bundle candidate before kfree() as it
holds a ref to rxrpc_peer.
[DH: v2: Changed to abstract out the bundle freeing code into a function]
Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20211121041608.133740-1-eiichi.tsukata@nutanix.com/ # v1
The kernel leaks memory when a `fib` rule is present in IPv6 nftables
firewall rules and a suppress_prefix rule is present in the IPv6 routing
rules (used by certain tools such as wg-quick). In such scenarios, every
incoming packet will leak an allocation in `ip6_dst_cache` slab cache.
After some hours of `bpftrace`-ing and source code reading, I tracked
down the issue to ca7a03c417 ("ipv6: do not free rt if
FIB_LOOKUP_NOREF is set on suppress rule").
The problem with that change is that the generic `args->flags` always have
`FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag
`RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not
decreasing the refcount when needed.
How to reproduce:
- Add the following nftables rule to a prerouting chain:
meta nfproto ipv6 fib saddr . mark . iif oif missing drop
This can be done with:
sudo nft create table inet test
sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }'
sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop
- Run:
sudo ip -6 rule add table main suppress_prefixlength 0
- Watch `sudo slabtop -o | grep ip6_dst_cache` to see memory usage increase
with every incoming ipv6 packet.
This patch exposes the protocol-specific flags to the protocol
specific `suppress` function, and check the protocol-specific `flags`
argument for RT6_LOOKUP_F_DST_NOREF instead of the generic
FIB_LOOKUP_NOREF when decreasing the refcount, like this.
[1]: ca7a03c417/net/ipv6/fib6_rules.c (L71)
[2]: ca7a03c417/net/ipv6/fib6_rules.c (L99)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105
Fixes: ca7a03c417 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once tcp small queue check failed in tcp_small_queue_check(), the
throughput of tcp will be limited, and it's hard to distinguish
whether it is out of tcp congestion control.
Add statistics of LINUX_MIB_TCPSMALLQUEUEFAILURE for this scene.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET command doesn't have .doit callback
and has no use in internal_flags at all. Remove this misleading assignment.
Fixes: e44ef4e451 ("devlink: Hang reporter's dump method on a dumpit cb")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In our test device, we're currently freeing skbs in the transmit path
with kfree(), rather than kfree_skb(). This change uses the correct
kfree_skb() instead.
Fixes: ded21b7229 ("mctp: Add test utils")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the TLS cipher suite uses CCM mode, including AES CCM and
SM4 CCM, the first byte of the B0 block is flags, and the real
IV starts from the second byte. The XOR operation of the IV and
rec_seq should be skip this byte, that is, add the iv_offset.
Fixes: f295b3ae9f ("net/tls: Add support of AES128-CCM based ciphers")
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Cc: Vakul Garg <vakul.garg@nxp.com>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: David S. Miller <davem@davemloft.net>
There are separate for_nexthops and change_nexthops iterators. The
for_nexthops variant should use const.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__nh is just a copy of nh with a different type.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following the previous commit, nh_dev can no longer be accessed and
modified concurrently.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are various problems related to netlink notifications for mpls route
changes in response to interfaces being deleted:
* delete interface of only nexthop
DELROUTE notification is missing RTA_OIF attribute
* delete interface of non-last nexthop
NEWROUTE notification is missing entirely
* delete interface of last nexthop
DELROUTE notification is missing nexthop
All of these problems stem from the fact that existing routes are modified
in-place before sending a notification. Restructure mpls_ifdown() to avoid
changing the route in the DELROUTE cases and to create a copy in the
NEWROUTE case.
Fixes: f8efb73c97 ("mpls: multipath route support")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The writer acquires dev_base_lock with disabled bottom halves.
The reader can acquire dev_base_lock without disabling bottom halves
because there is no writer in softirq context.
On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves (as in write_lock_bh()) and somewhere else with
enabled bottom halves (as by read_lock() in netstat_show()) followed by
disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
-> spin_lock_bh()). This is the reverse locking order.
All read_lock() invocation are from sysfs callback which are not invoked
from softirq context. Therefore there is no need to disable bottom
halves while acquiring a write lock.
Acquire the write lock of dev_base_lock without disabling bottom halves.
Reported-by: Pei Zhang <pezhang@redhat.com>
Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously commit e02d494d2c ("l2tp: Convert rwlock to RCU") converted
most, but not all, rwlock instances in the l2tp subsystem to RCU.
The remaining rwlock protects the per-tunnel hashlist of sessions which
is used for session lookups in the UDP-encap data path.
Convert the remaining rwlock to rcu to improve performance of UDP-encap
tunnels.
Note that the tunnel and session, which both live on RCU-protected
lists, use slightly different approaches to incrementing their refcounts
in the various getter functions.
The tunnel has to use refcount_inc_not_zero because the tunnel shutdown
process involves dropping the refcount to zero prior to synchronizing
RCU readers (via. kfree_rcu).
By contrast, the session shutdown removes the session from the list(s)
it is on, synchronizes with readers, and then decrements the session
refcount. Since the getter functions increment the session refcount
with the RCU read lock held we prevent getters seeing a zero session
refcount, and therefore don't need to use refcount_inc_not_zero.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reload the regdom when the regulatory db is reloaded.
Otherwise, the user had to change the regulatoy domain
to a different one and then reset it to the correct
one to have a new regulatory db take effect after a
reload.
Signed-off-by: Finn Behrens <fin@nyantec.com>
Link: https://lore.kernel.org/r/YaIIZfxHgqc/UTA7@gimli.kloenk.dev
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It's easier to use and understand, and to extend for EHT later,
if we use the values here instead of the shifted values.
Unfortunately, we need to add _POS so that we can use it in
places like iwlwifi/mvm where constants are needed.
While at it, fix the typo ("NOMIMAL") which also helps catch any
conflicts.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://lore.kernel.org/r/20211126104817.7c29a05b8eb5.I2ca9faf06e177e3035bec91e2ae53c2f91d41774@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Misc fixes all over the place.
Revert of virtio used length validation series: the approach taken does
not seem to work, breaking too many guests in the process. We'll need to
do length validation using some other approach.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmGe0sEPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRp8WEH/imDIq1iduDeAuvFnmrm5eEO9w3wzXCT4NiG
8Pla241FzQ1pEFEAne16KP0+SlLhj7P0oc5FR8vkYvxxuyneDbCzcS2M1kYMOpA1
ry28PuObAnekzE/WXxvC031ozB5Zb/FL54gmw+/1EdAOdMGL0CdQ1aJxREBHRTBo
p4ZHr83GA2D2C/IyKCsgQ8cB9ZrMqImTQQ4vRD89HoFBp+GH2u2Di1iyXEWuOqdI
n1+7M9jjbyW8A+N1bkOicpShS/6UcyJQOOcg8kvUQOV6srVkYhfaiWC/CbOP2g73
8PKK+/K2Htf92s6RdvDUPSKmvqGR/4KPZWPtWThXBYXGgWul0uI=
=q6tO
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost,virtio,vdpa bugfixes from Michael Tsirkin:
"Misc fixes all over the place.
Revert of virtio used length validation series: the approach taken
does not seem to work, breaking too many guests in the process. We'll
need to do length validation using some other approach"
[ This merge also ends up reverting commit f7a36b03a7 ("vsock/virtio:
suppress used length validation"), which came in through the
networking tree in the meantime, and was part of that whole used
length validation series - Linus ]
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa_sim: avoid putting an uninitialized iova_domain
vhost-vdpa: clean irqs before reseting vdpa device
virtio-blk: modify the value type of num in virtio_queue_rq()
vhost/vsock: cleanup removing `len` variable
vhost/vsock: fix incorrect used length reported to the guest
Revert "virtio_ring: validate used buffer length"
Revert "virtio-net: don't let virtio core to validate used length"
Revert "virtio-blk: don't let virtio core to validate used length"
Revert "virtio-scsi: don't let virtio core to validate used buffer length"
Highlights include:
Stable fixes:
- NFSv42: Fix pagecache invalidation after COPY/CLONE
Bugfixes:
- NFSv42: Don't fail clone() just because the server failed to return
post-op attributes
- SUNRPC: use different lockdep keys for INET6 and LOCAL
- NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
- SUNRPC: fix header include guard in trace header
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGiTnMACgkQZwvnipYK
APJoJQ//VZYSCx/mGaTIj5oUwjBKE/n/9rz2EUGS1cfYjZjPpb5Xgm1tn+1A4c01
Ztu9/hKgwrDqknqkmtKvP1GsX5vYUqgfqAlc880Q2nXAqaLJBBZgB6BFMmTtcoQx
C24L0tgxlZVD9Vw0DJEVDVgDxXA/9VmdSQK6uptQRQhcYf4VtR1wAzELHWdkdkfq
1WrREeAwGWw1BaPTrlPn9XwW9qTaMlBH05XRHh6dM7gFmoIe3td7kq7BOnFxFsnA
AQZ/nCgeMTE04kQQMzYqUc4YqZvnzUHxueZ6q8s0K1RKJBpNIQNgdWUMa285Qo4a
JA9oBCPPo0JjmsEge2Km12zyBJoA7lLQDfc6UQJON50ADF0sTu3wszgGuC63KkhE
V+kUogK2WOlnGky2yYrHmv43mcCcyoJ/g+g+38GXNYGorsFi/XUhvctEpFFFPF71
0umQwWhA6Dhc52hMj5DN4nfspp//hEuV9o7/zJzlPi0elC+xtVaBWZCorNvznlXW
C/O5yobJVd89PuIE17Stg+c0Rq3k7RVPPoyS2IaMkZ5cs7DtT5Tz3nKClPAA8Bur
mPLAMkHSOSLO26cy30SVZCIx1JDMJU9dx/PkFemCexkzYXQxgp9px/8wmM2Xe6Oc
/hDqi8V7ayJUuYpYuJ6sA8oUqj3j+NkoP4w5HhlnmZDBhFMEBhM=
=jeHm
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- NFSv42: Fix pagecache invalidation after COPY/CLONE
Bugfixes:
- NFSv42: Don't fail clone() just because the server failed to return
post-op attributes
- SUNRPC: use different lockdep keys for INET6 and LOCAL
- NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
- SUNRPC: fix header include guard in trace header"
* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: use different lock keys for INET6 and LOCAL
sunrpc: fix header include guard in trace header
NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
NFSv42: Fix pagecache invalidation after COPY/CLONE
NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
NFSv42: Don't fail clone() unless the OP_CLONE operation failed
When we bind an AF_UNIX socket without a name specified, the kernel selects
an available one from 0x00000 to 0xFFFFF. unix_autobind() starts searching
from a number in the 'static' variable and increments it after acquiring
two locks.
If multiple processes try autobind, they obtain the same lock and check if
a socket in the hash list has the same name. If not, one process uses it,
and all except one end up retrying the _next_ number (actually not, it may
be incremented by the other processes). The more we autobind sockets in
parallel, the longer the latency gets. We can avoid such a race by
searching for a name from a random number.
These show latency in unix_autobind() while 64 CPUs are simultaneously
autobind-ing 1024 sockets for each.
Without this patch:
usec : count distribution
0 : 1176 |*** |
2 : 3655 |*********** |
4 : 4094 |************* |
6 : 3831 |************ |
8 : 3829 |************ |
10 : 3844 |************ |
12 : 3638 |*********** |
14 : 2992 |********* |
16 : 2485 |******* |
18 : 2230 |******* |
20 : 2095 |****** |
22 : 1853 |***** |
24 : 1827 |***** |
26 : 1677 |***** |
28 : 1473 |**** |
30 : 1573 |***** |
32 : 1417 |**** |
34 : 1385 |**** |
36 : 1345 |**** |
38 : 1344 |**** |
40 : 1200 |*** |
With this patch:
usec : count distribution
0 : 1855 |****** |
2 : 6464 |********************* |
4 : 9936 |******************************** |
6 : 12107 |****************************************|
8 : 10441 |********************************** |
10 : 7264 |*********************** |
12 : 4254 |************** |
14 : 2538 |******** |
16 : 1596 |***** |
18 : 1088 |*** |
20 : 800 |** |
22 : 670 |** |
24 : 601 |* |
26 : 562 |* |
28 : 525 |* |
30 : 446 |* |
32 : 378 |* |
34 : 337 |* |
36 : 317 |* |
38 : 314 |* |
40 : 298 | |
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To replace unix_table_lock with per-hash locks in the next patch, we need
to save a hash in each socket because /proc/net/unix or BPF prog iterate
sockets while holding a hash table lock and release it later in a different
function.
Currently, we store a real/pseudo hash in struct unix_address. However, we
do not allocate it to unbound sockets, nor should we do just for that. For
this purpose, we can use sk_hash. Then, we no longer use the hash field in
struct unix_address and can remove it.
Also, this patch does
- rename unix_insert_socket() to unix_insert_unbound_socket()
- remove the redundant list argument from __unix_insert_socket() and
unix_insert_unbound_socket()
- use 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash()
- remove 'inline' from unix_remove_socket() and
unix_insert_unbound_socket().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds three helper functions that calculate hashes for unbound
sockets and bound sockets with BSD/abstract addresses.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In BSD and abstract address cases, we store sockets in the hash table with
keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket
varies depending on its address type; sockets with BSD addresses always
have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.
This is just for the UNIX_ABSTRACT() macro used to check the address type.
The difference of the saved hashes comes from the first byte of the address
in the first place. So, we can test it directly.
Then we can keep a real hash in each socket and replace unix_table_lock
with per-hash locks in the later patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To terminate address with '\0' in unix_bind_bsd(), we add
unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract().
Also, unix_bind_abstract() does not return -EEXIST. Only
kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it,
so we move the last error check in unix_bind() to unix_bind_bsd().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch removes unix_mkname() and postpones calculating a hash to
unix_bind_abstract(). Some BSD stuffs still remain in unix_bind()
though, the next patch packs them into unix_bind_bsd().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We should not call unix_mkname() before unix_find_other() and instead do
the same thing where necessary based on the address type:
- terminating the address with '\0' in unix_find_bsd()
- calculating the hash in unix_find_abstract().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unix_mkname() tests socket address length and family and does some
processing based on the address type. It is called in the early stage,
and therefore some instructions are redundant and can end up in vain.
The address length/family tests are done twice in unix_bind(). Also, the
address type is rechecked later in unix_bind() and unix_find_other(), where
we can do the same processing. Moreover, in the BSD address case, the hash
is set to 0 but never used and confusing.
This patch moves the address tests out of unix_mkname(), and the following
patches move the other part into appropriate places and remove
unix_mkname() finally.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can return an error as a pointer and need not pass an additional
argument to unix_find_other().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As done in the commit fa42d910a3 ("unix_bind(): take BSD and abstract
address cases into new helpers"), this patch moves BSD and abstract address
cases from unix_find_other() into unix_find_bsd() and unix_find_abstract().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We do not use struct socket in unix_autobind() and pass struct sock to
unix_bind_bsd() and unix_bind_abstract(). Let's pass it to unix_autobind()
as well.
Also, this patch fixes these errors by checkpatch.pl.
ERROR: do not use assignment in if condition
#1795: FILE: net/unix/af_unix.c:1795:
+ if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
CHECK: Logical continuations should be on the previous line
#1796: FILE: net/unix/af_unix.c:1796:
+ if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
+ && (err = unix_autobind(sock)) != 0)
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The length of the AF_UNIX socket address contains an offset to the member
sun_path of struct sockaddr_un.
Currently, the preceding member is just sun_family, and its type is
sa_family_t and resolved to short. Therefore, the offset is represented by
sizeof(short). However, it is not clear and fragile to changes in struct
sockaddr_storage or sockaddr_un.
This commit makes it clear and robust by rewriting sizeof() with
offsetof().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The netdev (e.g. ifb, bareudp), which not support ethtool ops
(e.g. .get_drvinfo), we can use the rtnl kind as a default name.
ifb netdev may be created by others prefix, not ifbX.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Hao Chen <chenhao288@hisilicon.com>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Danielle Ratson <danieller@nvidia.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20211125163049.84970-1-xiangxia.m.yue@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When applications call shutdown() with SHUT_RDWR in userspace,
smc_close_active() calls kernel_sock_shutdown(), and it is called
twice in smc_shutdown().
This fixes this by checking sk_state before do clcsock shutdown, and
avoids missing the application's call of smc_shutdown().
Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/
Fixes: 606a63c978 ("net/smc: Ensure the active closing peer first closes clcsock")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20211126024134.45693-1-tonylu@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Inject error before dev_hold(real_dev) in register_vlan_dev(),
and execute the following testcase:
ip link add dev dummy1 type dummy
ip link add name dummy1.100 link dummy1 type vlan id 100
ip link del dev dummy1
When the dummy netdevice is removed, we will get a WARNING as following:
=======================================================================
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 2 PID: 0 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0
and an endless loop of:
=======================================================================
unregister_netdevice: waiting for dummy1 to become free. Usage count = -1073741824
That is because dev_put(real_dev) in vlan_dev_free() be called without
dev_hold(real_dev) in register_vlan_dev(). It makes the refcnt of real_dev
underflow.
Move the dev_hold(real_dev) to vlan_dev_init() which is the call-back of
ndo_init(). That makes dev_hold() and dev_put() for vlan's real_dev
symmetrical.
Fixes: 563bcbae3b ("net: vlan: fix a UAF in vlan_dev_real_dev()")
Reported-by: Petr Machata <petrm@nvidia.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/r/20211126015942.2918542-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ethtool_set_coalesce() now uses both the .get_coalesce() and
.set_coalesce() callbacks. But the check for their availability is
buggy, so changing the coalesce settings on a device where the driver
provides only _one_ of the callbacks results in a NULL pointer
dereference instead of an -EOPNOTSUPP.
Fix the condition so that the availability of both callbacks is
ensured. This also matches the netlink code.
Note that reproducing this requires some effort - it only affects the
legacy ioctl path, and needs a specific combination of driver options:
- have .get_coalesce() and .coalesce_supported but no
.set_coalesce(), or
- have .set_coalesce() but no .get_coalesce(). Here eg. ethtool doesn't
cause the crash as it first attempts to call ethtool_get_coalesce()
and bails out on error.
Fixes: f3ccfda193 ("ethtool: extend coalesce setting uAPI with CQE mode")
Cc: Yufeng Mo <moyufeng@huawei.com>
Cc: Huazhong Tan <tanhuazhong@huawei.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20211126175543.28000-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When color change is triggered in multiple bssid case, allow
only for transmitting BSS, and when it changes its bss color,
notify the non transmitting BSSs also of the new bss color.
Signed-off-by: John Crispin <john@phrozen.org>
Co-developed-by: Lavanya Suresh <lavaks@codeaurora.org>
Signed-off-by: Lavanya Suresh <lavaks@codeaurora.org>
Co-developed-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://lore.kernel.org/r/1637146647-16282-1-git-send-email-quic_ramess@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow continuous radar detection on the offchannel chain in order
to switch to the monitored channel whenever the underlying driver
reports a radar pattern on the main channel.
Tested-by: Owen Peng <owen.peng@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/d46217310a49b14ff0e9c002f0a6e0547d70fd2c.1637071350.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If necessary schedule offchan_cac_abort_wk work in cfg80211_radar_event
routine adding offchan parameter to cfg80211_radar_event signature.
Rename cfg80211_radar_event in __cfg80211_radar_event and introduce
the two following inline helpers:
- cfg80211_radar_event
- cfg80211_offchan_radar_event
Doing so the drv will not need to run cfg80211_offchan_cac_abort() after
radar detection on the offchannel chain.
Tested-by: Owen Peng <owen.peng@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/3ff583e021e3343a3ced54a7b09b5e184d1880dc.1637062727.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When kzalloc failed and rdev->sacn_req or rdev->scan_msg is null, pass a
null pointer to kfree is redundant, delete it and return directly.
Signed-off-by: liuguoqiang <liuguoqiang@uniontech.com>
Link: https://lore.kernel.org/r/20211115092139.24407-1-liuguoqiang@uniontech.com
[remove now unused creq = NULL assigment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows drivers to provide a destination device + info for flow offload
Only supported in combination with 802.3 encap offload
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Tested-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/20211112112223.1209-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The assignment of these three local variables in the file will not
be used in the corresponding functions, so they should be deleted.
The clang_analyzer complains as follows:
net/mac80211/wpa.c:689:2 warning:
net/mac80211/wpa.c:883:2 warning:
net/mac80211/wpa.c:452:2 warning:
Value stored to 'hdr' is never read
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Link: https://lore.kernel.org/r/20211104061411.1744-1-luo.penghao@zte.com.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The following is from a system that went OOM due to a memory leak:
wlan0: Allocated STA 74:83:c2:64:0b:87
wlan0: Allocated STA 74:83:c2:64:0b:87
wlan0: IBSS finish 74:83:c2:64:0b:87 (---from ieee80211_ibss_add_sta)
wlan0: Adding new IBSS station 74:83:c2:64:0b:87
wlan0: moving STA 74:83:c2:64:0b:87 to state 2
wlan0: moving STA 74:83:c2:64:0b:87 to state 3
wlan0: Inserted STA 74:83:c2:64:0b:87
wlan0: IBSS finish 74:83:c2:64:0b:87 (---from ieee80211_ibss_work)
wlan0: Adding new IBSS station 74:83:c2:64:0b:87
wlan0: moving STA 74:83:c2:64:0b:87 to state 2
wlan0: moving STA 74:83:c2:64:0b:87 to state 3
.
.
wlan0: expiring inactive not authorized STA 74:83:c2:64:0b:87
wlan0: moving STA 74:83:c2:64:0b:87 to state 2
wlan0: moving STA 74:83:c2:64:0b:87 to state 1
wlan0: Removed STA 74:83:c2:64:0b:87
wlan0: Destroyed STA 74:83:c2:64:0b:87
The ieee80211_ibss_finish_sta() is called twice on the same STA from 2
different locations. On the second attempt, the allocated STA is not
destroyed creating a kernel memory leak.
This is happening because sta_info_insert_finish() does not call
sta_info_free() the second time when the STA already exists (returns
-EEXIST). Note that the caller sta_info_insert_rcu() assumes STA is
destroyed upon errors.
Same fix is applied to -ENOMEM.
Signed-off-by: Ahmed Zaki <anzaki@gmail.com>
Link: https://lore.kernel.org/r/20211002145329.3125293-1-anzaki@gmail.com
[change the error path label to use the existing code]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some drivers that do their own sequence number allocation (e.g. ath9k) rely
on being able to modify params->ssn on starting tx ampdu sessions.
This was broken by a change that modified it to use sta->tid_seq[tid] instead.
Cc: stable@vger.kernel.org
Fixes: 31d8bb4e07 ("mac80211: agg-tx: refactor sending addba")
Reported-by: Eneas U de Queiroz <cotequeiroz@gmail.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20211124094024.43222-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since retransmission clears info->control, rate control needs to be called
again, otherwise the driver might crash due to invalid rates.
Cc: stable@vger.kernel.org # 5.14+
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Reported-by: Robert W <rwbugreport@lost-in-the-void.net>
Fixes: 03c3911d2d ("mac80211: call ieee80211_tx_h_rate_ctrl() when dequeue")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Link: https://lore.kernel.org/r/20211122204323.9787-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, the probe timer is reused as the raise timer when PLPMTUD is in
the Search Complete state. raise_count was introduced to count how many
times the probe timer has timed out. When raise_count reaches to 30, the
raise timer handler will be triggered.
During the whole processing above, the timer keeps timing out every probe_
interval. It is a waste for the Search Complete state, as the raise timer
only needs to time out after 30 * probe_interval.
Since the raise timer and probe timer are never used at the same time, it
is no need to keep probe timer 'alive' in the Search Complete state. This
patch to introduce sctp_transport_reset_raise_timer() to start the timer
as the raise timer when entering the Search Complete state. When entering
the other states, sctp_transport_reset_probe_timer() will still be called
to reset the timer to the probe timer.
raise_count can be removed from sctp_transport as no need to count probe
timer timeout for raise timer timeout. last_rtx_chunks can be removed as
sctp_transport_reset_probe_timer() can be called in the place where asoc
rtx_data_chunks is changed.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/edb0e48988ea85997488478b705b11ddc1ba724a.1637781974.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a skb comes to tipc_aead_encrypt(), it's always linear. The
unlikely check 'skb_cloned(skb) && tailen <= skb_tailroom(skb)'
can completely be taken care of in skb_cow_data() by the code
in branch "if (!skb_has_frag_list())".
Also, remove the 'TODO:' annotation, as the pages in skbs are not
writable, see more on commit 3cf4375a09 ("tipc: do not write
skb_shinfo frags when doing decrytion").
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Link: https://lore.kernel.org/r/47a478da0b6095b76e3cbe7a75cbd25d9da1df9a.1637773872.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We replace proto_ops whenever TLS is configured for RX. But our
replacement also overrides sendpage_locked, which will crash
unless TX is also configured. Similarly we plug both of those
in for TLS_HW (NIC crypto offload) even tho TLS_HW has a completely
different implementation for TX.
Last but not least we always plug in something based on inet_stream_ops
even though a few of the callbacks differ for IPv6 (getname, release,
bind).
Use a callback building method similar to what we do for struct proto.
Fixes: c46234ebb4 ("tls: RX path for ktls")
Fixes: d4ffb02dee ("net/tls: enable sk_msg redirect to tls socket egress")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
recvmsg() will put peek()ed and partially read records onto the rx_list.
splice_read() needs to consult that list otherwise it may miss data.
Align with recvmsg() and also put partially-read records onto rx_list.
tls_sw_advance_skb() is pretty pointless now and will be removed in
net-next.
Fixes: 692d7b5d1f ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We don't support splicing control records. TLS 1.3 changes moved
the record type check into the decrypt if(). The skb may already
be decrypted and still be an alert.
Note that decrypt_skb_update() is idempotent and updates ctx->decrypted
so the if() is pointless.
Reorder the check for decryption errors with the content type check
while touching them. This part is not really a bug, because if
decryption failed in TLS 1.3 content type will be DATA, and for
TLS 1.2 it will be correct. Nevertheless its strange to touch output
before checking if the function has failed.
Fixes: fedf201e12 ("net: tls: Refactor control message handling on recv")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When doing remote name request, we cannot scan. In the normal case it's
OK since we can expect it to finish within a short amount of time.
However, there is a possibility to scan lots of devices that
(1) requires Remote Name Resolve
(2) is unresponsive to Remote Name Resolve
When this happens, we are stuck to do Remote Name Resolve until all is
done before continue scanning.
This patch adds a time limit to stop us spending too long on remote
name request.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Introducing NAME_REQUEST_FAILED flag that will be sent together with
device found event on name resolve failure. This will provide the
userspace with an information so it can decide not to resolve the
name for these devices in the future.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
num_keys is actually 2 octects not 1:
BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 4, Part E
page 1989:
Num_Keys_Deleted:
Size: 2 octets
0xXXXX Number of Link Keys Deleted
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Both max_num_keys and num_key are 2 octects:
BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 4, Part E
page 1985:
Max_Num_Keys:
Size: 2 octets
Range: 0x0000 to 0xFFFF
Num_Keys_Read:
Size: 2 octets
Range: 0x0000 to 0xFFFF
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The kernel_listen function in smc_listen will fail when all the available
ports are occupied. At this point smc->clcsock->sk->sk_data_ready has
been changed to smc_clcsock_data_ready. When we call smc_listen again,
now both smc->clcsock->sk->sk_data_ready and smc->clcsk_data_ready point
to the smc_clcsock_data_ready function.
The smc_clcsock_data_ready() function calls lsmc->clcsk_data_ready which
now points to itself resulting in an infinite loop.
This patch restores smc->clcsock->sk->sk_data_ready with the old value.
Fixes: a60a2b1e0a ("net/smc: reduce active tcp_listen workers")
Signed-off-by: Guo DaXing <guodaxing@huawei.com>
Acked-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Coverity reports a possible NULL dereferencing problem:
in smc_vlan_by_tcpsk():
6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times).
7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next.
1623 ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS)
8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev.
1624 if (is_vlan_dev(ndev)) {
Remove the manual implementation and use netdev_walk_all_lower_dev() to
iterate over the lower devices. While on it remove an obsolete function
parameter comment.
Fixes: cb9d43f677 ("net/smc: determine vlan_id of stacked net_device")
Suggested-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is to match ipv4 behaviour, see __ip_sock_set_tos()
implementation.
Technically for ipv6 this might not be required because normally we
do not allow tclass to influence routing, yet the cli tooling does
support it:
lpk11:~# ip -6 rule add pref 5 tos 45 lookup 5
lpk11:~# ip -6 rule
5: from all tos 0x45 lookup 5
and in general dscp/tclass based routing does make sense.
We already have cases where dscp can affect vlan priority and/or
transmit queue (especially on wifi).
So let's just make things match. Easier to reason about and no harm.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123223208.1117871-1-zenczykowski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is to match ipv4 behaviour, see __ip_sock_set_tos()
implementation at ipv4/ip_sockglue.c:579
void __ip_sock_set_tos(struct sock *sk, int val)
{
if (sk->sk_type == SOCK_STREAM) {
val &= ~INET_ECN_MASK;
val |= inet_sk(sk)->tos & INET_ECN_MASK;
}
if (inet_sk(sk)->tos != val) {
inet_sk(sk)->tos = val;
sk->sk_priority = rt_tos2priority(val);
sk_dst_reset(sk);
}
}
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211123223154.1117794-1-zenczykowski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A CAP_NET_RAW capable process can already spoof (on transmit) anything
it desires via raw packet sockets... There is no good reason to not
allow it to also be able to play routing tricks on packets from its
own normal sockets.
There is a desire to be able to use SO_MARK for routing table selection
(via ip rule fwmark) from within a user process without having to run
it as root. Granting it CAP_NET_RAW is much less dangerous than
CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change,
while CAP_NET_ADMIN does - by for example allowing the reconfiguration
of the routing tables and/or bringing up/down devices).
Let's keep CAP_NET_ADMIN for persistent state changes,
while using CAP_NET_RAW for non-configuration related stuff.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CAP_NET_ADMIN is and should continue to be about configuring the
system as a whole, not about configuring per-socket or per-packet
parameters.
Sending and receiving raw packets is what CAP_NET_RAW is all about.
It can already send packets with any VLAN tag, and any IPv4 TOS
mark, and any IPv6 TCLASS mark, simply by virtue of building
such a raw packet. Not to mention using any protocol and source/
/destination ip address/port tuple.
These are the fields that networking gear uses to prioritize packets.
Hence, a CAP_NET_RAW process is already capable of affecting traffic
prioritization after it hits the wire. This change makes it capable
of affecting traffic prioritization even in the host at the nic and
before that in the queueing disciplines (provided skb->priority is
actually being used for prioritization, and not the TOS/TCLASS field)
Hence it makes sense to allow a CAP_NET_RAW process to set the
priority of sockets and thus packets it sends.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While testing BIG TCP patch series, I was expecting that TCP_RR workloads
with 80KB requests/answers would send one 80KB TSO packet,
then being received as a single GRO packet.
It turns out this was not happening, and the root cause was that
cubic Hystart ACK train was triggering after a few (2 or 3) rounds of RPC.
Hystart was wrongly setting CWND/SSTHRESH to 30, while my RPC
needed a budget of ~20 segments.
Ideally these TCP_RR flows should not exit slow start.
Cubic Hystart should reset itself at each round, instead of assuming
every TCP flow is a bulk one.
Note that even after this patch, Hystart can still trigger, depending
on scheduling artifacts, but at a higher CWND/SSTHRESH threshold,
keeping optimal TSO packet sizes.
Tested:
ip link set dev eth0 gro_ipv6_max_size 131072 gso_ipv6_max_size 131072
nstat -n; netperf -H ... -t TCP_RR -l 5 -- -r 80000,80000 -K cubic; nstat|egrep "Ip6InReceives|Hystart|Ip6OutRequests"
Before:
8605
Ip6InReceives 87541 0.0
Ip6OutRequests 129496 0.0
TcpExtTCPHystartTrainDetect 1 0.0
TcpExtTCPHystartTrainCwnd 30 0.0
After:
8760
Ip6InReceives 88514 0.0
Ip6OutRequests 87975 0.0
Fixes: ae27e98a51 ("[TCP] CUBIC v2.3")
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20211123202535.1843771-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited commit converted simple_strtoul() to kstrtoul() as suggested by
the former's documentation. However, it also forced all the inputs to be
decimal resulting in user space breakage.
Fix by setting the base to '0' so that the base is automatically
detected.
Before:
# ip link add name br0 type bridge vlan_filtering 1
# echo "0x88a8" > /sys/class/net/br0/bridge/vlan_protocol
bash: echo: write error: Invalid argument
After:
# ip link add name br0 type bridge vlan_filtering 1
# echo "0x88a8" > /sys/class/net/br0/bridge/vlan_protocol
# echo $?
0
Fixes: 520fbdf7fb ("net/bridge: replace simple_strtoul to kstrtol")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20211124101122.3321496-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All gro_complete() handlers are called from napi_gro_complete()
while rcu_read_lock() has been called.
There is no point stacking more rcu_read_lock()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All gro_receive() handlers are called from dev_gro_receive()
while rcu_read_lock() has been called.
There is no point stacking more rcu_read_lock()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Check for out-of-bound read was being performed at the end of while
num_reports loop, and would fill journal with false positives. Added
check to beginning of loop processing so that it doesn't get checked
after ptr has been advanced.
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Update NC-SI command handler (both standard and OEM) to take into
account of payload paddings in allocating skb (in case of payload
size is not 32-bit aligned).
The checksum field follows payload field, without taking payload
padding into account can cause checksum being truncated, leading to
dropped packets.
Fixes: fb4ee67529 ("net/ncsi: Add NCSI OEM command support")
Signed-off-by: Kumar Thangavel <thangavel.k@hcl.com>
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch inlines dccp_listen_start() and removes a stale comment in
inet_dccp_listen() so that it looks like inet_listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Richard Sailer <richard_siegfried@systemli.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commit 1295e2cf30 ("inet: minor optimization for backlog setting in
listen(2)") added change so that sk_max_ack_backlog is initialised earlier
in inet_dccp_listen() and inet_listen(). Since then, we no longer use
backlog in inet_csk_listen_start(), so let's remove it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Richard Sailer <richard_siegfried@systemli.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
.ndo_change_proto_down was added seemingly to enable out-of-tree
implementations. Over 2.5yrs later we still have no real users
upstream. Hardwire the generic implementation for now, we can
revert once real users materialize. (rocker is a test vehicle,
not a user.)
We need to drop the optimization on the sysfs side, because
unlike ndos priv_flags will be changed at runtime, so we'd
need READ_ONCE/WRITE_ONCE everywhere..
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The side that actively closed socket, it's clcsock doesn't enter
TIME_WAIT state, but the passive side does it. It should show the same
behavior as TCP sockets.
Consider this, when client actively closes the socket, the clcsock in
server enters TIME_WAIT state, which means the address is occupied and
won't be reused before TIME_WAIT dismissing. If we restarted server, the
service would be unavailable for a long time.
To solve this issue, shutdown the clcsock in [A], perform the TCP active
close progress first, before the passive closed side closing it. So that
the actively closed side enters TIME_WAIT, not the passive one.
Client | Server
close() // client actively close |
smc_release() |
smc_close_active() // PEERCLOSEWAIT1 |
smc_close_final() // abort or closed = 1|
smc_cdc_get_slot_and_msg_send() |
[A] |
|smc_cdc_msg_recv_action() // ACTIVE
| queue_work(smc_close_wq, &conn->close_work)
| smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1
| smc_close_passive_abort_received() // only in abort
|
|close() // server recv zero, close
| smc_release() // PROCESSABORT or APPCLOSEWAIT1
| smc_close_active()
| smc_close_abort() or smc_close_final() // CLOSED
| smc_cdc_get_slot_and_msg_send() // abort or closed = 1
smc_cdc_msg_recv_action() | smc_clcsock_release()
queue_work(smc_close_wq, &conn->close_work) | sock_release(tcp) // actively close clc, enter TIME_WAIT
smc_close_passive_work() // PEERCLOSEWAIT1 | smc_conn_free()
smc_close_passive_abort_received() // CLOSED|
smc_conn_free() |
smc_clcsock_release() |
sock_release(tcp) // passive close clc |
Link: https://www.spinics.net/lists/netdev/msg780407.html
Fixes: b38d732477 ("smc: socket closing and linkgroup cleanup")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There remains some variables to replace with local struct sock. So clean
them up all.
Fixes: 3163c5071f ("net/smc: use local struct sock variables consistently")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On egress side, xfrm lookup is called from __gre6_xmit() with the
fl6_gre_key field not initialized leading to policies selectors check
failure. Consequently, gre packets are sent without encryption.
On ingress side, INET6_PROTO_NOPOLICY was set, thus packets were not
checked against xfrm policies. Like for egress side, fl6_gre_key should be
correctly set, this is now done in decode_session6().
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Cc: stable@vger.kernel.org
Signed-off-by: Ghalem Boudour <ghalem.boudour@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xprtsock.c reclassifies sock locks based on the protocol.
However there are 3 protocols and only 2 classification keys.
The same key is used for both INET6 and LOCAL.
This causes lockdep complaints. The complaints started since Commit
ea9afca88b ("SUNRPC: Replace use of socket sk_callback_lock with
sock_lock") which resulted in the sock locks beings used more.
So add another key, and renumber them slightly.
Fixes: ea9afca88b ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock")
Fixes: 176e21ee2e ("SUNRPC: Support for RPC over AF_LOCAL transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a new device generic parameter to enable and disable
iWARP functionality on a multi-protocol RDMA device.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Leszek Kaliszczuk <leszek.kaliszczuk@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
the removed nexthop entries after an RCU grace period because they
contain references to the nexthop's net device and to the fib6 info.
With specific series of events[1] we can reach net device refcount
imbalance which is unrecoverable. IPv4 is not affected because dsts
don't take a refcount on the route.
[1]
$ ip nexthop list
id 200 via 2002:db8::2 dev bridge.10 scope link onlink
id 201 via 2002:db8::3 dev bridge scope link onlink
id 203 group 201/200
$ ip -6 route
2001:db8::10 nhid 203 metric 1024 pref medium
nexthop via 2002:db8::3 dev bridge weight 1 onlink
nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)
Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201
Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128
The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200
And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.
To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203
It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.
At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.
If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$
Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.
Message from syslogd@dev at Nov 19 14:04:10 ...
kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Fixes: 7bf4796dd0 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.
Fixes: 7bf4796dd0 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
Replace the existing empty member position markers "headers_start" and
"headers_end" with a struct_group(). This will allow memcpy() and sizeof()
to more easily reason about sizes, and improve readability.
"pahole" shows no size nor member offset changes to struct sk_buff.
"objdump -d" shows no object code changes (outside of WARNs affected by
source line number changes).
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/*
Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for using the struct_group() macro in struct sk_buff,
move the conditional preprocessor directives out of the region of struct
sk_buff that will be enclosed by struct_group(). While GCC and Clang are
happy with conditional preprocessor directives here, sparse is not, even
under -Wno-directive-within-macro[1], as would be seen under a C=1 build:
net/core/filter.c: note: in included file (through include/linux/netlink.h, include/linux/sock_diag.h):
./include/linux/skbuff.h:820:1: warning: directive in macro's argument list
./include/linux/skbuff.h:822:1: warning: directive in macro's argument list
./include/linux/skbuff.h:846:1: warning: directive in macro's argument list
./include/linux/skbuff.h:848:1: warning: directive in macro's argument list
Additionally remove empty macro argument definitions and usage.
"objdump -d" shows no object code differences.
[1] https://www.spinics.net/lists/linux-sparse/msg10857.html
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IPv6 module gets initialized, but it's hitting an error in inet6_init()
where it then needs to undo all the prior initialization work, it also might
do a call to ndisc_cleanup() which then calls neigh_table_clear(). In there
is a missing timer cancellation of the table's managed_work item.
The kernel test robot explicitly triggered this error path and caused a UAF
crash similar to the below:
[...]
[ 28.833183][ C0] BUG: unable to handle page fault for address: f7a43288
[ 28.833973][ C0] #PF: supervisor write access in kernel mode
[ 28.834660][ C0] #PF: error_code(0x0002) - not-present page
[ 28.835319][ C0] *pde = 06b2c067 *pte = 00000000
[ 28.835853][ C0] Oops: 0002 [#1] PREEMPT
[ 28.836367][ C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7
[ 28.837293][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[ 28.838338][ C0] EIP: __run_timers.constprop.0+0x82/0x440
[...]
[ 28.845607][ C0] Call Trace:
[ 28.845942][ C0] <SOFTIRQ>
[ 28.846333][ C0] ? check_preemption_disabled.isra.0+0x2a/0x80
[ 28.846975][ C0] ? __this_cpu_preempt_check+0x8/0xa
[ 28.847570][ C0] run_timer_softirq+0xd/0x40
[ 28.848050][ C0] __do_softirq+0xf5/0x576
[ 28.848547][ C0] ? __softirqentry_text_start+0x10/0x10
[ 28.849127][ C0] do_softirq_own_stack+0x2b/0x40
[ 28.849749][ C0] </SOFTIRQ>
[ 28.850087][ C0] irq_exit_rcu+0x7d/0xc0
[ 28.850587][ C0] common_interrupt+0x2a/0x40
[ 28.851068][ C0] asm_common_interrupt+0x119/0x120
[...]
Note that IPv6 module cannot be unloaded as per 8ce4406103 ("ipv6: do not
allow ipv6 module to be removed") hence this can only be seen during module
initialization error. Tested with kernel test robot's reproducer.
Fixes: 7482e3841d ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Possible recursive locking is detected by lockdep when SMC
falls back to TCP. The corresponding warnings are as follows:
============================================
WARNING: possible recursive locking detected
5.16.0-rc1+ #18 Tainted: G E
--------------------------------------------
wrk/1391 is trying to acquire lock:
ffff975246c8e7d8 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0x109/0x250 [smc]
but task is already holding lock:
ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&ei->socket.wq.wait);
lock(&ei->socket.wq.wait);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by wrk/1391:
#0: ffff975246040130 (sk_lock-AF_SMC){+.+.}-{0:0}, at: smc_connect+0x43/0x150 [smc]
#1: ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc]
stack backtrace:
Call Trace:
<TASK>
dump_stack_lvl+0x56/0x7b
__lock_acquire+0x951/0x11f0
lock_acquire+0x27a/0x320
? smc_switch_to_fallback+0x109/0x250 [smc]
? smc_switch_to_fallback+0xfe/0x250 [smc]
_raw_spin_lock_irq+0x3b/0x80
? smc_switch_to_fallback+0x109/0x250 [smc]
smc_switch_to_fallback+0x109/0x250 [smc]
smc_connect_fallback+0xe/0x30 [smc]
__smc_connect+0xcf/0x1090 [smc]
? mark_held_locks+0x61/0x80
? __local_bh_enable_ip+0x77/0xe0
? lockdep_hardirqs_on+0xbf/0x130
? smc_connect+0x12a/0x150 [smc]
smc_connect+0x12a/0x150 [smc]
__sys_connect+0x8a/0xc0
? syscall_enter_from_user_mode+0x20/0x70
__x64_sys_connect+0x16/0x20
do_syscall_64+0x34/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The nested locking in smc_switch_to_fallback() is considered to
possibly cause a deadlock because smc_wait->lock and clc_wait->lock
are the same type of lock. But actually it is safe so far since
there is no other place trying to obtain smc_wait->lock when
clc_wait->lock is held. So the patch replaces spin_lock() with
spin_lock_nested() to avoid false report by lockdep.
Link: https://lkml.org/lkml/2021/11/19/962
Fixes: 2153bd1e3d ("Transfer remaining wait queue entries during fallback")
Reported-by: syzbot+e979d3597f48262cb4ee@syzkaller.appspotmail.com
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that vhost vsock violates the virtio spec
by supplying the out buffer length in the used length
(should just be the in length).
As a result, attempts to validate the used length fail with:
vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0
Since vsock driver does not use the length fox tx and
validates the length before use for rx, it is safe to
suppress the validation in virtio core for this driver.
Reported-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 939779f515 ("virtio_ring: validate used buffer length")
Cc: "Jason Wang" <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_create_net() and remove_proc_entry() already contain the case
whether to define CONFIG_PROC_FS, so remove #ifdef CONFIG_PROC_FS.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'mask' bitmap is local to this function. So the non-atomic
'__set_bit()' can be used to save a few cycles.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/cold/could/
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-By: Devesh Sharma <devesh.s.sharma@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add netif_set_gso_max_segs() helper.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and
IP6SKB_REROUTED, instead of IPCB(skb)->flags and IPSKB_REROUTED
Found by code inspection, please double check that fixing this bug
does not surface other bugs.
Fixes: 09ee9dba96 ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tobias Brunner <tobias@strongswan.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Tested-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new parameters kernel_ringparam and extack for
.get_ringparam and .set_ringparam to extend more ring params
through netlink.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to set rx buf len via ethtool -G parameter and get
rx buf len via ethtool -g parameter.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for ethtool to set/get tx copybreak buf size.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On kernels before v5.15, calling read() on a unix socket after
shutdown(SHUT_RD) or shutdown(SHUT_RDWR) would return the data
previously written or EOF. But now, while read() after
shutdown(SHUT_RD) still behaves the same way, read() after
shutdown(SHUT_RDWR) always fails with -EINVAL.
This behaviour change was apparently inadvertently introduced as part of
a bug fix for a different regression caused by the commit adding sockmap
support to af_unix, commit 94531cfcbe ("af_unix: Add
unix_stream_proto for sockmap"). Those commits, for unclear reasons,
started setting the socket state to TCP_CLOSE on shutdown(SHUT_RDWR),
while this state change had previously only been done in
unix_release_sock().
Restore the original behaviour. The sockmap tests in
tests/selftests/bpf continue to pass after this patch.
Fixes: d0c6416bd7 ("unix: Fix an issue in unix_shutdown causing the other end read/write failures")
Link: https://lore.kernel.org/lkml/20211111140000.GA10779@axis.com/
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scheduling a delack in mptcp_established_options_mp() is
not a good idea: such function is called by tcp_send_ack() and
the pending delayed ack will be cleared shortly after by the
tcp_event_ack_sent() call in __tcp_transmit_skb().
Instead use the mptcp delegated action infrastructure to
schedule the delayed ack after the current bh processing completes.
Additionally moves the schedule_3rdack_retransmission() helper
into protocol.c to avoid making it visible in a different compilation
unit.
Fixes: ec3edaa7ca ("mptcp: Add handling of outgoing MP_JOIN requests")
Reviewed-by: Mat Martineau <mathew.j.martineau>@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To compute the rtx timeout schedule_3rdack_retransmission() does multiple
things in the wrong way: srtt_us is measured in usec/8 and the timeout
itself is an absolute value.
Fixes: ec3edaa7ca ("mptcp: Add handling of outgoing MP_JOIN requests")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau>@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These options also need to be set before bind, so do the sync of msk to
new ssk socket a bit earlier.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOL_IP provides a way to configure network layer attributes in a
socket. This patch adds support for IP_TOS for setsockopt(.. ,SOL_IP, ..)
Support for SOL_IP is added in mptcp_setsockopt() and IP_TOS is handled in
a private function. The idea here is to take in the value passed for IP_TOS
and set it to the current subflow, open subflows as well new subflows that
might be created after the initial call to setsockopt(). This sync is done
using sync_socket_options(.., ssk) and setting the value of tos using
__ip_sock_set_tos(ssk,..).
The patch has been tested using the packetdrill script here -
https://github.com/multipath-tcp/mptcp_net-next/issues/220#issuecomment-947863717
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/220
Signed-off-by: Poorva Sonparote <psonparo@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Making the static function __ip_sock_set_tos() from net/ipv4/ip_sockglue.c
accessible by declaring it in include/net/ip.h
The reason for doing this is to use this function to set IP_TOS value in
mptcp_setsockopt() without the lock.
Signed-off-by: Poorva Sonparote <psonparo@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since all netdev->dev_addr modifications go via dev_addr_mod()
we can put it on the list. When address is change remove it
and add it back.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev->dev_addr should only be modified via helpers,
but someone may be casting off the const. Add a runtime
check to catch abuses.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no module callers in-tree and it's hard to justify
why anyone would init or flush addresses of a netdev (note
the flush is more of a destructor, it frees netdev->dev_addr).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. We converted all users to make modifications via appropriate
helpers, make netdev->dev_addr const.
The update helpers need to upcast from the buffer to
struct netdev_hw_addr.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a sock is added to a sock map we evaluate what proto op hooks need to
be used. However, when the program is removed from the sock map we have not
been evaluating if that changes the required program layout.
Before the patch listed in the 'fixes' tag this was not causing failures
because the base program set handles all cases. Specifically, the case with
a stream parser and the case with out a stream parser are both handled. With
the fix below we identified a race when running with a proto op that attempts
to read skbs off both the stream parser and the skb->receive_queue. Namely,
that a race existed where when the stream parser is empty checking the
skb->receive_queue from recvmsg at the precies moment when the parser is
paused and the receive_queue is not empty could result in skipping the stream
parser. This may break a RX policy depending on the parser to run.
The fix tag then loads a specific proto ops that resolved this race. But, we
missed removing that proto ops recv hook when the sock is removed from the
sockmap. The result is the stream parser is stopped so no more skbs will be
aggregated there, but the hook and BPF program continues to be attached on
the psock. User space will then get an EBUSY when trying to read the socket
because the recvmsg() handler is now waiting on a stopped stream parser.
To fix we rerun the proto ops init() function which will look at the new set
of progs attached to the psock and rest the proto ops hook to the correct
handlers. And in the above case where we remove the sock from the sock map
the RX prog will no longer be listed so the proto ops is removed.
Fixes: c5d2177a72 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
When a TCP socket is added to a sock map we look at the programs attached
to the map to determine what proto op hooks need to be changed. Before
the patch in the 'fixes' tag there were only two categories -- the empty
set of programs or a TX policy. In any case the base set handled the
receive case.
After the fix we have an optimized program for receive that closes a small,
but possible, race on receive. This program is loaded only when the map the
psock is being added to includes a RX policy. Otherwise, the race is not
possible so we don't need to handle the race condition.
In order for the call to sk_psock_init() to correctly evaluate the above
conditions all progs need to be set in the psock before the call. However,
in the current code this is not the case. We end up evaluating the
requirements on the old prog state. If your psock is attached to multiple
maps -- for example a tx map and rx map -- then the second update would pull
in the correct maps. But, the other pattern with a single rx enabled map
the correct receive hooks are not used. The result is the race fixed by the
patch in the fixes tag below may still be seen in this case.
To fix we simply set all psock->progs before doing the call into
sock_map_init(). With this the init() call gets the full list of programs
and chooses the correct proto ops on the first iteration instead of
requiring the second update to pull them in. This fixes the race case when
only a single map is used.
Fixes: c5d2177a72 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add struct_group() to mark region of struct stats_reply_data that should
be initialized, which can now be done in a single memset() call.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add struct_group() to mark the region of struct iucv_sock that gets
initialized to zero. Avoid the future warning:
In function 'fortify_memset_chk',
inlined from 'iucv_sock_alloc' at net/iucv/af_iucv.c:476:2:
./include/linux/fortify-string.h:199:4: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
199 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/lkml/19ff61a0-0cda-6000-ce56-dc6b367c00d6@linux.ibm.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_after() to clear everything after the dst_entry member of
struct rt6_info.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() so memset() doesn't get confused about writing
beyond the destination member that is intended to be the starting point
of zeroing through the end of the struct.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() so memset() doesn't get confused about writing
beyond the destination member that is intended to be the starting point
of zeroing through the end of the struct.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix kernel doc comments where appropriate or remove incorrect kernel
doc indicators.
Also move kernel doc comments directly before functions.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter/IPVS fixes for net:
1) Add selftest for vrf+conntrack, from Florian Westphal.
2) Extend nfqueue selftest to cover nfqueue, also from Florian.
3) Remove duplicated include in nft_payload, from Wan Jiabing.
4) Several improvements to the nat port shadowing selftest,
from Phil Sutter.
5) Fix filtering of reply tuple in ctnetlink, from Florent Fourcot.
6) Do not override error with -EINVAL in filter setup path, also
from Florent.
7) Honor sysctl_expire_nodest_conn regardless conn_reuse_mode for
reused connections, from yangxingwu.
8) Replace snprintf() by sysfs_emit() in xt_IDLETIMER as reported
by Coccinelle, from Jing Yao.
9) Incorrect IPv6 tunnel match in flowtable offload, from Will
Mortensen.
10) Switch port shadow selftest to use socat, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make cfg80211_offchan_cac_abort() (renamed from
cfg80211_offchan_cac_event) callable in other contexts and
without so much locking restrictions, make it trigger a new
work instead of operating directly.
Do some other renames while at it to clarify.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/6145c3d0f30400a568023f67981981d24c7c6133.1635325205.git.lorenzo@kernel.org
[rewrite commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Similar to cfg80211, introduce set_radar_offchan callback in mac80211_ops
in order to configure a dedicated offchannel chain available on some hw
(e.g. mt7915) to perform offchannel CAC detection and avoid tx/rx downtime.
Tested-by: Evelyn Tsai <evelyn.tsai@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/201110606d4f3a7dfdf31440e351f2e2c375d4f0.1634979655.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback,
fix missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGWf08ACgkQMUZtbf5S
Irt+lxAAj8FAoLoSmQKUK3LttLLh0ZQQXu8Riey+wrP8Z9Yp8xWXIaVRF1c0vCE6
clbrF+mLfk6Wvv/RzOgwyBMHvK+djr/oVDNSmjlRvss4MLDfOQZhUV8V4XpvF4Up
hI7wyKfHtd7niosNqil6wklJFpLU8WyIAWrPSIPE6JlPkJmcm3GUGsliwEPwdLY1
yl7z4zsxigjA+hKxYqNQX6tixF3xnbDUbAnWshrSPL89melwz4GMao45qmcxJEVr
EipPhKifk0hT067jG08FMXcKBFKt6rKk7SVNo4mtq8Tl6HleJBj8fdaJAjSdFahB
+rYJ0sDZwGoDL5CxZ5mD3fM1cDgh4WFEM0z//0b/bZhoPDRKEpLr9LPuv+N6+/rA
8D98EHsvyNjlFgdyd8celMstiGtBn4YLEoLNYYh9Qibgm0XsCuv0yox7g0AOLMmQ
QiBmh2EnaXNPQ8PRZNMK3VH5ol2KoYWL6yrpJYV+wOWVLfezwlSsjkPSfW5pF9FG
hU0iQBp/YTCdCadR9YLj8qfDWDUAkCN7WpqIu9EA9FXJcYjJVaix0MA/tAVlzXyR
xlB7cU6O5NABcs/+04zPkKLwKbVYNMqgvKE+FVDVm+BKxo0UMxcmz/Np/ZYxfhkF
bwKplaiPb2H4D6t0sdxqaeYirPwt1BcleLilae6vHG1jO90H9Vw=
=tlqV
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, mac80211.
Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing
progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback, fix
missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to
ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks"
* tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits)
ptp: ocp: Fix a couple NULL vs IS_ERR() checks
net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()
net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound
ipv6: check return value of ipv6_skip_exthdr
e100: fix device suspend/resume
devlink: Don't throw an error if flash notification sent before devlink visible
page_pool: Revert "page_pool: disable dma mapping support..."
ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port()
octeontx2-af: debugfs: don't corrupt user memory
NFC: add NCI_UNREG flag to eliminate the race
NFC: reorder the logic in nfc_{un,}register_device
NFC: reorganize the functions in nci_request
tipc: check for null after calling kmemdup
i40e: Fix display error code in dmesg
i40e: Fix creation of first queue by omitting it if is not power of two
i40e: Fix warning message and call stack during rmmod i40e driver
i40e: Fix ping is lost after configuring ADq on VF
i40e: Fix changing previously set num_queue_pairs for PFs
i40e: Fix NULL ptr dereference on VSI filter sync
i40e: Fix correct max_pkt_size on VF RX queue
...
The statement in the switch is repeated with the statement at the
beginning of the while loop, so this statement is meaningless.
The clang_analyzer complains as follows:
net/xfrm/xfrm_policy.c:3392:2 warning:
Value stored to 'exthdr' is never read
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The definition of this variable is just to find the length of the
structure after aligning the structure. The PTR alignment function
is to optimize the size of the structure. In fact, it doesn't seem
to be of much use, because both members of the structure are of
type u32.
So I think that the definition of the variable and the
corresponding alignment can be deleted, the value of extralen can
be directly passed in the size of the structure.
The clang_analyzer complains as follows:
net/ipv6/esp6.c:117:27 warning:
Value stored to 'extra' during its initialization is never read
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In 99ce45d5e, we moved a route refcount decrement from
mctp_do_fragment_route into the caller. This invalidates the assumption
that the route test makes about refcount behaviour, so the route tests
fail.
This change fixes the test case to suit the new refcount behaviour.
Fixes: 99ce45d5e7 ("mctp: Implement extended addressing")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid
opencoding it.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yao Jing <yao.jing2@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offset value is used in pointer math on skb->data.
Since ipv6_skip_exthdr may return -1 the pointer to uh and th
may not point to the actual udp and tcp headers and potentially
overwrite other stuff. This is why I think this should be checked.
EDIT: added {}'s, thanks Kees
Signed-off-by: Jordy Zomer <jordy@pwning.systems>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlxsw driver calls to various devlink flash routines even before
users can get any access to the devlink instance itself. For example,
mlxsw_core_fw_rev_validate() one of such functions.
__mlxsw_core_bus_device_register
-> mlxsw_core_fw_rev_validate
-> mlxsw_core_fw_flash
-> mlxfw_firmware_flash
-> mlxfw_status_notify
-> devlink_flash_update_status_notify
-> __devlink_flash_update_notify
-> WARN_ON(...)
It causes to the WARN_ON to trigger warning about devlink not registered.
Fixes: cf53021740 ("devlink: Notify users when objects are accessible")
Reported-by: Danielle Ratson <danieller@nvidia.com>
Tested-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d00e60ee54.
As reported by Guillaume in [1]:
Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT
in 32-bit systems, which breaks the bootup proceess when a
ethernet driver is using page pool with PP_FLAG_DMA_MAP flag.
As we were hoping we had no active consumers for such system
when we removed the dma mapping support, and LPAE seems like
a common feature for 32 bits system, so revert it.
1. https://www.spinics.net/lists/netdev/msg779890.html
Fixes: d00e60ee54 ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Tested-by: "kernelci.org bot" <bot@kernelci.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to inet v4 raw sockets for binding to nonlocal addresses
through the IP_FREEBIND and IP_TRANSPARENT socket options, as well as
the ipv4.ip_nonlocal_bind kernel parameter.
Add helper function to inet_sock.h to check for bind address validity on
the base of the address type and whether nonlocal address are enabled
for the socket via any of the sockopts/sysctl, deduplicating checks in
ipv4/ping.c, ipv4/af_inet.c, ipv6/af_inet6.c (for mapped v4->v6
addresses), and ipv4/raw.c.
Add test cases with IP[V6]_FREEBIND verifying that both v4 and v6 raw
sockets support binding to nonlocal addresses after the change. Add
necessary support for the test cases to nettest.
Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211117090010.125393-1-pbl@bestov.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two sites that calls queue_work() after the
destroy_workqueue() and lead to possible UAF.
The first site is nci_send_cmd(), which can happen after the
nci_close_device as below
nfcmrvl_nci_unregister_dev | nfc_genl_dev_up
nci_close_device |
flush_workqueue |
del_timer_sync |
nci_unregister_device | nfc_get_device
destroy_workqueue | nfc_dev_up
nfc_unregister_device | nci_dev_up
device_del | nci_open_device
| __nci_request
| nci_send_cmd
| queue_work !!!
Another site is nci_cmd_timer, awaked by the nci_cmd_work from the
nci_send_cmd.
... | ...
nci_unregister_device | queue_work
destroy_workqueue |
nfc_unregister_device | ...
device_del | nci_cmd_work
| mod_timer
| ...
| nci_cmd_timer
| queue_work !!!
For the above two UAF, the root cause is that the nfc_dev_up can race
between the nci_unregister_device routine. Therefore, this patch
introduce NCI_UNREG flag to easily eliminate the possible race. In
addition, the mutex_lock in nci_close_device can act as a barrier.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211116152732.19238-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a potential UAF between the unregistration routine and the NFC
netlink operations.
The race that cause that UAF can be shown as below:
(FREE) | (USE)
nfcmrvl_nci_unregister_dev | nfc_genl_dev_up
nci_close_device |
nci_unregister_device | nfc_get_device
nfc_unregister_device | nfc_dev_up
rfkill_destory |
device_del | rfkill_blocked
... | ...
The root cause for this race is concluded below:
1. The rfkill_blocked (USE) in nfc_dev_up is supposed to be placed after
the device_is_registered check.
2. Since the netlink operations are possible just after the device_add
in nfc_register_device, the nfc_dev_up() can happen anywhere during the
rfkill creation process, which leads to data race.
This patch reorder these actions to permit
1. Once device_del is finished, the nfc_dev_up cannot dereference the
rfkill object.
2. The rfkill_register need to be placed after the device_add of nfc_dev
because the parent device need to be created first. So this patch keeps
the order but inject device_lock to prevent the data race.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: be055b2f89 ("NFC: RFKILL support")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211116152652.19217-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a possible data race as shown below:
thread-A in nci_request() | thread-B in nci_close_device()
| mutex_lock(&ndev->req_lock);
test_bit(NCI_UP, &ndev->flags); |
... | test_and_clear_bit(NCI_UP, &ndev->flags)
mutex_lock(&ndev->req_lock); |
|
This race will allow __nci_request() to be awaked while the device is
getting removed.
Similar to commit e2cb6b891a ("bluetooth: eliminate the potential race
condition when removing the HCI controller"). this patch alters the
function sequence in nci_request() to prevent the data races between the
nci_close_device().
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Link: https://lore.kernel.org/r/20211115145600.8320-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no reason for stopping all TX queues from dev_watchdog()
Not only this stops feeding the NIC, it also migrates all qdiscs
to be serviced on the cpu calling netif_tx_unlock(), causing
a potential latency artifact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are not fast path, there is no point in inlining them.
Also provide netif_freeze_queues()/netif_unfreeze_queues()
so that we can use them from dev_watchdog() in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In following patches, dev_watchdog() will no longer stop all queues.
It will read queue->trans_start locklessly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tx_timeout_show() assumed dev_watchdog() would stop all
the queues, to fetch queue->trans_timeout under protection
of the queue->_xmit_lock.
As we want to no longer disrupt transmits, we use an
atomic_long_t instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: david decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without dropping dst, the packets sent from local mirred/redirected
to ingress will may still use the old dst. ip_rcv() will drop it as
the old dst is for output and its .input is dst_discard.
This patch is to fix by also dropping dst for those packets that are
mirred or redirected from egress to ingress in act_mirred.
Note that we don't drop it for the direction change from ingress to
egress, as on which there might be a user case attaching a metadata
dst by act_tunnel_key that would be used later.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
siphash keys use 16 bytes.
Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* bad dont-reorder check
* throughput LED trigger for various new(ish) paths
* radiotap header generation
* locking assertions in mac80211 with monitor mode
* radio statistics
* don't try to access IV when not present
* call stop_ap for P2P_GO as well as we should
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmGT1iIACgkQB8qZga/f
l8QDhA/+JeKZeXQ+Ct8kh34xMN4VxCUeGlJNY1aH2mTlJFSwxqJ4XH7eCiSTezXB
ZEZ46wDLOpUF9epGl4NLzFkR5+Sanpn3LJOMLBDYQwkl3xcvNiQEZ0y4T8PayJRR
kSs9ina3zxvZvBjCYX1KKdFatZiE+X5qwYyaK8T4ZjNQ4vjQpeX/V3eFpFeMDjuX
fzIUN27yiESZG4heB2BkyeTR/jb4hG/gCbdGfNxa9MbF0OGJMJkuwH9MNHRQR2cx
NvEPoq2IcooZfbqWnVk9lxWzRpm8Buio+gth9Knd13AGDl4Mez4+n0KxAEhafjhE
ljUYekjh+55iSkt7eczW7wjvxmdGC+lfGN7GUop9mw4LU6NvpFCKXW5MqaH8XBy4
UIkcblXsMKkD8sFbMGzNIl7pT4JGdXkxlorxg7xI5s02nMFppnWJ7mrXQmcBkBG7
5JLkaXahiQzQYq/VGepCPMYYwKQqbHXts9TPMq/W3aIeR8QxA3uINZ5P9eGF9qn8
xTcvEQPYvIid1BIorO5/0V/VgE9VNDP1w8aKvn8bJjOmbWfuE68Wx+K6LlHy+E+O
/MhjiRwnCvvTfdi6vPUjTxis8Xh5e9xfYmcqyRVIQujLmhta31FEJ/q+R+A1A4Rs
RsesqpLgEOo3rkSciAAfItmRf3SVgJlIuZcfe16XFw0s45oYEck=
=aD/1
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Couple of fixes:
* bad dont-reorder check
* throughput LED trigger for various new(ish) paths
* radiotap header generation
* locking assertions in mac80211 with monitor mode
* radio statistics
* don't try to access IV when not present
* call stop_ap for P2P_GO as well as we should
* tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
mac80211: fix throughput LED trigger
mac80211: fix monitor_sdata RCU/locking assertions
mac80211: drop check for DONT_REORDER in __ieee80211_select_queue
mac80211: fix radiotap header generation
mac80211: do not access the IV when it was stripped
nl80211: fix radio statistics in survey dump
cfg80211: call cfg80211_stop_ap when switch from P2P_GO type
====================
Link: https://lore.kernel.org/r/20211116160845.157214-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2021-11-16
We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).
The main changes are:
1) Fix pruning regression where verifier went overly conservative rejecting
previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.
2) Fix verifier TOCTOU bug when using read-only map's values as constant
scalars during verification, from Daniel Borkmann.
3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.
4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.
5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
programs due to deadlock possibilities, from Dmitrii Banshchikov.
6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.
7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.
8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
udp: Validate checksum in udp_read_sock()
bpf: Fix toctou on read-only map's constant scalar tracking
samples/bpf: Fix build error due to -isystem removal
selftests/bpf: Add tests for restricted helpers
bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
libbpf: Perform map fd cleanup for gen_loader in case of error
samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
tools/runqslower: Fix cross-build
samples/bpf: Fix summary per-sec stats in xdp_sample_user
selftests/bpf: Check map in map pruning
bpf: Fix inner map state pruning regression.
xsk: Fix crash on double free in buffer pool
====================
Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We should clear the flag if the adv instance removed due to receiving
this error status is the last one we have.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This event is received when the controller stops advertising,
specifically for these three reasons:
(a) Connection is successfully created (success).
(b) Timeout is reached (error).
(c) Number of advertising events is reached (error).
(*) This event is NOT generated when the host stops the advertisement.
Refer to the BT spec ver 5.3 vol 4 part E sec 7.7.65.18. Note that the
section was revised from BT spec ver 5.0 vol 2 part E sec 7.7.65.18
which was ambiguous about (*).
Some chips (e.g. RTL8822CE) send this event when the host stops the
advertisement with status = HCI_ERROR_CANCELLED_BY_HOST (due to (*)
above). This is treated as an error and the advertisement will be
removed and userspace will be informed via MGMT event.
On suspend, we are supposed to temporarily disable advertisements,
and continue advertising on resume. However, due to the behavior
above, the advertisements are removed instead.
This patch returns early if HCI_ERROR_CANCELLED_BY_HOST is received.
Btmon snippet of the unexpected behavior:
@ MGMT Command: Remove Advertising (0x003f) plen 1
Instance: 1
< HCI Command: LE Set Extended Advertising Enable (0x08|0x0039) plen 6
Extended advertising: Disabled (0x00)
Number of sets: 1 (0x01)
Entry 0
Handle: 0x01
Duration: 0 ms (0x00)
Max ext adv events: 0
> HCI Event: LE Meta Event (0x3e) plen 6
LE Advertising Set Terminated (0x12)
Status: Operation Cancelled by Host (0x44)
Handle: 1
Connection handle: 0
Number of completed extended advertising events: 5
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Advertising Enable (0x08|0x0039) ncmd 2
Status: Success (0x00)
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This work is no longer necessary since all the code using it has been
converted to use hci_passive_scan/hci_passive_scan_sync.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This makes MGMT_OP_SET_CONNEABLE use hci_cmd_sync_queue instead of
use a dedicated connetable_update work.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This makes MGMT_OP_SET_DISCOVERABLE use hci_cmd_sync_queue instead of
use a dedicated discoverable_update work.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This is distracting really, let's make this simpler,
because many callers had to take care of this
by themselves, even if on x86 this adds more
code than really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net->core.sock_inuse is a per cpu variable (int),
while net->core.prot_inuse is another per cpu variable
of 64 integers.
per cpu allocator tend to place them in very different places.
Grouping them together makes sense, since it makes
updates potentially faster, if hitting the same
cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MPTCP hard codes it, let us instead provide this helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_prot_inuse_add() is very small, we can inline it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.
gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper is used once, no need to keep it in fat net/core/skbuff.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under pressure, tcp recvmsg() has logic to process the socket backlog,
but calls tcp_cleanup_rbuf() right before.
Avoiding sending ACK right before processing new segments makes
a lot of sense, as this decrease the number of ACK packets,
with no impact on effective ACK clocking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Testing timeo before sk_err/sk_state/sk_shutdown makes more sense.
Modern applications use non-blocking IO, while a socket is terminated
only once during its life time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP uses sk_eat_skb() when skbs can be removed from receive queue.
However, the call to skb_orphan() from __kfree_skb() incurs
an indirect call so sock_rfee(), which is more expensive than
a direct call, especially for CONFIG_RETPOLINE=y.
Add tcp_eat_recv_skb() function to make the call before
__kfree_skb().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock
owned.
Also, it is faster to first check tp->urg_data in tcp_poll(),
then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
located in a different/cold cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_segs_in() can be called from BH, while socket spinlock
is held but socket owned by user, eventually reading these
fields from tcp_get_info()
Found by code inspection, no need to backport this patch
to older kernels.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reading large chunks of data, incoming packets might
be added to the backlog from BH.
tcp recvmsg() detects the backlog queue is not empty, and uses
a release_sock()/lock_sock() pair to process this backlog.
We now have __sk_flush_backlog() to perform this
a bit faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_memory_allocated and tcp_sockets_allocated often share
a common cache line, source of false sharing.
Also take care of udp_memory_allocated and mptcp_sockets_allocated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We were only using one bit, and we can replace it by sk_is_tcp()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move sk_is_tcp() to include/net/sock.h and use it where we can.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TCP flows, inet6_sk(sk)->saddr has the same value
than sk->sk_v6_rcv_saddr.
Using sk->sk_v6_rcv_saddr increases data locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values
are not used. Defer their access to the point we need them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Coverity Scan report:
[...]
*** CID 1493985: Uninitialized variables (UNINIT)
/net/bluetooth/hci_event.c: 4535 in hci_sync_conn_complete_evt()
4529
4530 /* Notify only in case of SCO over HCI transport data path which
4531 * is zero and non-zero value shall be non-HCI transport data path
4532 */
4533 if (conn->codec.data_path == 0) {
4534 if (hdev->notify)
>>> CID 1493985: Uninitialized variables (UNINIT)
>>> Using uninitialized value "notify_evt" when calling "*hdev->notify".
4535 hdev->notify(hdev, notify_evt);
4536 }
4537
4538 hci_connect_cfm(conn, ev->status);
4539 if (ev->status)
4540 hci_conn_del(conn);
[...]
Although only btusb uses air_mode, and he only handles HCI_NOTIFY_ENABLE_SCO_CVSD
and HCI_NOTIFY_ENABLE_SCO_TRANSP, there is still a very small chance that
ev->air_mode is not equal to 0x2 and 0x3, but notify_evt is initialized to
HCI_NOTIFY_ENABLE_SCO_CVSD or HCI_NOTIFY_ENABLE_SCO_TRANSP. the context is
maybe not correct.
Let us directly use the required function instead of re-initializing it,
so as to restore the original logic and make the code more correct.
Addresses-Coverity: ("Uninitialized variables")
Fixes: f4f9fa0c07 ("Bluetooth: Allow usb to auto-suspend when SCO use non-HCI transport")
Suggested-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Syzbot reported slab-out-of-bounds read in hci_le_adv_report_evt(). The
problem was in missing validaion check.
We should check if data is not malicious and we can read next data block.
If we won't check ptr validness, code can read a way beyond skb->end and
it can cause problems, of course.
Fixes: e95beb4141 ("Bluetooth: hci_le_adv_report_evt code refactoring")
Reported-and-tested-by: syzbot+e3fcb9c4f3c2a931dc40@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
It turns out the skb's in sock receive queue could have bad checksums, as
both ->poll() and ->recvmsg() validate checksums. We have to do the same
for ->read_sock() path too before they are redirected in sockmap.
Fixes: d7f571188e ("udp: Implement ->read_sock() for sockmap")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211115044006.26068-1-xiyou.wangcong@gmail.com
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-11-15
We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).
The main changes are:
1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.
2) Big batch of libbpf improvements including various fixes, future proofing APIs,
and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.
3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.
4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
ensure exception handling still works, from Russell King and Alan Maguire.
5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
file such as shared library, from Song Liu.
6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
updating documentation and bash-completion among others, from Quentin Monnet.
7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.
8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.
9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpftool: Use libbpf_get_error() to check error
bpftool: Fix mixed indentation in documentation
bpftool: Update the lists of names for maps and prog-attach types
bpftool: Fix indent in option lists in the documentation
bpftool: Remove inclusion of utilities.mak from Makefiles
bpftool: Fix memory leak in prog_dump()
selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
selftests/bpf: Fix an unused-but-set-variable compiler warning
bpf: Introduce btf_tracing_ids
bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
bpftool: Enable libbpf's strict mode by default
docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
selftests/bpf: Clarify llvm dependency with btf_tag selftest
selftests/bpf: Add a C test for btf_type_tag
selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
selftests/bpf: Test libbpf API function btf__add_type_tag()
bpftool: Support BTF_KIND_TYPE_TAG
libbpf: Support BTF_KIND_TYPE_TAG
...
====================
Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The link_id is supposed to be unique, but smcr_next_link_id() doesn't
skip the used link_id as expected. So the patch fixes this.
Fixes: 026c381fb4 ("net/smc: introduce link_idx for link group array")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_clone_lock() needs to call sock_inuse_add(1) before entering the
sk_free_unlock_clone() error path, for __sk_free() from sk_free() from
sk_free_unlock_clone() calls sock_inuse_add(-1).
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 648845ab7e ("sock: Move the socket inuse to namespace.")
Signed-off-by: David S. Miller <davem@davemloft.net>
The MSG_CRYPTO msgs are always encrypted and sent to other nodes
for keys' deployment. But when receiving in peers, if those nodes
do not validate it and make sure it's encrypted, one could craft
a malicious MSG_CRYPTO msg to deploy its key with no need to know
other nodes' keys.
This patch is to do that by checking TIPC_SKB_CB(skb)->decrypted
and discard it if this packet never got decrypted.
Note that this is also a supplementary fix to CVE-2021-43267 that
can be triggered by an unencrypted malicious MSG_CRYPTO msg.
Fixes: 1ef6f7c939 ("tipc: add automatic session key exchange")
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When kmemdup called failed and register_net_sysctl return NULL, should
return ENOMEM instead of ENOBUFS
Signed-off-by: liuguoqiang <liuguoqiang@uniontech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'inuse' bitmap is local to this function. So we can use the
non-atomic '__set_bit()' to save a few cycles.
While at it, also remove some useless {}.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SMC fallback is incomplete currently. There may be some
wait queue entries remaining in smc socket->wq, which should
be removed to clcsocket->wq during the fallback.
For example, in nginx/wrk benchmark, this issue causes an
all-zeros test result:
server: nginx -g 'daemon off;'
client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html
Running 5s test @ http://11.200.15.93/index.html
1 threads and 1 connections
Thread Stats Avg Stdev Max ± Stdev
Latency 0.00us 0.00us 0.00us -nan%
Req/Sec 0.00 0.00 0.00 -nan%
0 requests in 5.00s, 0.00B read
Requests/sec: 0.00
Transfer/sec: 0.00B
The reason for this all-zeros result is that when wrk used SMC
to replace TCP, it added an eppoll_entry into smc socket->wq
and expected to be notified if epoll events like EPOLL_IN/
EPOLL_OUT occurred on the smc socket.
However, once a fallback occurred, wrk switches to use clcsocket.
Now it is clcsocket->wq instead of smc socket->wq which will
be woken up. The eppoll_entry remaining in smc socket->wq does
not work anymore and wrk stops the test.
This patch fixes this issue by removing remaining wait queue
entries from smc socket->wq to clcsocket->wq during the fallback.
Link: https://www.spinics.net/lists/netdev/msg779769.html
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed comments to match description with variable names and
refactored code to match the convention as per [1].
To match the convention mapping is done as follows:
State 3 - LOST_IN_BURST_PERIOD
State 4 - LOST_IN_GAP_PERIOD
[1] S. Salsano, F. Ludovici, A. Ordine, "Definition of a general
and intuitive loss model for packet networks and its implementation
in the Netem module in the Linux kernel"
Fixes: a6e2fe17eb ("sch_netem: replace magic numbers with enumerate")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some functions, like tipc_crypto_start use inconsisten GFP flags
when allocating memory. The mentioned function use GFP_ATOMIC to
to alloc a crypto instance, and then calls alloc_ordered_workqueue()
which allocates memory with GFP_KERNEL. tipc_aead_init() function
even uses GFP_KERNEL and GFP_ATOMIC interchangeably.
No doc comment specifies what context a function is designed to
work in, but the flags should at least be consistent within a function.
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: tipc-discussion@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, creating a batman-adv interface in an unprivileged LXD
container and attaching secondary interfaces to it with "ip" or "batctl"
works fine. However all batctl debug and configuration commands
fail:
root@container:~# batctl originators
Error received: Operation not permitted
root@container:~# batctl orig_interval
1000
root@container:~# batctl orig_interval 2000
root@container:~# batctl orig_interval
1000
To fix this change the generic netlink permissions from GENL_ADMIN_PERM
to GENL_UNS_ADMIN_PERM. This way a batman-adv interface is fully
maintainable as root from within a user namespace, from an unprivileged
container.
All except one batman-adv netlink setting are per interface and do not
leak information or change settings from the host system and are
therefore save to retrieve or modify as root from within an unprivileged
container.
"batctl routing_algo" / BATADV_CMD_GET_ROUTING_ALGOS is the only
exception: It provides the batman-adv kernel module wide default routing
algorithm. However it is read-only from netlink and an unprivileged
container is still not allowed to modify
/sys/module/batman_adv/parameters/routing_algo. Instead it is advised to
use the newly introduced "batctl if create routing_algo RA_NAME" /
IFLA_BATADV_ALGO_NAME to set the routing algorithm on interface
creation, which already works fine in an unprivileged container.
Cc: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This version will contain all the (major or even only minor) changes for
Linux 5.17.
The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The codepaths for rx with decap offload and tx with itxq were not updating
the counters for the throughput led trigger.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20211113063415.55147-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since commit a05829a722 ("cfg80211: avoid holding the RTNL when
calling the driver") we've not only been protecting the pointer
to monitor_sdata with the RTNL, but also with the wiphy->mtx. This
is relevant in a number of lockdep assertions, e.g. the one we hit
in ieee80211_set_monitor_channel(). However, we're now protecting
all the assignments/dereferences, even the one in interface iter,
with the wiphy->mtx, so switch over the lockdep assertions to that
lock.
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20211112135143.cb8e8ceffef3.Iaa210f16f6904c8a7a24954fb3396da0ef86ec08@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When __ieee80211_select_queue is called, skb->cb has not been cleared yet,
which means that info->control.flags can contain garbage.
In some cases this leads to IEEE80211_TX_CTRL_DONT_REORDER being set, causing
packets marked for other queues to randomly end up in BE instead.
This flag only needs to be checked in ieee80211_select_queue_80211, since
the radiotap parser is the only piece of code that sets it
Fixes: 66d06c8473 ("mac80211: adhere to Tx control flag that prevents frame reordering")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20211110212201.35452-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In commit 8c89f7b3d3 ("mac80211: Use flex-array for radiotap header
bitmap") we accidentally pointed the position to the wrong place, so
we overwrite a present bitmap, and thus cause all kinds of trouble.
To see the issue, note that the previous code read:
pos = (void *)(it_present + 1);
The requirement now is that we need to calculate pos via it_optional,
to not trigger the compiler hardening checks, as:
pos = (void *)&rthdr->it_optional[...];
Rewriting the original expression, we get (obviously, since that just
adds "+ x - x" terms):
pos = (void *)(it_present + 1 + rthdr->it_optional - rthdr->it_optional)
and moving the "+ rthdr->it_optional" outside to be used as an array:
pos = (void *)&rthdr->it_optional[it_present + 1 - rthdr->it_optional];
The original is off by one, fix it.
Cc: stable@vger.kernel.org
Fixes: 8c89f7b3d3 ("mac80211: Use flex-array for radiotap header bitmap")
Reported-by: Sid Hayn <sidhayn@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Sid Hayn <sidhayn@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211109100203.c61007433ed6.I1dade57aba7de9c4f48d68249adbae62636fd98c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_get_keyid() will return false value if IV has been stripped,
such as return 0 for IP/ARP frames due to LLC header, and return -EINVAL
for disassociation frames due to its length... etc. Don't try to access
it if it's not present.
Signed-off-by: Xing Song <xing.song@mediatek.com>
Link: https://lore.kernel.org/r/20211101024657.143026-1-xing.song@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Even if userspace specifies the NL80211_ATTR_SURVEY_RADIO_STATS
attribute, we cannot get the statistics because we're not really
parsing the incoming attributes properly any more.
Fix this by passing the attrbuf to nl80211_prepare_wdev_dump()
and filling it there, if given, and using a local version only
if no output is desired.
Since I'm touching it anyway, make nl80211_prepare_wdev_dump()
static.
Fixes: 50508d941c ("cfg80211: use parallel_ops for genl")
Reported-by: Jan Fuchs <jf@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20211029092539.2851b4799386.If9736d4575ee79420cbec1bd930181e1d53c7317@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the userspace tools switch from NL80211_IFTYPE_P2P_GO to
NL80211_IFTYPE_ADHOC via send_msg(NL80211_CMD_SET_INTERFACE), it
does not call the cleanup cfg80211_stop_ap(), this leads to the
initialization of in-use data. For example, this path re-init the
sdata->assigned_chanctx_list while it is still an element of
assigned_vifs list, and makes that linked list corrupt.
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Reported-by: syzbot+bbf402b783eeb6d908db@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20211027173722.777287-1-phind.uet@gmail.com
Cc: stable@vger.kernel.org
Fixes: ac800140c2 ("cfg80211: .stop_ap when interface is going down")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch reverts two prior patches, e7310c9402
("security: implement sctp_assoc_established hook in selinux") and
7c2ef0240e ("security: add sctp_assoc_established hook"), which
create the security_sctp_assoc_established() LSM hook and provide a
SELinux implementation. Unfortunately these two patches were merged
without proper review (the Reviewed-by and Tested-by tags from
Richard Haines were for previous revisions of these patches that
were significantly different) and there are outstanding objections
from the SELinux maintainers regarding these patches.
Work is currently ongoing to correct the problems identified in the
reverted patches, as well as others that have come up during review,
but it is unclear at this point in time when that work will be ready
for inclusion in the mainline kernel. In the interest of not keeping
objectionable code in the kernel for multiple weeks, and potentially
a kernel release, we are reverting the two problematic patches.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This statement is repeated with the initialization statement
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
there is a same action when the variable is initialized
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The assignment in the if statement will be overwritten by the
following statement
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
in 5.7 are now enabled by default. This should greatly speed up things
like rm, tar and rsync. To opt out, wsync mount option can be used.
Other than that we have a pile of bug fixes all across the filesystem
from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from
Luis.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmGORY8THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5XmB/0SRQTW+BRAhYvSD/Ib2/c6uVZGnmJU
MUCO/uuD4fZvfxyMVb3qnAzHIGh3YlFWa/dgSZNStvLmY0L9P0MIvyaYolVuC4Tu
7llX1I+yckTns9VmiULBNy9D812eRY282nMbRzikMGPO1eb6Yqo3r50AdTvkam/R
Qs5pfwRqLerbP7VUv4vSsrBflwVyHOrCFZaUUVTu4f2kKz/FzZd/FCw5VV51KYzq
ygN+8eFotf5zluMX0tl0FXVgJ13N9vbg+YNxyOzHmOAV0AhSmmtV8vJ+j+m7+sOj
b7wNl5AuI5Ogeg8PHSGsXOHBVn6IbhdGDcougEVL+1gHkxxOQ5hfBHWQ
=22Vd
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"One notable change here is that async creates and unlinks introduced
in 5.7 are now enabled by default. This should greatly speed up things
like rm, tar and rsync. To opt out, wsync mount option can be used.
Other than that we have a pile of bug fixes all across the filesystem
from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from
Luis"
* tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client:
ceph: add a new metric to keep track of remote object copies
libceph, ceph: move ceph_osdc_copy_from() into cephfs code
ceph: clean-up metrics data structures to reduce code duplication
ceph: split 'metric' debugfs file into several files
ceph: return the real size read when it hits EOF
ceph: properly handle statfs on multifs setups
ceph: shut down mount on bad mdsmap or fsmap decode
ceph: fix mdsmap decode when there are MDS's beyond max_mds
ceph: ignore the truncate when size won't change with Fx caps issued
ceph: don't rely on error_string to validate blocklisted session.
ceph: just use ci->i_version for fscache aux info
ceph: shut down access to inode when async create fails
ceph: refactor remove_session_caps_cb
ceph: fix auth cap handling logic in remove_session_caps_cb
ceph: drop private list from remove_session_caps_cb
ceph: don't use -ESTALE as special return code in try_get_cap_refs
ceph: print inode numbers instead of pointer values
ceph: enable async dirops by default
libceph: drop ->monmap and err initialization
ceph: convert to noop_direct_IO
TCP Receive zerocopy iterates through the SKB queue via
tcp_recv_skb(), acquiring a pointer to an SKB and an offset within
that SKB to read from. From there, it iterates the SKB frags array to
determine which offset to start remapping pages from.
However, this is built on the assumption that the offset read so far
within the SKB is smaller than the SKB length. If this assumption is
violated, we can attempt to read an invalid frags array element, which
would cause a fault.
tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN
flag is set. Therefore, we must guard against this occurrence inside
skb_advance_frag().
One way that we can reproduce this error follows:
1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with:
char some_array[32 * 1024];
struct tcp_zerocopy_receive zc = {
.copybuf_address = (__u64) &some_array[0],
.copybuf_len = 32 * 1024,
};
2) In a sender program, after a TCP handshake, send the following
sequence of packets:
i) Seq = [X, X+4000]
ii) Seq = [X+4000, X+5000]
iii) Seq = [X+4000, X+5000], Flags = FIN | URG, urgptr=1000
(This can happen without URG, if we have a signal pending, but URG is
a convenient way to reproduce the behaviour).
In this case, the following event sequence will occur on the receiver:
tcp_zerocopy_receive():
-> receive_fallback_to_copy() // copybuf_len >= inq
-> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG
-> tcp_recv_skb() // yields skb with skb->len == offset
-> tcp_zerocopy_set_hint_for_skb()
-> skb_advance_to_frag() // will returns a frags ptr. >= nr_frags
-> find_next_mappable_frag() // will dereference this bad frags ptr.
With this patch, skb_advance_to_frag() will no longer return an
invalid frags pointer, and will return NULL instead, fixing the issue.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 05255b823a ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch reverts two prior patches, e7310c9402
("security: implement sctp_assoc_established hook in selinux") and
7c2ef0240e ("security: add sctp_assoc_established hook"), which
create the security_sctp_assoc_established() LSM hook and provide a
SELinux implementation. Unfortunately these two patches were merged
without proper review (the Reviewed-by and Tested-by tags from
Richard Haines were for previous revisions of these patches that
were significantly different) and there are outstanding objections
from the SELinux maintainers regarding these patches.
Work is currently ongoing to correct the problems identified in the
reverted patches, as well as others that have come up during review,
but it is unclear at this point in time when that work will be ready
for inclusion in the mainline kernel. In the interest of not keeping
objectionable code in the kernel for multiple weeks, and potentially
a kernel release, we are reverting the two problematic patches.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Fix a crash in the buffer pool allocator when a buffer is double
freed. It is possible to trigger this behavior not only from a faulty
driver, but also from user space like this: Create a zero-copy AF_XDP
socket. Load an XDP program that will issue XDP_DROP for all
packets. Put the same umem buffer into the fill ring multiple times,
then bind the socket and send some traffic. This will crash the kernel
as the XDP_DROP action triggers one call to xsk_buff_free()/xp_free()
for every packet dropped. Each call will add the corresponding buffer
entry to the free_list and increase the free_list_cnt. Some entries
will have been added multiple times due to the same buffer being
freed. The buffer allocation code will then traverse this broken list
and since the same buffer is in the list multiple times, it will try
to delete the same buffer twice from the list leading to a crash.
The fix for this is just to test that the buffer has not been added
before in xp_free(). If it has been, just return from the function and
do not put it in the free_list a second time.
Note that this bug was not present in the code before the commit
referenced in the Fixes tag. That code used one list entry per
allocated buffer, so multiple frees did not have any side effects. But
the commit below optimized the usage of the pool and only uses a
single entry per buffer in the umem, meaning that multiple
allocations/frees of the same buffer will also only use one entry,
thus leading to the problem.
Fixes: 47e4075df3 ("xsk: Batched buffer allocation for the pool")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20211111075707.21922-1-magnus.karlsson@gmail.com
and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different
from the tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit operations
to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries
to access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S
IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf
fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2
5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL
xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ
cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK
AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV
DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8
JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T
mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH
dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4
KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY=
=Ttgq
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different from the
tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the
workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit
operations to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs"
* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
selftests/net: udpgso_bench_rx: fix port argument
net: wwan: iosm: fix compilation warning
cxgb4: fix eeprom len when diagnostics not implemented
net: fix premature exit from NAPI state polling in napi_disable()
net/smc: fix sk_refcnt underflow on linkdown and fallback
net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
gve: fix unmatched u64_stats_update_end()
net: ethernet: lantiq_etop: Fix compilation error
selftests: forwarding: Fix packet matching in mirroring selftests
vsock: prevent unnecessary refcnt inc for nonblocking connect
net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
net: stmmac: allow a tc-taprio base-time of zero
selftests: net: test_vxlan_under_vrf: fix HV connectivity test
net: hns3: allow configure ETS bandwidth of all TCs
net: hns3: remove check VF uc mac exist when set by PF
net: hns3: fix some mac statistics is always 0 in device version V2
net: hns3: fix kernel crash when unload VF while it is being reset
net: hns3: sync rx ring head in echo common pull
net: hns3: fix pfc packet number incorrect after querying pfc parameters
...
Commit 719c571970 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.
Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an uninitialized variable, rather than straight to
another round of the state check.
Error path looks like:
napi_disable():
unsigned long val, new; /* new is uninitialized */
do {
val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
NAPIF_STATE_SCHED is set */
if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
usleep_range(20, 200);
continue; /* go straight to the condition check */
}
new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
writes garbage */
napi_enable():
do {
val = READ_ONCE(n->state);
BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>
while the typical BUG splat is like:
[ 172.652461] ------------[ cut here ]------------
[ 172.652462] kernel BUG at net/core/dev.c:6937!
[ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42
[ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[ 172.782403] Call Trace:
[ 172.784857] <TASK>
[ 172.786963] ice_up_complete+0x6f/0x210 [ice]
[ 172.791349] ice_xdp+0x136/0x320 [ice]
[ 172.795108] ? ice_change_mtu+0x180/0x180 [ice]
[ 172.799648] dev_xdp_install+0x61/0xe0
[ 172.803401] dev_xdp_attach+0x1e0/0x550
[ 172.807240] dev_change_xdp_fd+0x1e6/0x220
[ 172.811338] do_setlink+0xee8/0x1010
[ 172.814917] rtnl_setlink+0xe5/0x170
[ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[ 172.823732] ? security_capable+0x36/0x50
< snip >
Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.
From v1 [0]:
- just use a for-loop to simplify both the fix and the existing
code (Eric).
[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
Fixes: 719c571970 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
6aiufVWKNNRQef9y
=yKBl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck"
* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd4: remove obselete comment
nfsd: document server-to-server-copy parameters
NFSD:fix boolreturn.cocci warning
nfsd: update create verifier comment
SUNRPC: Change return value type of .pc_encode
SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
NFSD: Save location of NFSv4 COMPOUND status
SUNRPC: Change return value type of .pc_decode
SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
SUNRPC: De-duplicate .pc_release() call sites
SUNRPC: Simplify the SVC dispatch code path
SUNRPC: Capture value of xdr_buf::page_base
SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
svcrdma: Split svcrmda_wc_{read,write} tracepoints
svcrdma: Split the svcrdma_wc_send() tracepoint
svcrdma: Split the svcrdma_wc_receive() tracepoint
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
NFSD: Initialize pointer ni with NULL and not plain integer 0
NFSD: simplify struct nfsfh
...
Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other tracing
improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK
APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU
SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi
IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA
onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2
XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl
UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza
wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm
y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT
bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb
YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af
RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g=
=n8lj
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other
tracing improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>"
* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
NFS: Remove the nfs4_label argument from decode_getattr_*() functions
NFS: Remove the nfs4_label argument from nfs_setsecurity
NFS: Remove the nfs4_label argument from nfs_fhget()
NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
NFS: Remove the nfs4_label argument from nfs_instantiate()
NFS: Remove the nfs4_label from the nfs_setattrres
NFS: Remove the nfs4_label from the nfs4_getattr_res
NFS: Remove the f_label from the nfs4_opendata and nfs_openres
NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
NFS: Remove the label from the nfs4_lookup_res struct
NFS: Remove the nfs4_label from the nfs4_link_res struct
NFS: Remove the nfs4_label from the nfs4_create_res struct
NFS: Remove the nfs4_label from the nfs_entry struct
NFS: Create a new nfs_alloc_fattr_with_label() function
NFS: Always initialise fattr->label in nfs_fattr_alloc()
NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
NFSv4: Remove unnecessary 'minor version' check
NFSv4: Fix potential Oops in decode_op_map()
...
It may be helpful to have access to the ifindex during bpf socket
lookup. An example may be to scope certain socket lookup logic to
specific interfaces, i.e. an interface may be made exempt from custom
lookup code.
Add the ifindex of the arriving connection to the bpf_sk_lookup API.
Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com