Commit Graph

73504 Commits

Author SHA1 Message Date
Chuck Lever 7ea9c1ec66 net/handshake: Fix handshake_dup() ref counting
If get_unused_fd_flags() fails, we ended up calling fput(sock->file)
twice.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 22:05:24 -07:00
Chuck Lever a095326e2c net/handshake: Remove unneeded check from handshake_dup()
handshake_req_submit() now verifies that the socket has a file.

Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 22:05:23 -07:00
Jakub Kicinski 0c615f1cc3 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZG4AiAAKCRDbK58LschI
 g+xlAQCmefGbDuwPckZLnomvt6gl4bkIjs7kc1ySbG9QBnaInwD/WyrJaQIPijuD
 qziHPAyx+MEgPseFU1b7Le35SZ66IwM=
 =s4R1
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-05-24

We've added 19 non-merge commits during the last 10 day(s) which contain
a total of 20 files changed, 738 insertions(+), 448 deletions(-).

The main changes are:

1) Batch of BPF sockmap fixes found when running against NGINX TCP tests,
   from John Fastabend.

2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails,
   from Anton Protopopov.

3) Init the BPF offload table earlier than just late_initcall,
   from Jakub Kicinski.

4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields,
   from Will Deacon.

5) Remove a now unsupported __fallthrough in BPF samples,
   from Andrii Nakryiko.

6) Fix a typo in pkg-config call for building sign-file,
   from Jeremy Sowden.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
  bpf, sockmap: Build helper to create connected socket pair
  bpf, sockmap: Pull socket helpers out of listen test for general use
  bpf, sockmap: Incorrectly handling copied_seq
  bpf, sockmap: Wake up polling after data copy
  bpf, sockmap: TCP data stall on recv before accept
  bpf, sockmap: Handle fin correctly
  bpf, sockmap: Improved check for empty queue
  bpf, sockmap: Reschedule is now done through backlog
  bpf, sockmap: Convert schedule_work into delayed_work
  bpf, sockmap: Pass skb ownership through read_skb
  bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
  bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
  samples/bpf: Drop unnecessary fallthrough
  bpf: netdev: init the offload table earlier
  selftests/bpf: Fix pkg-config call building sign-file
====================

Link: https://lore.kernel.org/r/20230524170839.13905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 21:57:57 -07:00
Jiri Pirko 9277649c66 devlink: pass devlink_port pointer to ops->port_del() instead of index
Historically there was a reason why port_dev() along with for example
port_split() did get port_index instead of the devlink_port pointer.
With the locking changes that were done which ensured devlink instance
mutex is hold for every command, the port ops could get devlink_port
pointer directly. Change the forgotten port_dev() op to be as others
and pass devlink_port pointer instead of port_index.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 10:34:26 +01:00
Jiri Pirko c496daeb86 devlink: remove duplicate port notification
The notification about created port is send from devl_port_register()
function called from ops->port_new(). No need to send it again here,
so remove the call and the helper function.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 10:34:26 +01:00
Gavrilov Ilia 878ecb0897 ipv6: Fix out-of-bounds access in ipv6_find_tlv()
optlen is fetched without checking whether there is more than one byte to parse.
It can lead to out-of-bounds access.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: c61a404325 ("[IPV6]: Find option offset by type.")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:43:39 +01:00
Guillaume Nault 0e26371db5 udp: Stop using RTO_ONLINK.
Use ip_sendmsg_scope() to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag. The objective is to eventually remove RTO_ONLINK, which will
allow converting .flowi4_tos to dscp_t.

Now that the scope is determined by ip_sendmsg_scope(), we need to
check its result to set the 'connected' variable.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:22:06 +01:00
Guillaume Nault c85be08fc4 raw: Stop using RTO_ONLINK.
Use ip_sendmsg_scope() to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag. The objective is to eventually remove RTO_ONLINK, which will
allow converting .flowi4_tos to dscp_t.

The MSG_DONTROUTE and SOCK_LOCALROUTE cases were already handled by
raw_sendmsg() (SOCK_LOCALROUTE was handled by the RT_CONN_FLAGS*()
macros called by get_rtconn_flags()). However, opt.is_strictroute
wasn't taken into account. Therefore, a side effect of this patch is to
now honour opt.is_strictroute, and thus align raw_sendmsg() with
ping_v4_sendmsg() and udp_sendmsg().

Since raw_sendmsg() was the only user of get_rtconn_flags(), we can now
remove this function.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:22:06 +01:00
Guillaume Nault 726de790f6 ping: Stop using RTO_ONLINK.
Define a new helper to figure out the correct route scope to use on TX,
depending on socket configuration, ancillary data and send flags.

Use this new helper to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag.

The objective is to eventually remove RTO_ONLINK, which will allow
converting .flowi4_tos to dscp_t.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:22:06 +01:00
Pratyush Yadav 8a02fb71d7 net: fix skb leak in __skb_tstamp_tx()
Commit 50749f2dd6 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with
TX timestamp.") added a call to skb_orphan_frags_rx() to fix leaks with
zerocopy skbs. But it ended up adding a leak of its own. When
skb_orphan_frags_rx() fails, the function just returns, leaking the skb
it just cloned. Free it before returning.

This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.

Fixes: 50749f2dd6 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230522153020.32422-1-ptyadav@amazon.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:51:43 -07:00
David Howells 57d44a354a unix: Convert unix_stream_sendpage() to use MSG_SPLICE_PAGES
Convert unix_stream_sendpage() to use sendmsg() with MSG_SPLICE_PAGES
rather than directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Kuniyuki Iwashima <kuniyu@amazon.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:28 -07:00
David Howells a0dbf5f818 af_unix: Support MSG_SPLICE_PAGES
Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the
source iterator if possible and copying the data in otherwise.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Kuniyuki Iwashima <kuniyu@amazon.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells c49cf26632 ip: Remove ip_append_page()
ip_append_page() is no longer used with the removal of udp_sendpage(), so
remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 7ac7c98785 udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
Convert udp_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 6d8192bd69 ip6, udp6: Support MSG_SPLICE_PAGES
Make IP6/UDP6 sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible, copying the data if not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 7da0dde684 ip, udp: Support MSG_SPLICE_PAGES
Make IP/UDP sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 5367f9bbb8 tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
Fold do_tcp_sendpages() into its last remaining caller,
tcp_sendpage_locked().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells e117dcfd64 tls: Inline do_tcp_sendpages()
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 7f8816ab4b espintcp: Inline do_tcp_sendpages()
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steffen Klassert <steffen.klassert@secunet.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells ebf2e8860e tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it.  This is part of replacing ->sendpage() with a call to
sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells c5c37af6ec tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
Convert do_tcp_sendpages() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself.  do_tcp_sendpages() can then be
inlined in subsequent patches into its callers.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 270a1c3de4 tcp: Support MSG_SPLICE_PAGES
Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced or copied (if it cannot be spliced) from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 2e910b9532 net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES
Add a function to handle MSG_SPLICE_PAGES being passed internally to
sendmsg().  Pages are spliced into the given socket buffer if possible and
copied in if not (e.g. they're slab pages or have a zero refcount).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells 96449f9024 net: Pass max frags into skb_append_pagefrags()
Pass the maximum number of fragments into skb_append_pagefrags() rather
than using MAX_SKB_FRAGS so that it can be used from code that wants to
specify sysctl_max_skb_frags.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells b841b901c4 net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
network protocol that it should splice pages from the source iterator
rather than copying the data if it can.  This flag is added to a list that
is cleared by sendmsg syscalls on entry.

This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
Yunsheng Lin 368d3cb406 page_pool: fix inconsistency for page_pool_ring_[un]lock()
page_pool_ring_[un]lock() use in_softirq() to decide which
spin lock variant to use, and when they are called in the
context with in_softirq() being false, spin_lock_bh() is
called in page_pool_ring_lock() while spin_unlock() is
called in page_pool_ring_unlock(), because spin_lock_bh()
has disabled the softirq in page_pool_ring_lock(), which
causes inconsistency for spin lock pair calling.

This patch fixes it by returning in_softirq state from
page_pool_producer_lock(), and use it to decide which
spin lock variant to use in page_pool_producer_unlock().

As pool->ring has both producer and consumer lock, so
rename it to page_pool_producer_[un]lock() to reflect
the actual usage. Also move them to page_pool.c as they
are only used there, and remove the 'inline' as the
compiler may have better idea to do inlining or not.

Fixes: 7886244736 ("net: page_pool: Add bulk support for ptr_ring")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:25:13 -07:00
John Fastabend e5c6de5fa0 bpf, sockmap: Incorrectly handling copied_seq
The read_skb() logic is incrementing the tcp->copied_seq which is used for
among other things calculating how many outstanding bytes can be read by
the application. This results in application errors, if the application
does an ioctl(FIONREAD) we return zero because this is calculated from
the copied_seq value.

To fix this we move tcp->copied_seq accounting into the recv handler so
that we update these when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value
as expected and improves ACK handling. Before we were calling the
tcp_rcv_space_adjust() which would update 'number of bytes copied to
user in last RTT' which is wrong for programs returning SK_PASS. The
bytes are only copied to the user when recvmsg is handled.

Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
call this from skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because although read_skb() consumed the skb through sock_drop()
it did not update the copied_seq. Now if a single reccv socket is
redirecting to many sockets (for example for lb) the receiver sk will be
hung even though we might expect it to continue. The hang comes from
not updating the copied_seq numbers and memory pressure resulting from
that.

We have a slight layer problem of calling tcp_eat_skb even if its not
a TCP socket. To fix we could refactor and create per type receiver
handlers. I decided this is more work than we want in the fix and we
already have some small tweaks depending on caller that use the
helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use and then we can gate the
seq_copied updates on this.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
2023-05-23 16:10:42 +02:00
John Fastabend 6df7f764cd bpf, sockmap: Wake up polling after data copy
When TCP stack has data ready to read sk_data_ready() is called. Sockmap
overwrites this with its own handler to call into BPF verdict program.
But, the original TCP socket had sock_def_readable that would additionally
wake up any user space waiters with sk_wake_async().

Sockmap saved the callback when the socket was created so call the saved
data ready callback and then we can wake up any epoll() logic waiting
on the read.

Note we call on 'copied >= 0' to account for returning 0 when a FIN is
received because we need to wake up user for this as well so they
can do the recvmsg() -> 0 and detect the shutdown.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
2023-05-23 16:10:35 +02:00
John Fastabend ea444185a6 bpf, sockmap: TCP data stall on recv before accept
A common mechanism to put a TCP socket into the sockmap is to hook the
BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
that can map the socket info to the correct BPF verdict parser. When
the user adds the socket to the map the psock is created and the new
ops are assigned to ensure the verdict program will 'see' the sk_buffs
as they arrive.

Part of this process hooks the sk_data_ready op with a BPF specific
handler to wake up the BPF verdict program when data is ready to read.
The logic is simple enough (posted here for easy reading)

 static void sk_psock_verdict_data_ready(struct sock *sk)
 {
	struct socket *sock = sk->sk_socket;

	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
		return;
	sock->ops->read_skb(sk, sk_psock_verdict_recv);
 }

The oversight here is sk->sk_socket is not assigned until the application
accepts() the new socket. However, its entirely ok for the peer application
to do a connect() followed immediately by sends. The socket on the receiver
is sitting on the backlog queue of the listening socket until its accepted
and the data is queued up. If the peer never accepts the socket or is slow
it will eventually hit data limits and rate limit the session. But,
important for BPF sockmap hooks when this data is received TCP stack does
the sk_data_ready() call but the read_skb() for this data is never called
because sk_socket is missing. The data sits on the sk_receive_queue.

Then once the socket is accepted if we never receive more data from the
peer there will be no further sk_data_ready calls and all the data
is still on the sk_receive_queue(). Then user calls recvmsg after accept()
and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
The handler checks for data in the sk_msg ingress queue expecting that
the BPF program has already run from the sk_data_ready hook and enqueued
the data as needed. So we are stuck.

To fix do an unlikely check in recvmsg handler for data on the
sk_receive_queue and if it exists wake up data_ready. We have the sock
locked in both read_skb and recvmsg so should avoid having multiple
runners.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
2023-05-23 16:10:28 +02:00
John Fastabend 901546fd8f bpf, sockmap: Handle fin correctly
The sockmap code is returning EAGAIN after a FIN packet is received and no
more data is on the receive queue. Correct behavior is to return 0 to the
user and the user can then close the socket. The EAGAIN causes many apps
to retry which masks the problem. Eventually the socket is evicted from
the sockmap because its released from sockmap sock free handling. The
issue creates a delay and can cause some errors on application side.

To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
is set then set return to zero. A selftest will be added to check this
condition.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
2023-05-23 16:10:18 +02:00
John Fastabend 405df89dd5 bpf, sockmap: Improved check for empty queue
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.

But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.

Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.

To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.

To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
2023-05-23 16:10:11 +02:00
John Fastabend bce22552f9 bpf, sockmap: Reschedule is now done through backlog
Now that the backlog manages the reschedule() logic correctly we can drop
the partial fix to reschedule from recvmsg hook.

Rescheduling on recvmsg hook was added to address a corner case where we
still had data in the backlog state but had nothing to kick it and
reschedule the backlog worker to run and finish copying data out of the
state. This had a couple limitations, first it required user space to
kick it introducing an unnecessary EBUSY and retry. Second it only
handled the ingress case and egress redirects would still be hung.

With the correct fix, pushing the reschedule logic down to where the
enomem error occurs we can drop this fix.

Fixes: bec217197b ("skmsg: Schedule psock work if the cached skb exists on the psock")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
2023-05-23 16:10:04 +02:00
John Fastabend 29173d07f7 bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
     // if system is under memory pressure or app is slow we may
     // need to queue skb. Do this queuing through ingress_skb and
     // then kick timer to wake up handler
     skb_queue_tail(ingress_skb, skb)
     schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.

Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.

To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.

To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.

>From on list discussion. This commit

 bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")

was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
2023-05-23 16:09:56 +02:00
John Fastabend 78fa0d61d9 bpf, sockmap: Pass skb ownership through read_skb
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.

This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,

 skb_linearize()
   __pskb_pull_tail()
     pskb_expand_head()
       BUG_ON(skb_shared(skb))

Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.

To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.

Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.

[  106.536188] ------------[ cut here ]------------
[  106.536197] kernel BUG at net/core/skbuff.c:1693!
[  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  106.542255] Call Trace:
[  106.542383]  <IRQ>
[  106.542487]  __pskb_pull_tail+0x4b/0x3e0
[  106.542681]  skb_ensure_writable+0x85/0xa0
[  106.542882]  sk_skb_pull_data+0x18/0x20
[  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[  106.543536]  ? migrate_disable+0x66/0x80
[  106.543871]  sk_psock_verdict_recv+0xe2/0x310
[  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
[  106.544561]  tcp_read_skb+0x7b/0x120
[  106.544740]  tcp_data_queue+0x904/0xee0
[  106.544931]  tcp_rcv_established+0x212/0x7c0
[  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
[  106.545326]  tcp_v4_rcv+0xe70/0xf60
[  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
[  106.545744]  ip_local_deliver_finish+0xa7/0x150

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
2023-05-23 16:09:47 +02:00
Nicolas Dichtel 3632679d9e ipv{4,6}/raw: fix output xfrm lookup wrt protocol
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()),  is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.

For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.

For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
CC: stable@vger.kernel.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-23 15:38:59 +02:00
Chuck Lever 18c40a1cc1 net/handshake: Fix sock->file allocation
sock->file = sock_alloc_file(sock, O_NONBLOCK, NULL);
	^^^^                         ^^^^

sock_alloc_file() calls release_sock() on error but the left hand
side of the assignment dereferences "sock".  This isn't the bug and
I didn't report this earlier because there is an assert that it
doesn't fail.

net/handshake/handshake-test.c:221 handshake_req_submit_test4() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:233 handshake_req_submit_test4() warn: 'req' was already freed.
net/handshake/handshake-test.c:254 handshake_req_submit_test5() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:290 handshake_req_submit_test6() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:321 handshake_req_cancel_test1() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:355 handshake_req_cancel_test2() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:367 handshake_req_cancel_test2() warn: 'req' was already freed.
net/handshake/handshake-test.c:395 handshake_req_cancel_test3() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:407 handshake_req_cancel_test3() warn: 'req' was already freed.
net/handshake/handshake-test.c:451 handshake_req_destroy_test1() error: dereferencing freed memory 'sock'
net/handshake/handshake-test.c:463 handshake_req_destroy_test1() warn: 'req' was already freed.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 88232ec1ec ("net/handshake: Add Kunit tests for the handshake consumer API")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/168451609436.45209.15407022385441542980.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-22 19:25:24 -07:00
Chuck Lever b21c7ba6d9 net/handshake: Squelch allocation warning during Kunit test
The "handshake_req_alloc excessive privsize" kunit test is intended
to check what happens when the maximum privsize is exceeded. The
WARN_ON_ONCE_GFP at mm/page_alloc.c:4744 can be disabled safely for
this test.

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Fixes: 88232ec1ec ("net/handshake: Add Kunit tests for the handshake consumer API")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/168451636052.47152.9600443326570457947.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-22 19:24:52 -07:00
Linus Torvalds 421ca22e31 NFS Client Bugfixes for Linux 6.4-rc
Stable Fix:
   * Don't change task->tk_status after the call to rpc_exit_task
 
 Other Bugfixes:
   * Convert kmap_atomic() to kmap_local_folio()
   * Fix a potential double free with READ_PLUS
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRrttUACgkQ18tUv7Cl
 QOuhaA//QFHklXZk/vCkQnNQMYWL11GJliWawLoDfcZal6uQ/a2QCQV1Cbmav62B
 FR2BmXDxzM2PRdLu2VHGpkn0CQW3M1tvgaNjGD1xdOxpyIkn47T5lfAd/4X2XPiU
 M1ck2Usc258UB1yoKV+jbUD3ptn2BvC+VMWJInaA578hv8TA6Ouh7lP7rPJfDHoJ
 OfoLxx9/VqGqMWzfExAHnGw328oieXNnOwynETAdapVwjQeiEcYAED82pJmVsD7+
 m++6dRVQRA2bMIMRFWmW8HsO08sR32wzy76XgKws4Xu59Fiy+TQ8PoeUjCtTNq6/
 9ibPwH4R7VbcxXa2eT23EbtO2nSkZw/dFiL0s5VNYqeVrBwwlzyklU1uSvIEPegk
 zHamqxMMlVLkoMwJa83wIKB8/viPKwV5zcF9UjmrJy67+wXZet6M0c7S9HyiTj9U
 NzVbqyK3KhMtsD4ps/EGVWsgGKAIeWbE8wPlP7GF7PHwEw+hWa9pHir6L6BizNqG
 DJ/2zfZxDvOGy2r5OvSqGn07/zsj+0URixzEq0IOn1Li/osFZpvK3EVFncd/qsvW
 NwPRoF+70skFRdXhbdWa/HEUZlyN2uiIU24luraMrN0U4b4X7aw+EMnMekBi+Vec
 bEtWEUJ/vK3mlsOde4gVW0PZBhe8JE6PHlqkQBn5zobV3/cXXCw=
 =6xFZ
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Stable Fix:

   - Don't change task->tk_status after the call to rpc_exit_task

  Other Bugfixes:

   - Convert kmap_atomic() to kmap_local_folio()

   - Fix a potential double free with READ_PLUS"

* tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Fix a potential double free with READ_PLUS
  SUNRPC: Don't change task->tk_status after the call to rpc_exit_task
  NFS: Convert kmap_atomic() to kmap_local_folio()
2023-05-22 12:01:13 -07:00
Pavel Begunkov fe79bd65c8 net/tcp: refactor tcp_inet6_sk()
Don't keep hand coded offset caluclations and replace it with
container_of(). It should be type safer and a bit less confusing.

It also makes it with a macro instead of inline function to preserve
constness, which was previously casted out like in case of
tcp_v6_send_synack().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-22 11:22:58 +01:00
Arnd Bergmann dbb99d7852 net: ipconfig: move ic_nameservers_fallback into #ifdef block
The new variable is only used when IPCONFIG_BOOTP is defined and otherwise
causes a warning:

net/ipv4/ipconfig.c:177:12: error: 'ic_nameservers_fallback' defined but not used [-Werror=unused-variable]

Move it next to the user.

Fixes: 81ac2722fa ("net: ipconfig: Allow DNS to be overwritten by DHCPACK")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-22 11:17:55 +01:00
Xin Long 6ca328e985 sctp: fix an issue that plpmtu can never go to complete state
When doing plpmtu probe, the probe size is growing every time when it
receives the ACK during the Search state until the probe fails. When
the failure occurs, pl.probe_high is set and it goes to the Complete
state.

However, if the link pmtu is huge, like 65535 in loopback_dev, the probe
eventually keeps using SCTP_MAX_PLPMTU as the probe size and never fails.
Because of that, pl.probe_high can not be set, and the plpmtu probe can
never go to the Complete state.

Fix it by setting pl.probe_high to SCTP_MAX_PLPMTU when the probe size
grows to SCTP_MAX_PLPMTU in sctp_transport_pl_recv(). Also, not allow
the probe size greater than SCTP_MAX_PLPMTU in the Complete state.

Fixes: b87641aff9 ("sctp: do state transition when a probe succeeds on HB ACK recv path")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-22 11:05:20 +01:00
Aditi Ghag 4ddbcb8862 bpf: Add bpf_sock_destroy kfunc
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends.  The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated.  The kfunc also allows terminating sockets that may
or may not be actively sending traffic.

The kfunc can currently be called only from BPF TCP and UDP iterators
where users can filter, and terminate selected sockets. More
specifically, it can only be called from  BPF contexts that ensure
socket locking in order to allow synchronous execution of protocol
specific `diag_destroy` handlers. The previous commit that batches UDP
sockets during iteration facilitated a synchronous invocation of the UDP
destroy callback from BPF context by skipping socket locks in
`udp_abort`. TCP iterator already supported batching of sockets being
iterated. To that end, `tracing_iter_filter` callback filter is added so
that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
attach type, and reject other programs.

The kfunc takes `sock_common` type argument, even though it expects, and
casts them to a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe for certain sockets like request sockets, but these have a
special handling in the diag_destroy handlers.

Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
eg. getting a sk pointer (may be even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info.  The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 22:44:28 -07:00
Aditi Ghag c96dac8d36 bpf: udp: Implement batching for sockets iterator
Batch UDP sockets from BPF iterator that allows for overlapping locking
semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
BPF socket destroy kfunc (introduced by follow-up patches) to execute from
BPF iterator programs.

Previously, BPF iterators acquired the sock lock and sockets hash table
bucket lock while executing BPF programs. This prevented BPF helpers that
again acquire these locks to be executed from BPF iterators.  With the
batching approach, we acquire a bucket lock, batch all the bucket sockets,
and then release the bucket lock. This enables BPF or kernel helpers to
skip sock locking when invoked in the supported BPF contexts.

The batching logic is similar to the logic implemented in TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-6-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 17:45:47 -07:00
Aditi Ghag e4fe1bf13e udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
This is a preparatory commit to remove the field. The field was
previously shared between proc fs and BPF UDP socket iterators. As the
follow-up commits will decouple the implementation for the iterators,
remove the field. As for BPF socket iterator, filtering of sockets is
exepected to be done in BPF programs.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-5-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 17:45:47 -07:00
Aditi Ghag 7625d2e974 bpf: udp: Encapsulate logic to get udp table
This is a preparatory commit that encapsulates the logic
to get udp table in iterator inside udp_get_table_afinfo, and
renames the function to `udp_get_table_seq` accordingly.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-4-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 17:45:46 -07:00
Aditi Ghag f44b1c5158 udp: seq_file: Helper function to match socket attributes
This is a preparatory commit to refactor code that matches socket
attributes in iterators to a helper function, and use it in the
proc fs iterator.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-3-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 17:45:46 -07:00
Aditi Ghag 9378096e8a bpf: tcp: Avoid taking fast sock lock in iterator
This is a preparatory commit to replace `lock_sock_fast` with
`lock_sock`,and facilitate BPF programs executed from the TCP sockets
iterator to be able to destroy TCP sockets using the bpf_sock_destroy
kfunc (implemented in follow-up commits).

Previously, BPF TCP iterator was acquiring the sock lock with BH
disabled. This led to scenarios where the sockets hash table bucket lock
can be acquired with BH enabled in some path versus disabled in other.
In such situation, kernel issued a warning since it thinks that in the
BH enabled path the same bucket lock *might* be acquired again in the
softirq context (BH disabled), which will lead to a potential dead lock.
Since bpf_sock_destroy also happens in a process context, the potential
deadlock warning is likely a false alarm.

Here is a snippet of annotated stack trace that motivated this change:

```

Possible interrupt unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock(&h->lhash2[i].lock);
                              local_bh_disable();
                              lock(&h->lhash2[i].lock);
kernel imagined possible scenario:
  local_bh_disable();  /* Possible softirq */
  lock(&h->lhash2[i].lock);
*** Potential Deadlock ***

process context:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH enabled
__inet_hash+0x4b/0x210
inet_csk_listen_start+0xe6/0x100
inet_listen+0x95/0x1d0
__sys_listen+0x69/0xb0
__x64_sys_listen+0x14/0x20
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

bpf_sock_destroy run from iterator:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH disabled
inet_unhash+0x9a/0x110
tcp_set_state+0x6a/0x210
tcp_abort+0x10d/0x200
bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
bpf_iter_run_prog+0x1ff/0x340
------> lock_sock_fast that acquires sock lock with BH disabled
bpf_iter_tcp_seq_show+0xca/0x190
bpf_seq_read+0x177/0x450

```

Also, Yonghong reported a deadlock for non-listening TCP sockets that
this change resolves. Previously, `lock_sock_fast` held the sock spin
lock with BH which was again being acquired in `tcp_abort`:

```
watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
Call Trace:
 <TASK>
 _raw_spin_lock+0x84/0x90
 tcp_abort+0x13c/0x1f0
 bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
 bpf_iter_run_prog+0x1aa/0x2c0
 ? preempt_count_sub+0x1c/0xd0
 ? from_kuid_munged+0x1c8/0x210
 bpf_iter_tcp_seq_show+0x14e/0x1b0
 bpf_seq_read+0x36c/0x6a0

bpf_iter_tcp_seq_show
   lock_sock_fast
     __lock_sock_fast
       spin_lock_bh(&sk->sk_lock.slock);
	/* * Fast path return with bottom halves disabled and * sock::sk_lock.slock held.* */

 ...
 tcp_abort
   local_bh_disable();
   spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)

```

With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning:

```
lock_sock
    lock_sock_nested
       spin_lock_bh(&sk->sk_lock.slock);
       :
       spin_unlock_bh(&sk->sk_lock.slock);
```

Acked-by: Yonghong Song <yhs@meta.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-2-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 17:45:46 -07:00
Ruihan Li a2904d2825 Bluetooth: Unlink CISes when LE disconnects in hci_conn_del
Currently, hci_conn_del calls hci_conn_unlink for BR/EDR, (e)SCO, and
CIS connections, i.e., everything except LE connections. However, if
(e)SCO connections are unlinked when BR/EDR disconnects, CIS connections
should also be unlinked when LE disconnects.

In terms of disconnection behavior, CIS and (e)SCO connections are not
too different. One peculiarity of CIS is that when CIS connections are
disconnected, the CIS handle isn't deleted, as per [BLUETOOTH CORE
SPECIFICATION Version 5.4 | Vol 4, Part E] 7.1.6 Disconnect command:

        All SCO, eSCO, and CIS connections on a physical link should be
        disconnected before the ACL connection on the same physical
        connection is disconnected. If it does not, they will be
        implicitly disconnected as part of the ACL disconnection.
        ...
        Note: As specified in Section 7.7.5, on the Central, the handle
        for a CIS remains valid even after disconnection and, therefore,
        the Host can recreate a disconnected CIS at a later point in
        time using the same connection handle.

Since hci_conn_link invokes both hci_conn_get and hci_conn_hold,
hci_conn_unlink should perform both hci_conn_put and hci_conn_drop as
well. However, currently it performs only hci_conn_put.

This patch makes hci_conn_unlink call hci_conn_drop as well, which
simplifies the logic in hci_conn_del a bit and may benefit future users
of hci_conn_unlink. But it is noted that this change additionally
implies that hci_conn_unlink can queue disc_work on conn itself, with
the following call stack:

        hci_conn_unlink(conn)  [conn->parent == NULL]
                -> hci_conn_unlink(child)  [child->parent == conn]
                        -> hci_conn_drop(child->parent)
                                -> queue_delayed_work(&conn->disc_work)

Queued disc_work after hci_conn_del can be spurious, so during the
process of hci_conn_del, it is necessary to make the call to
cancel_delayed_work(&conn->disc_work) after invoking hci_conn_unlink.

Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Co-developed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-05-19 15:37:45 -07:00
Ruihan Li a2ac591cb4 Bluetooth: Fix UAF in hci_conn_hash_flush again
Commit 06149746e7 ("Bluetooth: hci_conn: Add support for linking
multiple hcon") reintroduced a previously fixed bug [1] ("KASAN:
slab-use-after-free Read in hci_conn_hash_flush"). This bug was
originally fixed by commit 5dc7d23e16 ("Bluetooth: hci_conn: Fix
possible UAF").

The hci_conn_unlink function was added to avoid invalidating the link
traversal caused by successive hci_conn_del operations releasing extra
connections. However, currently hci_conn_unlink itself also releases
extra connections, resulted in the reintroduced bug.

This patch follows a more robust solution for cleaning up all
connections, by repeatedly removing the first connection until there are
none left. This approach does not rely on the inner workings of
hci_conn_del and ensures proper cleanup of all connections.

Meanwhile, we need to make sure that hci_conn_del never fails. Indeed it
doesn't, as it now always returns zero. To make this a bit clearer, this
patch also changes its return type to void.

Reported-by: syzbot+8bb72f86fc823817bc5d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-bluetooth/000000000000aa920505f60d25ad@google.com/
Fixes: 06149746e7 ("Bluetooth: hci_conn: Add support for linking multiple hcon")
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Co-developed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-05-19 15:37:45 -07:00
Ruihan Li 2910431ab0 Bluetooth: Refcnt drop must be placed last in hci_conn_unlink
If hci_conn_put(conn->parent) reduces conn->parent's reference count to
zero, it can immediately deallocate conn->parent. At the same time,
conn->link->list has its head in conn->parent, causing use-after-free
problems in the latter list_del_rcu(&conn->link->list).

This problem can be easily solved by reordering the two operations,
i.e., first performing the list removal with list_del_rcu and then
decreasing the refcnt with hci_conn_put.

Reported-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Closes: https://lore.kernel.org/linux-bluetooth/CABBYNZ+1kce8_RJrLNOXd_8=Mdpb=2bx4Nto-hFORk=qiOkoCg@mail.gmail.com/
Fixes: 06149746e7 ("Bluetooth: hci_conn: Add support for linking multiple hcon")
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-05-19 15:37:45 -07:00
Ruihan Li ca1fd42e7d Bluetooth: Fix potential double free caused by hci_conn_unlink
The hci_conn_unlink function is being called by hci_conn_del, which
means it should not call hci_conn_del with the input parameter conn
again. If it does, conn may have already been released when
hci_conn_unlink returns, leading to potential UAF and double-free
issues.

This patch resolves the problem by modifying hci_conn_unlink to release
only conn's child links when necessary, but never release conn itself.

Reported-by: syzbot+690b90b14f14f43f4688@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-bluetooth/000000000000484a8205faafe216@google.com/
Fixes: 06149746e7 ("Bluetooth: hci_conn: Add support for linking multiple hcon")
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reported-by: syzbot+690b90b14f14f43f4688@syzkaller.appspotmail.com
Reported-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Reported-by: syzbot+8bb72f86fc823817bc5d@syzkaller.appspotmail.com
2023-05-19 15:37:45 -07:00
Trond Myklebust d180891fba SUNRPC: Don't change task->tk_status after the call to rpc_exit_task
Some calls to rpc_exit_task() may deliberately change the value of
task->tk_status, for instance because it gets checked by the RPC call's
rpc_release() callback. That makes it wrong to reset the value to
task->tk_rpc_status.
In particular this causes a bug where the rpc_call_done() callback tries
to fail over a set of pNFS/flexfiles writes to a different IP address,
but the reset of task->tk_status causes nfs_commit_release_pages() to
immediately mark the file as having a fatal error.

Fixes: 39494194f9 ("SUNRPC: Fix races with rpc_killall_tasks()")
Cc: stable@vger.kernel.org # 6.1.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19 16:50:05 -04:00
Wen Gu 3511227167 net/smc: Reset connection when trying to use SMCRv2 fails.
We found a crash when using SMCRv2 with 2 Mellanox ConnectX-4. It
can be reproduced by:

- smc_run nginx
- smc_run wrk -t 32 -c 500 -d 30 http://<ip>:<port>

 BUG: kernel NULL pointer dereference, address: 0000000000000014
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 8000000108713067 P4D 8000000108713067 PUD 151127067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 4 PID: 2441 Comm: kworker/4:249 Kdump: loaded Tainted: G        W   E      6.4.0-rc1+ #42
 Workqueue: smc_hs_wq smc_listen_work [smc]
 RIP: 0010:smc_clc_send_confirm_accept+0x284/0x580 [smc]
 RSP: 0018:ffffb8294b2d7c78 EFLAGS: 00010a06
 RAX: ffff8f1873238880 RBX: ffffb8294b2d7dc8 RCX: 0000000000000000
 RDX: 00000000000000b4 RSI: 0000000000000001 RDI: 0000000000b40c00
 RBP: ffffb8294b2d7db8 R08: ffff8f1815c5860c R09: 0000000000000000
 R10: 0000000000000400 R11: 0000000000000000 R12: ffff8f1846f56180
 R13: ffff8f1815c5860c R14: 0000000000000001 R15: 0000000000000001
 FS:  0000000000000000(0000) GS:ffff8f1aefd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000014 CR3: 00000001027a0001 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? mlx5_ib_map_mr_sg+0xa1/0xd0 [mlx5_ib]
  ? smcr_buf_map_link+0x24b/0x290 [smc]
  ? __smc_buf_create+0x4ee/0x9b0 [smc]
  smc_clc_send_accept+0x4c/0xb0 [smc]
  smc_listen_work+0x346/0x650 [smc]
  ? __schedule+0x279/0x820
  process_one_work+0x1e5/0x3f0
  worker_thread+0x4d/0x2f0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xe5/0x120
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x2c/0x50
  </TASK>

During the CLC handshake, server sequentially tries available SMCRv2
and SMCRv1 devices in smc_listen_work().

If an SMCRv2 device is found. SMCv2 based link group and link will be
assigned to the connection. Then assumed that some buffer assignment
errors happen later in the CLC handshake, such as RMB registration
failure, server will give up SMCRv2 and try SMCRv1 device instead. But
the resources assigned to the connection won't be reset.

When server tries SMCRv1 device, the connection creation process will
be executed again. Since conn->lnk has been assigned when trying SMCRv2,
it will not be set to the correct SMCRv1 link in
smcr_lgr_conn_assign_link(). So in such situation, conn->lgr points to
correct SMCRv1 link group but conn->lnk points to the SMCRv2 link
mistakenly.

Then in smc_clc_send_confirm_accept(), conn->rmb_desc->mr[link->link_idx]
will be accessed. Since the link->link_idx is not correct, the related
MR may not have been initialized, so crash happens.

 | Try SMCRv2 device first
 |     |-> conn->lgr:	assign existed SMCRv2 link group;
 |     |-> conn->link:	assign existed SMCRv2 link (link_idx may be 1 in SMC_LGR_SYMMETRIC);
 |     |-> sndbuf & RMB creation fails, quit;
 |
 | Try SMCRv1 device then
 |     |-> conn->lgr:	create SMCRv1 link group and assign;
 |     |-> conn->link:	keep SMCRv2 link mistakenly;
 |     |-> sndbuf & RMB creation succeed, only RMB->mr[link_idx = 0]
 |         initialized.
 |
 | Then smc_clc_send_confirm_accept() accesses
 | conn->rmb_desc->mr[conn->link->link_idx, which is 1], then crash.
 v

This patch tries to fix this by cleaning conn->lnk before assigning
link. In addition, it is better to reset the connection and clean the
resources assigned if trying SMCRv2 failed in buffer creation or
registration.

Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Link: https://lore.kernel.org/r/20220523055056.2078994-1-liuyacan@corp.netease.com/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:54:04 +01:00
Jakub Kicinski 74836ec828 tls: rx: strp: don't use GFP_KERNEL in softirq context
When receive buffer is small, or the TCP rx queue looks too
complicated to bother using it directly - we allocate a new
skb and copy data into it.

We already use sk->sk_allocation... but nothing actually
sets it to GFP_ATOMIC on the ->sk_data_ready() path.

Users of HW offload are far more likely to experience problems
due to scheduling while atomic. "Copy mode" is very rarely
triggered with SW crypto.

Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Jakub Kicinski eca9bfafee tls: rx: strp: preserve decryption status of skbs when needed
When receive buffer is small we try to copy out the data from
TCP into a skb maintained by TLS to prevent connection from
stalling. Unfortunately if a single record is made up of a mix
of decrypted and non-decrypted skbs combining them into a single
skb leads to loss of decryption status, resulting in decryption
errors or data corruption.

Similarly when trying to use TCP receive queue directly we need
to make sure that all the skbs within the record have the same
status. If we don't the mixed status will be detected correctly
but we'll CoW the anchor, again collapsing it into a single paged
skb without decrypted status preserved. So the "fixup" code will
not know which parts of skb to re-encrypt.

Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Jakub Kicinski c1c607b1e5 tls: rx: strp: factor out copying skb data
We'll need to copy input skbs individually in the next patch.
Factor that code out (without assuming we're copying a full record).

Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Jakub Kicinski 8b0c0dc9fb tls: rx: strp: fix determining record length in copy mode
We call tls_rx_msg_size(skb) before doing skb->len += chunk.
So the tls_rx_msg_size() code will see old skb->len, most
likely leading to an over-read.

Worst case we will over read an entire record, next iteration
will try to trim the skb but may end up turning frag len negative
or discarding the subsequent record (since we already told TCP
we've read it during previous read but now we'll trim it out of
the skb).

Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Jakub Kicinski 14c4be92eb tls: rx: strp: force mixed decrypted records into copy mode
If a record is partially decrypted we'll have to CoW it, anyway,
so go into copy mode and allocate a writable skb right away.

This will make subsequent fix simpler because we won't have to
teach tls_strp_msg_make_copy() how to copy skbs while preserving
decrypt status.

Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Jakub Kicinski 210620ae44 tls: rx: strp: set the skb->len of detached / CoW'ed skbs
alloc_skb_with_frags() fills in page frag sizes but does not
set skb->len and skb->data_len. Set those correctly otherwise
device offload will most likely generate an empty skb and
hit the BUG() at the end of __skb_nsg().

Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Jakub Kicinski b3a03b540e tls: rx: device: fix checking decryption status
skb->len covers the entire skb, including the frag_list.
In fact we're guaranteed that rxm->full_len <= skb->len,
so since the change under Fixes we were not checking decrypt
status of any skb but the first.

Note that the skb_pagelen() added here may feel a bit costly,
but it's removed by subsequent fixes, anyway.

Reported-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 86b259f6f8 ("tls: rx: device: bound the frag walk")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19 08:37:37 +01:00
Paolo Abeni 45b1a1227a mptcp: introduces more address related mibs
Currently we don't track explicitly a few events related to address
management suboption handling; this patch adds new mibs for ADD_ADDR
and RM_ADDR options tx and for missed tx events due to internal storage
exhaustion.

The self-tests must be updated to properly handle different mibs with
the same/shared prefix.

Additionally removes a couple of warning tracking the loss event.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/378
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-18 20:06:32 -07:00
Paolo Abeni e76c8ef5cc mptcp: refactor mptcp_stream_accept()
Rewrite the mptcp socket accept op, leveraging the new
__inet_accept() helper.

This way we can avoid acquiring the new socket lock twice
and we can avoid a couple of indirect calls.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-18 20:06:32 -07:00
Paolo Abeni 711bdd5141 inet: factor out locked section of inet_accept() in a new helper
No functional changes intended. The new helper will be used
by the MPTCP protocol in the next patch to avoid duplicating
a few LoC.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-18 20:06:31 -07:00
Jakub Kicinski 90223c1136 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/freescale/fec_main.c
  6ead9c98ca ("net: fec: remove the xdp_return_frame when lack of tx BDs")
  144470c88c ("net: fec: using the standard return codes when xdp xmit errors")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-18 14:39:34 -07:00
Linus Torvalds 1f594fe7c9 Networking fixes for 6.4-rc3, including fixes from can, xfrm,
bluetooth and netfilter.
 
 Current release - regressions:
 
   - ipv6: fix RCU splat in ipv6_route_seq_show()
 
   - wifi: iwlwifi: disable RFI feature
 
 Previous releases - regressions:
 
   - tcp: fix possible sk_priority leak in tcp_v4_send_reset()
 
   - tipc: do not update mtu if msg_max is too small in mtu negotiation
 
   - netfilter: fix null deref on element insertion
 
   - devlink: change per-devlink netdev notifier to static one
 
   - phylink: fix ksettings_set() ethtool call
 
   - wifi: mac80211: fortify the spinlock against deadlock by interrupt
 
   - wifi: brcmfmac: check for probe() id argument being NULL
 
   - eth: ice:
     - fix undersized tx_flags variable
     - fix ice VF reset during iavf initialization
 
   - eth: hns3: fix sending pfc frames after reset issue
 
 Previous releases - always broken:
 
   - xfrm: release all offloaded policy memory
 
   - nsh: use correct mac_offset to unwind gso skb in nsh_gso_segment()
 
   - vsock: avoid to close connected socket after the timeout
 
   - dsa: rzn1-a5psw: enable management frames for CPU port
 
   - eth: virtio_net: fix error unwinding of XDP initialization
 
   - eth: tun: fix memory leak for detached NAPI queue.
 
 Misc:
 
   - MAINTAINERS: sctp: move Neil to CREDITS
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRmEHoSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkXu0P/AnSeVtu2CYZSjyCQYvkKpdpbDOSlsee
 GOnG0jduOdJ+OabYyM6prg9JHp1t3QxxTVJs+Spc7Eh9EX+YHRwK5DNPhv3GQ7RI
 pSwQxwiEhLVXVjaEtqUo1Wf8JgCQJ+ThisSIDgQRnaHKQnrRIlbZRngwn6TwNFba
 kxBpCUn2RZcZZhL8xdVF4UbSbEeLEupN76rHiePBVKNG70QVRptX3Rd6EV6FxmTa
 EzAtfNh1r0p3BHW1YYgsbpy7PeXhKGhlVLqIld8h9r/y4hATsrQ2f7Bv0RNrVBDf
 f8r1bdG5E0K3V8AzFPpyOe304G3GAwV+V/wtvA3GRjiwmPUwzJOzaNnSgJZZ/sbq
 mnR4pCwJ4gDGnDxLa8hBh6+emQGB6LJJX0JTQY7vMPNmUwtIuEQc6tLxJ4DpXTzW
 psEndQPDA7yR/7pNoE7ax+8CKCxPvfiBRnV9sxzmPV6FcxWtzJeLQihOuOA4IB8i
 Ddhq2OYH+HCodTNOLWNyMSjk65O7Whee1O/YGiVW9+iUbKBUSBFatJ9fJjH4bXMT
 VRZZnhlFGGIMuZXhkL5+a4ZnomqfPRXNGJ/QBbB4Ty7CXr84mXb0SX5gW4qsLOBA
 YEuxiqD8Oej2Mrid4ypF5GmwBYLAf4CCZajGVii0yyz2hp69RvlQ0c5lA0WisQu3
 wvY2HDFGBsa+
 =PfWP
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can, xfrm, bluetooth and netfilter.

  Current release - regressions:

   - ipv6: fix RCU splat in ipv6_route_seq_show()

   - wifi: iwlwifi: disable RFI feature

  Previous releases - regressions:

   - tcp: fix possible sk_priority leak in tcp_v4_send_reset()

   - tipc: do not update mtu if msg_max is too small in mtu negotiation

   - netfilter: fix null deref on element insertion

   - devlink: change per-devlink netdev notifier to static one

   - phylink: fix ksettings_set() ethtool call

   - wifi: mac80211: fortify the spinlock against deadlock by interrupt

   - wifi: brcmfmac: check for probe() id argument being NULL

   - eth: ice:
      - fix undersized tx_flags variable
      - fix ice VF reset during iavf initialization

   - eth: hns3: fix sending pfc frames after reset issue

  Previous releases - always broken:

   - xfrm: release all offloaded policy memory

   - nsh: use correct mac_offset to unwind gso skb in nsh_gso_segment()

   - vsock: avoid to close connected socket after the timeout

   - dsa: rzn1-a5psw: enable management frames for CPU port

   - eth: virtio_net: fix error unwinding of XDP initialization

   - eth: tun: fix memory leak for detached NAPI queue.

  Misc:

   - MAINTAINERS: sctp: move Neil to CREDITS"

* tag 'net-6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
  MAINTAINERS: skip CCing netdev for Bluetooth patches
  mdio_bus: unhide mdio_bus_init prototype
  bridge: always declare tunnel functions
  atm: hide unused procfs functions
  net: isa: include net/Space.h
  Revert "ARM: dts: stm32: add CAN support on stm32f746"
  netfilter: nft_set_rbtree: fix null deref on element insertion
  netfilter: nf_tables: fix nft_trans type confusion
  netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT
  net: wwan: t7xx: Ensure init is completed before system sleep
  net: selftests: Fix optstring
  net: pcs: xpcs: fix C73 AN not getting enabled
  net: wwan: iosm: fix NULL pointer dereference when removing device
  vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit()
  mailmap: add entries for Nikolay Aleksandrov
  igb: fix bit_shift to be in [1..8] range
  net: dsa: mv88e6xxx: Fix mv88e6393x EPC write command offset
  cassini: Fix a memory leak in the error handling path of cas_init_one()
  tun: Fix memory leak for detached NAPI queue.
  can: kvaser_pciefd: Disable interrupts in probe error path
  ...
2023-05-18 08:52:14 -07:00
Pablo Neira Ayuso e05b536216 netfilter: flowtable: split IPv6 datapath in helper functions
Add context structure and helper functions to look up for a matching
IPv6 entry in the flowtable and to forward packets.

No functional changes are intended.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:55 +02:00
Pablo Neira Ayuso a10fa0b489 netfilter: flowtable: split IPv4 datapath in helper functions
Add context structure and helper functions to look up for a matching
IPv4 entry in the flowtable and to forward packets.

No functional changes are intended.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:55 +02:00
Pablo Neira Ayuso fa502c8656 netfilter: flowtable: simplify route logic
Grab reference to dst from skbuff earlier to simplify route caching.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:55 +02:00
Faicker Mo d671fd82ea netfilter: conntrack: allow insertion clash of gre protocol
NVGRE tunnel is used in the VM-to-VM communications. The VM packets
are encapsulated in NVGRE and sent from the host. For NVGRE
there are two tuples(outer sip and outer dip) in the host conntrack item.
Insertion clashes are more likely to happen if the concurrent connections
are sent from the VM.

Signed-off-by: Faicker Mo <faicker.mo@ucloud.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:55 +02:00
Christophe JAILLET a2a0ffb084 netfilter: nft_set_pipapo: Use struct_size()
Use struct_size() instead of hand writing it.
This is less verbose and more informative.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:55 +02:00
Jeremy Sowden b9f9a485fb netfilter: nft_exthdr: add boolean DCCP option matching
The xt_dccp iptables module supports the matching of DCCP packets based
on the presence or absence of DCCP options.  Extend nft_exthdr to add
this functionality to nftables.

Link: https://bugzilla.netfilter.org/show_bug.cgi?id=930
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:54 +02:00
Florian Westphal d4b7f29eb8 netfilter: nf_tables: always increment set element count
At this time, set->nelems counter only increments when the set has
a maximum size.

All set elements decrement the counter unconditionally, this is
confusing.

Increment the counter unconditionally to make this symmetrical.
This would also allow changing the set maximum size after set creation
in a later patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:54 +02:00
Florian Westphal a4878eeae3 netfilter: nf_tables: relax set/map validation checks
Its currently not allowed to perform queries on a map, for example:

table t {
	map m {
		typeof ip saddr : meta mark
		..

	chain c {
		ip saddr @m counter

will fail, because kernel requires that userspace provides a destination
register when the referenced set is a map.

However, internally there is no real distinction between sets and maps,
maps are just sets where each key is associated with a value.

Relax this so that maps can be used just like sets.

This allows to have rules that query if a given key exists
without making use of the associated value.

This also permits != checks which don't work for map lookups.

When no destination reg is given for a map, then permit this for named
maps.

Data and dump paths need to be updated to consider priv->dreg_set
instead of the 'set-is-a-map' check.

Checks in reduce and validate callbacks are not changed, this
can be relaxed later if a need arises.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-18 08:48:54 +02:00
Jakub Kicinski 30a0f49d54 netfilter pull request
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmRkx4QNHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2ADZEEADalO7AV8lFYHn9AXAq3hUsLf/praxJ6uD4wRtT
 y21bb7wyQErNy4LTOIQy4Xm1RUsoxflZIuO/EL9HMX+AHMX+Iw4YgfxmN+sKPO1t
 U19ZHc4XlSnbCBrxySF9dS1CiW5HuEOa2Sh+v+LCA9M95+IJPhXtnOXTI8J6gRyP
 vO7+0qhbCVWk044nSPqtXOSx8apQylw1y9TSaSq8Qe8wWQRsWcLbykuQNyuslYrJ
 i5jQ6dOT7wlJ+WUQGueZ1oPju0ipIBcNFoxVs44evLclWl8BdDHvpySrcQvuOFb7
 fcacIR+I4grZf2pyPyuOaiSDFeFCmKw+qJw1HDfPj9qshyw+pYG7kRukfbDZbyxD
 Ck2rMcstsQUwrNmoschZhWMvtUGCovZQpIJ2sJOWAut5VYP6B1TMfmZKrVillTRN
 6dWdStJB527ifZt7l1ssT3+oTz9SwP2atUmpyYKE9WOoaOOghr5C8Hfo2ruNKmmO
 S2k0vigtKk9fJ0j9sxcEF3GYszirb+cZVjeQgIv6K0mXr3IKKu6MhDPlbS8KHaFt
 stPxe1daRcfTPpymCiFvO6fWHiLn2tP+iVNjI/vPv0nzD2yB/vt4PE8ka8pYxoNx
 VKDT/y/Cb9Q4F/6PiWNaTViEhyluHNidKhhDm5MDoFPdYrhyHsGiijgsfM7B+4pe
 EbLbrg==
 =0dU1
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-05-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Florian Westphal says:

====================
Netfilter fixes for net

1. Silence warning about unused variable when CONFIG_NF_NAT=n, from Tom Rix.
2. nftables: Fix possible out-of-bounds access, from myself.
3. nftables: fix null deref+UAF during element insertion into rbtree,
   also from myself.

* tag 'nf-23-05-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_set_rbtree: fix null deref on element insertion
  netfilter: nf_tables: fix nft_trans type confusion
  netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT
====================

Link: https://lore.kernel.org/r/20230517123756.7353-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17 21:43:20 -07:00
Jakub Kicinski c259ad1169 wireless fixes for v6.4
A lot of fixes this time, for both the stack and the drivers. The
 brcmfmac resume fix has been reported by several people so I would say
 it's the most important here. The iwlwifi RFI workaround is also
 something which was reported as a regression recently.
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmRk8CoRHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZtoQwf/br5+tn0nIRgxeeJh8OuScUKEX+JOccQl
 3YZE9FXrzSlNgBYl4BAu+pTaqbqrapTmkMHUQohSilozB6IKUYJsHt0d6uQf7gKw
 CX+gwEuAIw0uZnLb/JEWNcZWTRX6rpwIC1yzS9mIl8Q1iTqZRAVnGDPxhQj/RB3R
 zs2eNOxo+TkEJZCyShtarbWzF8IbGU4hICeSuxfbuomT6sjB1rRKlLXAnM1Wns0j
 e4fTeMNXKwhXT9cgzPDAPsJNUKChRI4Q5H2rgrQKdZpXox4H2eGZhxktEDwkaWP5
 M69Pbl1/XS5xUTGrrMR2SILJUXClJi4WLv1HbpC6+JJmUUQo1BpyUw==
 =i2s7
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2023-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Kalle Valo says:

====================
wireless fixes for v6.4

A lot of fixes this time, for both the stack and the drivers. The
brcmfmac resume fix has been reported by several people so I would say
it's the most important here. The iwlwifi RFI workaround is also
something which was reported as a regression recently.

* tag 'wireless-2023-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (31 commits)
  wifi: b43: fix incorrect __packed annotation
  wifi: rtw88: sdio: Always use two consecutive bytes for word operations
  mac80211_hwsim: fix memory leak in hwsim_new_radio_nl
  wifi: iwlwifi: mvm: Add locking to the rate read flow
  wifi: iwlwifi: Don't use valid_links to iterate sta links
  wifi: iwlwifi: mvm: don't trust firmware n_channels
  wifi: iwlwifi: mvm: fix OEM's name in the tas approved list
  wifi: iwlwifi: fix OEM's name in the ppag approved list
  wifi: iwlwifi: mvm: fix initialization of a return value
  wifi: iwlwifi: mvm: fix access to fw_id_to_mac_id
  wifi: iwlwifi: fw: fix DBGI dump
  wifi: iwlwifi: mvm: fix number of concurrent link checks
  wifi: iwlwifi: mvm: fix cancel_delayed_work_sync() deadlock
  wifi: iwlwifi: mvm: don't double-init spinlock
  wifi: iwlwifi: mvm: always free dup_data
  wifi: mac80211: recalc chanctx mindef before assigning
  wifi: mac80211: consider reserved chanctx for mindef
  wifi: mac80211: simplify chanctx allocation
  wifi: mac80211: Abort running color change when stopping the AP
  wifi: mac80211: fix min center freq offset tracing
  ...
====================

Link: https://lore.kernel.org/r/20230517151914.B0AF6C433EF@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17 21:36:21 -07:00
Arnd Bergmann 89dcd87ce5 bridge: always declare tunnel functions
When CONFIG_BRIDGE_VLAN_FILTERING is disabled, two functions are still
defined but have no prototype or caller. This causes a W=1 warning for
the missing prototypes:

net/bridge/br_netlink_tunnel.c:29:6: error: no previous prototype for 'vlan_tunid_inrange' [-Werror=missing-prototypes]
net/bridge/br_netlink_tunnel.c:199:5: error: no previous prototype for 'br_vlan_tunnel_info' [-Werror=missing-prototypes]

The functions are already contitional on CONFIG_BRIDGE_VLAN_FILTERING,
and I coulnd't easily figure out the right set of #ifdefs, so just
move the declarations out of the #ifdef to avoid the warning,
at a small cost in code size over a more elaborate fix.

Fixes: 188c67dd19 ("net: bridge: vlan options: add support for tunnel id dumping")
Fixes: 569da08228 ("net: bridge: vlan options: add support for tunnel mapping set/del")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20230516194625.549249-3-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17 21:28:58 -07:00
Arnd Bergmann fb1b7be9b1 atm: hide unused procfs functions
When CONFIG_PROC_FS is disabled, the function declarations for some
procfs functions are hidden, but the definitions are still build,
as shown by this compiler warning:

net/atm/resources.c:403:7: error: no previous prototype for 'atm_dev_seq_start' [-Werror=missing-prototypes]
net/atm/resources.c:409:6: error: no previous prototype for 'atm_dev_seq_stop' [-Werror=missing-prototypes]
net/atm/resources.c:414:7: error: no previous prototype for 'atm_dev_seq_next' [-Werror=missing-prototypes]

Add another #ifdef to leave these out of the build.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230516194625.549249-2-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17 21:27:30 -07:00
Christophe JAILLET b50a8b0d57 net: openvswitch: Use struct_size()
Use struct_size() instead of hand writing it.
This is less verbose and more informative.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/e7746fbbd62371d286081d5266e88bbe8d3fe9f0.1683388991.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17 21:25:46 -07:00
Linus Torvalds 1b66c114d1 nfsd-6.4 fixes:
- A collection of minor bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRiQVEACgkQM2qzM29m
 f5eYfBAAg5Qz45PL+fo1qWxkJ1ZKaNV1vPdi4tqCt9NEItDTjAnjj0am+rKNGZAz
 EOM2yFt4xaZGyMYgXe4VnYl0N+rSpbI/H+Rk/wOq4OHPURQD5EO9VeP86qZ7rmGl
 ECPqb39TFTwAiRomC/DHO4eNpoe1rQuXu0tW9+GmqDGxeuh8xdxTk33g17ZXwCFN
 tdPkkPjxVPdWd8X7HQg9kWm8AWfV+GyuzE2rKAoOjbs6Wv6d9GCY8Cb5HXkRsQhF
 4Zh0PVQuTuXurZwtPXwnS0k4kfvQwjlTIKHlXuo0ZLh+SuFbrWHzv0fVyD+kUpSK
 HtWbJ8JcruUvz0WGMtZatzRLHCZLDguV6oVXPp7rtmuxTj4szzHSFpEeAV901sIm
 Nkvuomvd02K/fiTo7s3yr6t1VG2vju9LDwhBe197iA3leHAlockfbbxE3NJMGbzQ
 NoOPd+lu95cfsanOM1LZZLNfbLrZofoSLK9K1+HD0yAVdyq7u47FyHRrymvCaMrj
 GiheuqrBfBMEq+2mCwUn37aM0FblYEXQM0xTVXPQcHtBBN/nGZxPJukmpr7ScNlR
 aqMtDoOLu4OEFuo6fe2/94eNi+N5XAZgWmx/mSyaytE8Xw9LJxeQ83UTigaGcYKc
 3YIuG1YXg9IyIoIdLkghB+Aj/6fivsGFK9Gud6g7I3xw4f15noA=
 =PiNG
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - A collection of minor bug fixes

* tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Remove open coding of string copy
  SUNRPC: Fix trace_svc_register() call site
  SUNRPC: always free ctxt when freeing deferred request
  SUNRPC: double free xprt_ctxt while still in use
  SUNRPC: Fix error handling in svc_setup_socket()
  SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
  lockd: define nlm_port_min,max with CONFIG_SYSCTL
  nfsd: define exports_proc_ops with CONFIG_PROC_FS
  SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-17 09:56:01 -07:00
Florian Westphal 61ae320a29 netfilter: nft_set_rbtree: fix null deref on element insertion
There is no guarantee that rb_prev() will not return NULL in nft_rbtree_gc_elem():

general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
 nft_add_set_elem+0x14b0/0x2990
  nf_tables_newsetelem+0x528/0xb30

Furthermore, there is a possible use-after-free while iterating,
'node' can be free'd so we need to cache the next value to use.

Fixes: c9e6978e27 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-17 14:18:28 +02:00
Florian Westphal e3c361b8ac netfilter: nf_tables: fix nft_trans type confusion
nft_trans_FOO objects all share a common nft_trans base structure, but
trailing fields depend on the real object size. Access is only safe after
trans->msg_type check.

Check for rule type first.  Found by code inspection.

Fixes: 1a94e38d25 ("netfilter: nf_tables: add NFTA_RULE_ID attribute")
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-17 14:16:02 +02:00
Tom Rix 224a876e37 netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT
gcc with W=1 and ! CONFIG_NF_NAT
net/netfilter/nf_conntrack_netlink.c:3463:32: error:
  ‘exp_nat_nla_policy’ defined but not used [-Werror=unused-const-variable=]
 3463 | static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = {
      |                                ^~~~~~~~~~~~~~~~~~
net/netfilter/nf_conntrack_netlink.c:2979:33: error:
  ‘any_addr’ defined but not used [-Werror=unused-const-variable=]
 2979 | static const union nf_inet_addr any_addr;
      |                                 ^~~~~~~~

These variables use is controlled by CONFIG_NF_NAT, so should their definitions.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-17 14:15:57 +02:00
Eric Dumazet dacab578c7 vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit()
syzbot triggered the following splat [1], sending an empty message
through pppoe_sendmsg().

When VLAN_FLAG_REORDER_HDR flag is set, vlan_dev_hard_header()
does not push extra bytes for the VLAN header, because vlan is offloaded.

Unfortunately vlan_dev_hard_start_xmit() first reads veth->h_vlan_proto
before testing (vlan->flags & VLAN_FLAG_REORDER_HDR).

We need to swap the two conditions.

[1]
BUG: KMSAN: uninit-value in vlan_dev_hard_start_xmit+0x171/0x7f0 net/8021q/vlan_dev.c:111
vlan_dev_hard_start_xmit+0x171/0x7f0 net/8021q/vlan_dev.c:111
__netdev_start_xmit include/linux/netdevice.h:4883 [inline]
netdev_start_xmit include/linux/netdevice.h:4897 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x253/0xa20 net/core/dev.c:3596
__dev_queue_xmit+0x3c7f/0x5ac0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3053 [inline]
pppoe_sendmsg+0xa93/0xb80 drivers/net/ppp/pppoe.c:900
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0xa24/0xe40 net/socket.c:2501
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2555
__sys_sendmmsg+0x411/0xa50 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Uninit was created at:
slab_post_alloc_hook+0x12d/0xb60 mm/slab.h:774
slab_alloc_node mm/slub.c:3452 [inline]
kmem_cache_alloc_node+0x543/0xab0 mm/slub.c:3497
kmalloc_reserve+0x148/0x470 net/core/skbuff.c:520
__alloc_skb+0x3a7/0x850 net/core/skbuff.c:606
alloc_skb include/linux/skbuff.h:1277 [inline]
sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2583
pppoe_sendmsg+0x3af/0xb80 drivers/net/ppp/pppoe.c:867
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0xa24/0xe40 net/socket.c:2501
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2555
__sys_sendmmsg+0x411/0xa50 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

CPU: 0 PID: 29770 Comm: syz-executor.0 Not tainted 6.3.0-rc6-syzkaller-gc478e5b17829 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-17 12:55:39 +01:00
Yuya Tajima fa0583c202 seg6: Cleanup duplicates of skb_dst_drop calls
In processing IPv6 segment routing header (SRH), several functions call
skb_dst_drop before ip6_route_input. However, ip6_route_input calls
skb_dst_drop within it, so there is no need to call skb_dst_drop in advance.

Signed-off-by: Yuya Tajima <yuya.tajimaa@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-17 09:05:47 +01:00
Pavel Begunkov a753358472 net/tcp: optimise io_uring zc ubuf refcounting
io_uring keeps a reference to ubuf_info during submission, so if
tcp_sendmsg_locked() sees msghdr::msg_ubuf in can be sure the buffer
will be kept alive and doesn't need to additionally pin it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-17 08:37:55 +01:00
Pavel Begunkov eea96a3e2c net/tcp: don't peek at tail for io_uring zc
Move tcp_write_queue_tail() to SOCK_ZEROCOPY specific flag as zerocopy
setup for msghdr->ubuf_info doesn't need to peek into the last request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-17 08:37:54 +01:00
Jiri Olsa 65eb006d85 bpf: Move kernel test kfuncs to bpf_testmod
Moving kernel test kfuncs into bpf_testmod kernel module, and adding
necessary init calls and BTF IDs records.

We need to keep following structs in kernel:
  struct prog_test_ref_kfunc
  struct prog_test_member (embedded in prog_test_ref_kfunc)

The reason is because they need to be marked as rcu safe (check test
prog mark_ref_as_untrusted_or_null) and such objects are being required
to be defined only in kernel at the moment (see rcu_safe_kptr check
in kernel).

We need to keep also dtor functions for both objects in kernel:
  bpf_kfunc_call_test_release
  bpf_kfunc_call_memb_release

We also keep the copy of these struct in bpf_testmod_kfunc.h, because
other test functions use them. This is unfortunate, but this is just
temporary solution until we are able to these structs them to bpf_testmod
completely.

As suggested by David adding bpf_testmod.ko make dependency for
bpf programs, so they are rebuilt if we change the bpf_testmod.ko
module.

Also adding missing __bpf_kfunc to bpf_kfunc_call_test4 functions.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230515133756.1658301-11-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-16 22:09:24 -07:00
Jakub Kicinski 6ad85ed0eb ipsec-2023-05-16
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmRjDy4ACgkQrB3Eaf9P
 W7fTYQ/+J1Vic3YQI3xzrpeDWQInX0t/0vsNx0aTkoYZ+jP7F+0ybjb7TK+8yfmb
 N0Ov0l1NM3LCXlTaRRMAMO3KBtPnqQsHYYJpin1zSRr4zI/KE6R+u+DExIHB2p/z
 ArgNzahC2r8zKulJkrcGVm3LniQbv5NcSjAgAj/Y85flknQMghpOX5Y3qamtoc5s
 y9H2kcBxeAdAR+z8mphpD5lDmk6idQQP3VUg+YB9OmvKmrVT2FSO4p1KK+nxf3+5
 9zxwaFLyEqY7IZsify5rRy2p9OXNReHt9lgDEmVdF4SRV49nLE7EMqa5cYxC1bfx
 TUcUqc8pZ0yNsqkkTqs33y45m/ogU+Uf3lF/uIpdZZBYyLmhTkw39hupvK31UfWv
 lDRQ7hj9pcUSTNBHadjnPsFmytBE3mKDyya1BWvNbWWSKpwxLDww88vAXmDqE1H0
 SiElDIdojf4O8xQ31nAtNXIVDqcZfLROcEFHq6yqxd4i0HIe1wqeg/s+leibDkYG
 jA1wf98BPjuyX0iXxVLsGMATyXyeyZTDgEbiRIXnzDzrgWODeRmm2nYBEHLuyQTD
 MQEiMjoEtJJ5RBKC6pvsSewH4Zjjx80ahYAIBQmTO4Yt5fAgniz0fDRYvzlMTtK+
 ZlwN2q+s+BB3DxA9VXp9ajEib60BNmykZi32GetlQtQLM+cvDlY=
 =/LPA
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2023-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2023-05-16

1) Don't check the policy default if we have an allow
   policy. Fix from Sabrina Dubroca.

2) Fix netdevice refount usage on offload.
   From Leon Romanovsky.

3) Use netdev_put instead of dev_puti to correctly release
   the netdev on failure in xfrm_dev_policy_add.
   From Leon Romanovsky.

4) Revert "Fix XFRM-I support for nested ESP tunnels"
   This broke Netfilter policy matching.
   From Martin Willi.

5) Reject optional tunnel/BEET mode templates in outbound policies
   on netlink and pfkey sockets. From Tobias Brunner.

6) Check if_id in inbound policy/secpath match to make
   it symetric to the outbound codepath.
   From Benedict Wong.

* tag 'ipsec-2023-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Check if_id in inbound policy/secpath match
  af_key: Reject optional tunnel/BEET mode templates in outbound policies
  xfrm: Reject optional tunnel/BEET mode templates in outbound policies
  Revert "Fix XFRM-I support for nested ESP tunnels"
  xfrm: Fix leak of dev tracker
  xfrm: release all offloaded policy memory
  xfrm: don't check the default policy if the policy allows the packet
====================

Link: https://lore.kernel.org/r/20230516052405.2677554-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 20:52:35 -07:00
Jakub Kicinski 47d55c62bd linux-can-fixes-for-6.4-20230515
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEDs2BvajyNKlf9TJQvlAcSiqKBOgFAmRiloITHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRC+UBxKKooE6KQjB/920q1esTr4MiOeVHShwj6OGqfjcWlf
 PvPzoMbMy2KuRgPF08HL3PO/fnrttSkI+qn1jJFD+IE+0QERbFf3adTOns+iiM2d
 li9kLQgFf6a6ne2lRFwGXsxIuABzGBpq4LO4TDl7CRx2U0FZgETVV/8ImqAaxafp
 ryS5ko3gghHmxAg96RjaPEMhZjaYpDqpY+AR6lD445CpzhGs5nhO/WKyJRm5wgue
 PlCUVjPyE9Wyf11e0MDkmFEAV4nR8qHIm0TnVz1h9Z3oOPWuvoOHJiXijrgjpKP8
 kfhQcBE8HXi3euOH/RUSr54euk37sLk0UpOdkrq7mp2Lu1pb/IfNznrg
 =hpEt
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-6.4-20230515' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2023-05-15

The first 2 patches are by Oliver Hartkopp and allow the
MSG_CMSG_COMPAT flag for isotp and j1939.

The next patch is by Oliver Hartkopp, too and adds missing CAN XL
support in can_put_echo_skb().

Geert Uytterhoeven's patch let's the bxcan driver depend on
ARCH_STM32.

The last 5 patches are from Dario Binacchi and also affect the bxcan
driver. The bxcan driver hit mainline with v6.4-rc1 and was originally
written for IP cores containing 2 CAN interfaces with shared
resources. Dario's series updates the DT bindings and driver to
support IP cores with a single CAN interface instance as well as
adding the bxcan to the stm32f746's device tree.

* tag 'linux-can-fixes-for-6.4-20230515' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  ARM: dts: stm32: add CAN support on stm32f746
  can: bxcan: add support for single peripheral configuration
  ARM: dts: stm32: add pin map for CAN controller on stm32f7
  ARM: dts: stm32f429: put can2 in secondary mode
  dt-bindings: net: can: add "st,can-secondary" property
  can: CAN_BXCAN should depend on ARCH_STM32
  can: dev: fix missing CAN XL support in can_put_echo_skb()
  can: j1939: recvmsg(): allow MSG_CMSG_COMPAT flag
  can: isotp: recvmsg(): allow MSG_CMSG_COMPAT flag
====================

Link: https://lore.kernel.org/r/20230515204722.1000957-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 20:49:42 -07:00
Jakub Kicinski e7480a44d7 Revert "net: Remove low_thresh in ip defrag"
This reverts commit b2cbac9b9b.

We have multiple reports of obvious breakage from this patch.

Reported-by: Ido Schimmel <idosch@idosch.org>
Link: https://lore.kernel.org/all/ZGIRWjNcfqI8yY8W@shredder/
Link: https://lore.kernel.org/all/CADJHv_sDK=0RrMA2FTZQV5fw7UQ+qY=HG21Wu5qb0V9vvx5w6A@mail.gmail.com/
Reported-by: syzbot+a5e719ac7c268e414c95@syzkaller.appspotmail.com
Reported-by: syzbot+a03fd670838d927d9cd8@syzkaller.appspotmail.com
Fixes: b2cbac9b9b ("net: Remove low_thresh in ip defrag")
Link: https://lore.kernel.org/r/20230517034112.1261835-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 20:46:30 -07:00
Ido Schimmel d6352dae09 devlink: Fix crash with CONFIG_NET_NS=n
'__net_initdata' becomes a no-op with CONFIG_NET_NS=y, but when this
option is disabled it becomes '__initdata', which means the data can be
freed after the initialization phase. This annotation is obviously
incorrect for the devlink net device notifier block which is still
registered after the initialization phase [1].

Fix this crash by removing the '__net_initdata' annotation.

[1]
general protection fault, probably for non-canonical address 0xcccccccccccccccc: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 117 Comm: (udev-worker) Not tainted 6.4.0-rc1-custom-gdf0acdc59b09 #64
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
RIP: 0010:notifier_call_chain+0x58/0xc0
[...]
Call Trace:
 <TASK>
 dev_set_mac_address+0x85/0x120
 dev_set_mac_address_user+0x30/0x50
 do_setlink+0x219/0x1270
 rtnl_setlink+0xf7/0x1a0
 rtnetlink_rcv_msg+0x142/0x390
 netlink_rcv_skb+0x58/0x100
 netlink_unicast+0x188/0x270
 netlink_sendmsg+0x214/0x470
 __sys_sendto+0x12f/0x1a0
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x38/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: e93c9378e3 ("devlink: change per-devlink netdev notifier to static one")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/netdev/600ddf9e-589a-2aa0-7b69-a438f833ca10@samsung.com/
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230515162925.1144416-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 19:57:52 -07:00
Jakub Kicinski a0e35a648f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZGKqEAAKCRDbK58LschI
 g6LYAQDp1jAszCOkmJ8VUA0ZyC5NAFDv+7y9Nd1toYWYX1btzAEAkf8+5qBJ1qmI
 P5M0hjMTbH4MID9Aql10ZbMHheyOBAo=
 =NUQM
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-16

We've added 57 non-merge commits during the last 19 day(s) which contain
a total of 63 files changed, 3293 insertions(+), 690 deletions(-).

The main changes are:

1) Add precision propagation to verifier for subprogs and callbacks,
   from Andrii Nakryiko.

2) Improve BPF's {g,s}setsockopt() handling with wrong option lengths,
   from Stanislav Fomichev.

3) Utilize pahole v1.25 for the kernel's BTF generation to filter out
   inconsistent function prototypes, from Alan Maguire.

4) Various dyn-pointer verifier improvements to relax restrictions,
   from Daniel Rosenberg.

5) Add a new bpf_task_under_cgroup() kfunc for designated task,
   from Feng Zhou.

6) Unblock tests for arm64 BPF CI after ftrace supporting direct call,
   from Florent Revest.

7) Add XDP hint kfunc metadata for RX hash/timestamp for igc,
   from Jesper Dangaard Brouer.

8) Add several new dyn-pointer kfuncs to ease their usability,
   from Joanne Koong.

9) Add in-depth LRU internals description and dot function graph,
   from Joe Stringer.

10) Fix KCSAN report on bpf_lru_list when accessing node->ref,
    from Martin KaFai Lau.

11) Only dump unprivileged_bpf_disabled log warning upon write,
    from Kui-Feng Lee.

12) Extend test_progs to directly passing allow/denylist file,
    from Stephen Veiss.

13) Fix BPF trampoline memleak upon failure attaching to fentry,
    from Yafang Shao.

14) Fix emitting struct bpf_tcp_sock type in vmlinux BTF,
    from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (57 commits)
  bpf: Fix memleak due to fentry attach failure
  bpf: Remove bpf trampoline selector
  bpf, arm64: Support struct arguments in the BPF trampoline
  bpftool: JIT limited misreported as negative value on aarch64
  bpf: fix calculation of subseq_idx during precision backtracking
  bpf: Remove anonymous union in bpf_kfunc_call_arg_meta
  bpf: Document EFAULT changes for sockopt
  selftests/bpf: Correctly handle optlen > 4096
  selftests/bpf: Update EFAULT {g,s}etsockopt selftests
  bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
  libbpf: fix offsetof() and container_of() to work with CO-RE
  bpf: Address KCSAN report on bpf_lru_list
  bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25
  selftests/bpf: Accept mem from dynptr in helper funcs
  bpf: verifier: Accept dynptr mem as mem in helpers
  selftests/bpf: Check overflow in optional buffer
  selftests/bpf: Test allowing NULL buffer in dynptr slice
  bpf: Allow NULL buffers in bpf_dynptr_slice(_rw)
  selftests/bpf: Add testcase for bpf_task_under_cgroup
  bpf: Add bpf_task_under_cgroup() kfunc
  ...
====================

Link: https://lore.kernel.org/r/20230515225603.27027-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 19:50:05 -07:00
Johannes Berg 04312de4ce wifi: mac80211: recalc chanctx mindef before assigning
When we allocate a new channel context, or find an existing one
that is compatible, we currently assign it to a link before its
mindef is updated. This leads to strange situations, especially
in link switching where you switch to an 80 MHz link and expect
it to be active immediately, but the mindef is still configured
to 20 MHz while assigning.  Also, it's strange that the chandef
passed to the assign method's argument is wider than the one in
the context.

Fix this by calculating the mindef with the new link considered
before calling the driver.

In particular, this fixes an iwlwifi problem during link switch
where the firmware would assert because the (link) station that
was added for the AP is configured to transmit at a bandwidth
that's wider than the channel context that it's configured on.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230504134511.828474-5-gregory.greenman@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:26:00 -07:00
Johannes Berg b72a455a24 wifi: mac80211: consider reserved chanctx for mindef
When a chanctx is reserved for a new vif and we recalculate
the minimal definition for it, we need to consider the new
interface it's being reserved for before we assign it, so it
can be used directly with the correct min channel width.

Fix the code to - optionally - consider that, and use that
option just before doing the reassignment.

Also, when considering channel context reservations, we
should only consider the one link we're currently working with.
Change the boolean argument to a link pointer to do that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230504134511.828474-4-gregory.greenman@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:25:43 -07:00
Johannes Berg 860e1b43da wifi: mac80211: simplify chanctx allocation
There's no need to call ieee80211_recalc_chanctx_min_def()
since it cannot and won't call the driver anyway; just use
_ieee80211_recalc_chanctx_min_def() instead.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230504134511.828474-3-gregory.greenman@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:25:31 -07:00
Michael Lee a23d7f5b2f wifi: mac80211: Abort running color change when stopping the AP
When stopping the AP, there might be a color change in progress. It
should be deactivated here, or the driver might later finalize a color
change on a stopped AP.

Fixes: 5f9404abdf (mac80211: add support for BSS color change)
Signed-off-by: Michael Lee <michael-cy.lee@mediatek.com>
Link: https://lore.kernel.org/r/20230504080441.22958-1-michael-cy.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:18:51 -07:00
Johannes Berg 248e477651 wifi: mac80211: fix min center freq offset tracing
We need to set the correct trace variable, otherwise we're
overwriting something else instead and the right one that
we print later is not initialized.

Fixes: b6011960f3 ("mac80211: handle channel frequency offset")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230504134511.828474-2-gregory.greenman@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:18:06 -07:00
Christophe JAILLET 13ad2b1eea wifi: mac80211: Fix puncturing bitmap handling in __ieee80211_csa_finalize()
'changed' can be OR'ed with BSS_CHANGED_EHT_PUNCTURING which is larger than
an u32.
So, turn 'changed' into an u64 and update ieee80211_set_after_csa_beacon()
accordingly.

In the commit in Fixes, only ieee80211_start_ap() was updated.

Fixes: 2cc25e4b2a ("wifi: mac80211: configure puncturing bitmap")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/e84a3f80fe536787f7a2c7180507efc36cd14f95.1682358088.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:15:01 -07:00
Mirsad Goran Todorovac ef6e1997da wifi: mac80211: fortify the spinlock against deadlock by interrupt
In the function ieee80211_tx_dequeue() there is a particular locking
sequence:

begin:
	spin_lock(&local->queue_stop_reason_lock);
	q_stopped = local->queue_stop_reasons[q];
	spin_unlock(&local->queue_stop_reason_lock);

However small the chance (increased by ftracetest), an asynchronous
interrupt can occur in between of spin_lock() and spin_unlock(),
and the interrupt routine will attempt to lock the same
&local->queue_stop_reason_lock again.

This will cause a costly reset of the CPU and the wifi device or an
altogether hang in the single CPU and single core scenario.

The only remaining spin_lock(&local->queue_stop_reason_lock) that
did not disable interrupts was patched, which should prevent any
deadlocks on the same CPU/core and the same wifi device.

This is the probable trace of the deadlock:

kernel: ================================
kernel: WARNING: inconsistent lock state
kernel: 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 Tainted: G        W
kernel: --------------------------------
kernel: inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kernel: kworker/5:0/25656 [HC0[0]:SC0[0]:HE1:SE1] takes:
kernel: ffff9d6190779478 (&local->queue_stop_reason_lock){+.?.}-{2:2}, at: return_to_handler+0x0/0x40
kernel: {IN-SOFTIRQ-W} state was registered at:
kernel:   lock_acquire+0xc7/0x2d0
kernel:   _raw_spin_lock+0x36/0x50
kernel:   ieee80211_tx_dequeue+0xb4/0x1330 [mac80211]
kernel:   iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm]
kernel:   iwl_mvm_mac_wake_tx_queue+0x2d/0xd0 [iwlmvm]
kernel:   ieee80211_queue_skb+0x450/0x730 [mac80211]
kernel:   __ieee80211_xmit_fast.constprop.66+0x834/0xa50 [mac80211]
kernel:   __ieee80211_subif_start_xmit+0x217/0x530 [mac80211]
kernel:   ieee80211_subif_start_xmit+0x60/0x580 [mac80211]
kernel:   dev_hard_start_xmit+0xb5/0x260
kernel:   __dev_queue_xmit+0xdbe/0x1200
kernel:   neigh_resolve_output+0x166/0x260
kernel:   ip_finish_output2+0x216/0xb80
kernel:   __ip_finish_output+0x2a4/0x4d0
kernel:   ip_finish_output+0x2d/0xd0
kernel:   ip_output+0x82/0x2b0
kernel:   ip_local_out+0xec/0x110
kernel:   igmpv3_sendpack+0x5c/0x90
kernel:   igmp_ifc_timer_expire+0x26e/0x4e0
kernel:   call_timer_fn+0xa5/0x230
kernel:   run_timer_softirq+0x27f/0x550
kernel:   __do_softirq+0xb4/0x3a4
kernel:   irq_exit_rcu+0x9b/0xc0
kernel:   sysvec_apic_timer_interrupt+0x80/0xa0
kernel:   asm_sysvec_apic_timer_interrupt+0x1f/0x30
kernel:   _raw_spin_unlock_irqrestore+0x3f/0x70
kernel:   free_to_partial_list+0x3d6/0x590
kernel:   __slab_free+0x1b7/0x310
kernel:   kmem_cache_free+0x52d/0x550
kernel:   putname+0x5d/0x70
kernel:   do_sys_openat2+0x1d7/0x310
kernel:   do_sys_open+0x51/0x80
kernel:   __x64_sys_openat+0x24/0x30
kernel:   do_syscall_64+0x5c/0x90
kernel:   entry_SYSCALL_64_after_hwframe+0x72/0xdc
kernel: irq event stamp: 5120729
kernel: hardirqs last  enabled at (5120729): [<ffffffff9d149936>] trace_graph_return+0xd6/0x120
kernel: hardirqs last disabled at (5120728): [<ffffffff9d149950>] trace_graph_return+0xf0/0x120
kernel: softirqs last  enabled at (5069900): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40
kernel: softirqs last disabled at (5067555): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40
kernel:
        other info that might help us debug this:
kernel:  Possible unsafe locking scenario:
kernel:        CPU0
kernel:        ----
kernel:   lock(&local->queue_stop_reason_lock);
kernel:   <Interrupt>
kernel:     lock(&local->queue_stop_reason_lock);
kernel:
         *** DEADLOCK ***
kernel: 8 locks held by kworker/5:0/25656:
kernel:  #0: ffff9d618009d138 ((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1ca/0x530
kernel:  #1: ffffb1ef4637fe68 ((work_completion)(&local->restart_work)){+.+.}-{0:0}, at: process_one_work+0x1ce/0x530
kernel:  #2: ffffffff9f166548 (rtnl_mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #3: ffff9d6190778728 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #4: ffff9d619077b480 (&mvm->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #5: ffff9d61907bacd8 (&trans_pcie->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #6: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_queue_state_change+0x59/0x3a0 [iwlmvm]
kernel:  #7: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm]
kernel:
        stack backtrace:
kernel: CPU: 5 PID: 25656 Comm: kworker/5:0 Tainted: G        W          6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4
kernel: Hardware name: LENOVO 82H8/LNVNB161216, BIOS GGCN51WW 11/16/2022
kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211]
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  dump_stack_lvl+0x5f/0xa0
kernel:  dump_stack+0x14/0x20
kernel:  print_usage_bug.part.46+0x208/0x2a0
kernel:  mark_lock.part.47+0x605/0x630
kernel:  ? sched_clock+0xd/0x20
kernel:  ? trace_clock_local+0x14/0x30
kernel:  ? __rb_reserve_next+0x5f/0x490
kernel:  ? _raw_spin_lock+0x1b/0x50
kernel:  __lock_acquire+0x464/0x1990
kernel:  ? mark_held_locks+0x4e/0x80
kernel:  lock_acquire+0xc7/0x2d0
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ? ftrace_return_to_handler+0x8b/0x100
kernel:  ? preempt_count_add+0x4/0x70
kernel:  _raw_spin_lock+0x36/0x50
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ieee80211_tx_dequeue+0xb4/0x1330 [mac80211]
kernel:  ? prepare_ftrace_return+0xc5/0x190
kernel:  ? ftrace_graph_func+0x16/0x20
kernel:  ? 0xffffffffc02ab0b1
kernel:  ? lock_acquire+0xc7/0x2d0
kernel:  ? iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm]
kernel:  ? ieee80211_tx_dequeue+0x9/0x1330 [mac80211]
kernel:  ? __rcu_read_lock+0x4/0x40
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_queue_state_change+0x311/0x3a0 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_wake_sw_queue+0x17/0x20 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_txq_gen2_unmap+0x1c9/0x1f0 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_txq_gen2_free+0x55/0x130 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_txq_gen2_tx_free+0x63/0x80 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  _iwl_trans_pcie_gen2_stop_device+0x3f3/0x5b0 [iwlwifi]
kernel:  ? _iwl_trans_pcie_gen2_stop_device+0x9/0x5b0 [iwlwifi]
kernel:  ? mutex_lock_nested+0x4/0x30
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_trans_pcie_gen2_stop_device+0x5f/0x90 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_stop_device+0x78/0xd0 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  __iwl_mvm_mac_start+0x114/0x210 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_mac_start+0x76/0x150 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  drv_start+0x79/0x180 [mac80211]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ieee80211_reconfig+0x1523/0x1ce0 [mac80211]
kernel:  ? synchronize_net+0x4/0x50
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ieee80211_restart_work+0x108/0x170 [mac80211]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  process_one_work+0x250/0x530
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  worker_thread+0x48/0x3a0
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0x10f/0x140
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x29/0x50
kernel:  </TASK>

Fixes: 4444bc2116 ("wifi: mac80211: Proper mark iTXQs for resumption")
Link: https://lore.kernel.org/all/1f58a0d1-d2b9-d851-73c3-93fcc607501c@alu.unizg.hr/
Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Cc: Gregory Greenman <gregory.greenman@intel.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/all/cdc80531-f25f-6f9d-b15f-25e16130b53a@alu.unizg.hr/
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: tag, or it goes automatically?
Link: https://lore.kernel.org/r/20230425164005.25272-1-mirsad.todorovac@alu.unizg.hr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:11:38 -07:00
Ilan Peer 1b6b4ed014 wifi: cfg80211: Drop entries with invalid BSSIDs in RNR
Ignore AP information for entries that include an invalid
BSSID in the TBTT information field, e.g., all zeros BSSIDs.

Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230424103224.5e65d04d1448.Ic10c8577ae4a85272c407106c9d0a2ecb5372743@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-05-16 10:09:50 -07:00
Oliver Hartkopp 1db080cbdb can: j1939: recvmsg(): allow MSG_CMSG_COMPAT flag
The control message provided by J1939 support MSG_CMSG_COMPAT but
blocked recvmsg() syscalls that have set this flag, i.e. on 32bit user
space on 64 bit kernels.

Link: https://github.com/hartkopp/can-isotp/issues/59
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Link: https://lore.kernel.org/20230505110308.81087-3-mkl@pengutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-05-15 22:24:46 +02:00
Oliver Hartkopp db2773d65b can: isotp: recvmsg(): allow MSG_CMSG_COMPAT flag
The control message provided by isotp support MSG_CMSG_COMPAT but
blocked recvmsg() syscalls that have set this flag, i.e. on 32bit user
space on 64 bit kernels.

Link: https://github.com/hartkopp/can-isotp/issues/59
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Fixes: 42bf50a179 ("can: isotp: support MSG_TRUNC flag when reading from socket")
Link: https://lore.kernel.org/20230505110308.81087-2-mkl@pengutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-05-15 22:24:46 +02:00
Krzysztof Kozlowski 0d9b41daa5 nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect()
If sock->service_name is NULL, the local variable
service_name_tlv_length will not be assigned by nfc_llcp_build_tlv(),
later leading to using value frmo the stack.  Smatch warning:

  net/nfc/llcp_commands.c:442 nfc_llcp_send_connect() error: uninitialized symbol 'service_name_tlv_length'.

Fixes: de9e5aeb4f ("NFC: llcp: Fix usage of llcp_add_tlv()")
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 13:03:34 +01:00
Xin Long 35a089b5d7 tipc: check the bearer min mtu properly when setting it by netlink
Checking the bearer min mtu with tipc_udp_mtu_bad() only works for
IPv4 UDP bearer, and IPv6 UDP bearer has a different value for the
min mtu. This patch checks with encap_hlen + TIPC_MIN_BEARER_MTU
for min mtu, which works for both IPv4 and IPv6 UDP bearer.

Note that tipc_udp_mtu_bad() is still used to check media min mtu
in __tipc_nl_media_set(), as m->mtu currently is only used by the
IPv4 UDP bearer as its default mtu value.

Fixes: 682cd3cf94 ("tipc: confgiure and apply UDP bearer MTU on running links")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 10:21:20 +01:00
Xin Long 56077b56cd tipc: do not update mtu if msg_max is too small in mtu negotiation
When doing link mtu negotiation, a malicious peer may send Activate msg
with a very small mtu, e.g. 4 in Shuang's testing, without checking for
the minimum mtu, l->mtu will be set to 4 in tipc_link_proto_rcv(), then
n->links[bearer_id].mtu is set to 4294967228, which is a overflow of
'4 - INT_H_SIZE - EMSG_OVERHEAD' in tipc_link_mss().

With tipc_link.mtu = 4, tipc_link_xmit() kept printing the warning:

 tipc: Too large msg, purging xmit list 1 5 0 40 4!
 tipc: Too large msg, purging xmit list 1 15 0 60 4!

And with tipc_link_entry.mtu 4294967228, a huge skb was allocated in
named_distribute(), and when purging it in tipc_link_xmit(), a crash
was even caused:

  general protection fault, probably for non-canonical address 0x2100001011000dd: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.3.0.neta #19
  RIP: 0010:kfree_skb_list_reason+0x7e/0x1f0
  Call Trace:
   <IRQ>
   skb_release_data+0xf9/0x1d0
   kfree_skb_reason+0x40/0x100
   tipc_link_xmit+0x57a/0x740 [tipc]
   tipc_node_xmit+0x16c/0x5c0 [tipc]
   tipc_named_node_up+0x27f/0x2c0 [tipc]
   tipc_node_write_unlock+0x149/0x170 [tipc]
   tipc_rcv+0x608/0x740 [tipc]
   tipc_udp_recv+0xdc/0x1f0 [tipc]
   udp_queue_rcv_one_skb+0x33e/0x620
   udp_unicast_rcv_skb.isra.72+0x75/0x90
   __udp4_lib_rcv+0x56d/0xc20
   ip_protocol_deliver_rcu+0x100/0x2d0

This patch fixes it by checking the new mtu against tipc_bearer_min_mtu(),
and not updating mtu if it is too small.

Fixes: ed193ece26 ("tipc: simplify link mtu negotiation")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 10:21:20 +01:00
Xin Long 3ae6d66b60 tipc: add tipc_bearer_min_mtu to calculate min mtu
As different media may requires different min mtu, and even the
same media with different net family requires different min mtu,
add tipc_bearer_min_mtu() to calculate min mtu accordingly.

This API will be used to check the new mtu when doing the link
mtu negotiation in the next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 10:21:20 +01:00
Naveen Mamindlapalli 12e7789ad5 sch_htb: Allow HTB priority parameter in offload mode
The current implementation of HTB offload returns the EINVAL error
for unsupported parameters like prio and quantum. This patch removes
the error returning checks for 'prio' parameter and populates its
value to tc_htb_qopt_offload structure such that driver can use the
same.

Add prio parameter check in mlx5 driver, as mlx5 devices are not capable
of supporting the prio parameter when htb offload is used. Report error
if prio parameter is set to a non-default value.

Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Co-developed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 09:31:07 +01:00
Angus Chen b2cbac9b9b net: Remove low_thresh in ip defrag
As low_thresh has no work in fragment reassembles,del it.
And Mark it deprecated in sysctl Document.

Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 08:42:07 +01:00
Dong Chenchen c83b49383b net: nsh: Use correct mac_offset to unwind gso skb in nsh_gso_segment()
As the call trace shows, skb_panic was caused by wrong skb->mac_header
in nsh_gso_segment():

invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 3 PID: 2737 Comm: syz Not tainted 6.3.0-next-20230505 #1
RIP: 0010:skb_panic+0xda/0xe0
call Trace:
 skb_push+0x91/0xa0
 nsh_gso_segment+0x4f3/0x570
 skb_mac_gso_segment+0x19e/0x270
 __skb_gso_segment+0x1e8/0x3c0
 validate_xmit_skb+0x452/0x890
 validate_xmit_skb_list+0x99/0xd0
 sch_direct_xmit+0x294/0x7c0
 __dev_queue_xmit+0x16f0/0x1d70
 packet_xmit+0x185/0x210
 packet_snd+0xc15/0x1170
 packet_sendmsg+0x7b/0xa0
 sock_sendmsg+0x14f/0x160

The root cause is:
nsh_gso_segment() use skb->network_header - nhoff to reset mac_header
in skb_gso_error_unwind() if inner-layer protocol gso fails.
However, skb->network_header may be reset by inner-layer protocol
gso function e.g. mpls_gso_segment. skb->mac_header reset by the
inaccurate network_header will be larger than skb headroom.

nsh_gso_segment
    nhoff = skb->network_header - skb->mac_header;
    __skb_pull(skb,nsh_len)
    skb_mac_gso_segment
        mpls_gso_segment
            skb_reset_network_header(skb);//skb->network_header+=nsh_len
            return -EINVAL;
    skb_gso_error_unwind
        skb_push(skb, nsh_len);
        skb->mac_header = skb->network_header - nhoff;
        // skb->mac_header > skb->headroom, cause skb_push panic

Use correct mac_offset to restore mac_header and get rid of nhoff.

Fixes: c411ed8545 ("nsh: add GSO support")
Reported-by: syzbot+632b5d9964208bfef8c0@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 08:40:27 +01:00
Chuck Lever 07a2730593 SUNRPC: Fix trace_svc_register() call site
The trace event recorded incorrect values for the registered family,
protocol, and port because the arguments are in the wrong order.

Fixes: b4af59328c ("SUNRPC: Trace server-side rpcbind registration events")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:58:14 -04:00
NeilBrown 948f072ada SUNRPC: always free ctxt when freeing deferred request
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not
been sufficient to use kfree() to free a deferred request.  We may need
to free the ctxt as well.

As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose
it to explicit do that even when the ctxt is not stored in an rqst.
So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt,
which may have been taken either from an rqst or from a dreq.  The
caller is now responsible for clearing that pointer after the call to
->xpo_release_ctxt.

We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when
revisiting a deferred request.  This ensures there is only one pointer
to the ctxt, so the risk of double freeing in future is reduced.  The
new code in svc_xprt_release which releases both the ctxt and any
rq_deferred depends on this.

Fixes: 773f91b2cf ("SUNRPC: Fix NFSD's request deferral on RDMA transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:55:02 -04:00
NeilBrown eb8d3a2c80 SUNRPC: double free xprt_ctxt while still in use
When an RPC request is deferred, the rq_xprt_ctxt pointer is moved out
of the svc_rqst into the svc_deferred_req.
When the deferred request is revisited, the pointer is copied into
the new svc_rqst - and also remains in the svc_deferred_req.

In the (rare?) case that the request is deferred a second time, the old
svc_deferred_req is reused - it still has all the correct content.
However in that case the rq_xprt_ctxt pointer is NOT cleared so that
when xpo_release_xprt is called, the ctxt is freed (UDP) or possible
added to a free list (RDMA).
When the deferred request is revisited for a second time, it will
reference this ctxt which may be invalid, and the free the object a
second time which is likely to oops.

So change svc_defer() to *always* clear rq_xprt_ctxt, and assert that
the value is now stored in the svc_deferred_req.

Fixes: 773f91b2cf ("SUNRPC: Fix NFSD's request deferral on RDMA transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:55:01 -04:00
Kuniyuki Iwashima f1b5dfe63f ping: Convert hlist_nulls to plain hlist.
Since introduced in commit c319b4d76b ("net: ipv4: add IPPROTO_ICMP
socket kind"), ping socket does not use SLAB_TYPESAFE_BY_RCU nor check
nulls marker in loops.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 20:04:31 +01:00
Yunsheng Lin b51f4113eb net: introduce and use skb_frag_fill_page_desc()
Most users use __skb_frag_set_page()/skb_frag_off_set()/
skb_frag_size_set() to fill the page desc for a skb frag.

Introduce skb_frag_fill_page_desc() to do that.

net/bpf/test_run.c does not call skb_frag_off_set() to
set the offset, "copy_from_user(page_address(page), ...)"
and 'shinfo' being part of the 'data' kzalloced in
bpf_test_init() suggest that it is assuming offset to be
initialized as zero, so call skb_frag_fill_page_desc()
with offset being zero for this case.

Also, skb_frag_set_page() is not used anymore, so remove
it.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 19:47:56 +01:00
Xin Long d80fc101d2 erspan: get the proto with the md version for collect_md
In commit 20704bd163 ("erspan: build the header with the right proto
according to erspan_ver"), it gets the proto with t->parms.erspan_ver,
but t->parms.erspan_ver is not used by collect_md branch, and instead
it should get the proto with md->version for collect_md.

Thanks to Kevin for pointing this out.

Fixes: 20704bd163 ("erspan: build the header with the right proto according to erspan_ver")
Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Reported-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 16:58:58 +01:00
Alexander Mikhalitsyn 2598619e01 sctp: add bpf_bypass_getsockopt proto callback
Implement ->bpf_bypass_getsockopt proto callback and filter out
SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3
socket options from running eBPF hook on them.

SCTP_SOCKOPT_PEELOFF and SCTP_SOCKOPT_PEELOFF_FLAGS options do fd_install(),
and if BPF_CGROUP_RUN_PROG_GETSOCKOPT hook returns an error after success of
the original handler sctp_getsockopt(...), userspace will receive an error
from getsockopt syscall and will be not aware that fd was successfully
installed into a fdtable.

As pointed by Marcelo Ricardo Leitner it seems reasonable to skip
bpf getsockopt hook for SCTP_SOCKOPT_CONNECTX3 sockopt too.
Because internaly, it triggers connect() and if error is masked
then userspace will be confused.

This patch was born as a result of discussion around a new SCM_PIDFD interface:
https://lore.kernel.org/all/20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com/

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Suggested-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 10:07:16 +01:00
Eric Dumazet 1e306ec49a tcp: fix possible sk_priority leak in tcp_v4_send_reset()
When tcp_v4_send_reset() is called with @sk == NULL,
we do not change ctl_sk->sk_priority, which could have been
set from a prior invocation.

Change tcp_v4_send_reset() to set sk_priority and sk_mark
fields before calling ip_send_unicast_reply().

This means tcp_v4_send_reset() and tcp_v4_send_ack()
no longer have to clear ctl_sk->sk_mark after
their call to ip_send_unicast_reply().

Fixes: f6c0f5d209 ("tcp: honor SO_PRIORITY in TIME_WAIT state")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 10:05:50 +01:00
Zhuang Shengen 6d4486efe9 vsock: avoid to close connected socket after the timeout
When client and server establish a connection through vsock,
the client send a request to the server to initiate the connection,
then start a timer to wait for the server's response. When the server's
RESPONSE message arrives, the timer also times out and exits. The
server's RESPONSE message is processed first, and the connection is
established. However, the client's timer also times out, the original
processing logic of the client is to directly set the state of this vsock
to CLOSE and return ETIMEDOUT. It will not notify the server when the port
is released, causing the server port remain.
when client's vsock_connect timeout,it should check sk state is
ESTABLISHED or not. if sk state is ESTABLISHED, it means the connection
is established, the client should not set the sk state to CLOSE

Note: I encountered this issue on kernel-4.18, which can be fixed by
this patch. Then I checked the latest code in the community
and found similar issue.

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 10:04:10 +01:00
Chuck Lever eefca7ec51 net/handshake: Enable the SNI extension to work properly
Enable the upper layer protocol to specify the SNI peername. This
avoids the need for tlshd to use a DNS lookup, which can return a
hostname that doesn't match the incoming certificate's SubjectName.

Fixes: 2fd5532044 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:08 +01:00
Chuck Lever f921bd4100 net/handshake: Unpin sock->file if a handshake is cancelled
If user space never calls DONE, sock->file's reference count remains
elevated. Enable sock->file to be freed eventually in this case.

Reported-by: Jakub Kacinski <kuba@kernel.org>
Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:08 +01:00
Chuck Lever e36a93e172 net/handshake: handshake_genl_notify() shouldn't ignore @flags
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:08 +01:00
Chuck Lever 7301034026 net/handshake: Fix uninitialized local variable
trace_handshake_cmd_done_err() simply records the pointer in @req,
so initializing it to NULL is sufficient and safe.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:07 +01:00
Chuck Lever 2200c1a870 net/handshake: Fix handshake_dup() ref counting
If get_unused_fd_flags() fails, we ended up calling fput(sock->file)
twice.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:07 +01:00
Chuck Lever b16d76fe9a net/handshake: Remove unneeded check from handshake_dup()
handshake_req_submit() now verifies that the socket has a file.

Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:24:07 +01:00
Eric Dumazet ef1148d448 ipv6: remove nexthop_fib6_nh_bh()
After blamed commit, nexthop_fib6_nh_bh() and nexthop_fib6_nh()
are the same.

Delete nexthop_fib6_nh_bh(), and convert /proc/net/ipv6_route
to standard rcu to avoid this splat:

[ 5723.180080] WARNING: suspicious RCU usage
[ 5723.180083] -----------------------------
[ 5723.180084] include/net/nexthop.h:516 suspicious rcu_dereference_check() usage!
[ 5723.180086]
other info that might help us debug this:

[ 5723.180087]
rcu_scheduler_active = 2, debug_locks = 1
[ 5723.180089] 2 locks held by cat/55856:
[ 5723.180091] #0: ffff9440a582afa8 (&p->lock){+.+.}-{3:3}, at: seq_read_iter (fs/seq_file.c:188)
[ 5723.180100] #1: ffffffffaac07040 (rcu_read_lock_bh){....}-{1:2}, at: rcu_lock_acquire (include/linux/rcupdate.h:326)
[ 5723.180109]
stack backtrace:
[ 5723.180111] CPU: 14 PID: 55856 Comm: cat Tainted: G S        I        6.3.0-dbx-DEV #528
[ 5723.180115] Call Trace:
[ 5723.180117]  <TASK>
[ 5723.180119] dump_stack_lvl (lib/dump_stack.c:107)
[ 5723.180124] dump_stack (lib/dump_stack.c:114)
[ 5723.180126] lockdep_rcu_suspicious (include/linux/context_tracking.h:122)
[ 5723.180132] ipv6_route_seq_show (include/net/nexthop.h:?)
[ 5723.180135] ? ipv6_route_seq_next (net/ipv6/ip6_fib.c:2605)
[ 5723.180140] seq_read_iter (fs/seq_file.c:272)
[ 5723.180145] seq_read (fs/seq_file.c:163)
[ 5723.180151] proc_reg_read (fs/proc/inode.c:316 fs/proc/inode.c:328)
[ 5723.180155] vfs_read (fs/read_write.c:468)
[ 5723.180160] ? up_read (kernel/locking/rwsem.c:1617)
[ 5723.180164] ksys_read (fs/read_write.c:613)
[ 5723.180168] __x64_sys_read (fs/read_write.c:621)
[ 5723.180170] do_syscall_64 (arch/x86/entry/common.c:?)
[ 5723.180174] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 5723.180177] RIP: 0033:0x7fa455677d2a

Fixes: 09eed1192c ("neighbour: switch to standard rcu, instead of rcu_bh")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230510154646.370659-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:07:05 -07:00
Jiri Pirko e93c9378e3 devlink: change per-devlink netdev notifier to static one
The commit 565b4824c3 ("devlink: change port event netdev notifier
from per-net to global") changed original per-net notifier to be
per-devlink instance. That fixed the issue of non-receiving events
of netdev uninit if that moved to a different namespace.
That worked fine in -net tree.

However, later on when commit ee75f1fc44 ("net/mlx5e: Create
separate devlink instance for ethernet auxiliary device") and
commit 72ed5d5624 ("net/mlx5: Suspend auxiliary devices only in
case of PCI device suspend") were merged, a deadlock was introduced
when removing a namespace with devlink instance with another nested
instance.

Here there is the bad flow example resulting in deadlock with mlx5:
net_cleanup_work -> cleanup_net (takes down_read(&pernet_ops_rwsem) ->
devlink_pernet_pre_exit() -> devlink_reload() ->
mlx5_devlink_reload_down() -> mlx5_unload_one_devl_locked() ->
mlx5_detach_device() -> del_adev() -> mlx5e_remove() ->
mlx5e_destroy_devlink() -> devlink_free() ->
unregister_netdevice_notifier() (takes down_write(&pernet_ops_rwsem)

Steps to reproduce:
$ modprobe mlx5_core
$ ip netns add ns1
$ devlink dev reload pci/0000:08:00.0 netns ns1
$ ip netns del ns1

Resolve this by converting the notifier from per-devlink instance to
a static one registered during init phase and leaving it registered
forever. Use this notifier for all devlink port instances created
later on.

Note what a tree needs this fix only in case all of the cited fixes
commits are present.

Reported-by: Moshe Shemesh <moshe@nvidia.com>
Fixes: 565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
Fixes: ee75f1fc44 ("net/mlx5e: Create separate devlink instance for ethernet auxiliary device")
Fixes: 72ed5d5624 ("net/mlx5: Suspend auxiliary devices only in case of PCI device suspend")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230510144621.932017-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:06:50 -07:00
Jakub Kicinski bc88ba0cad Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes. No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 09:06:55 -07:00
David Morley ccce324dab tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-11 10:31:16 +02:00
Jakub Kicinski cceac92678 netfilter pull request 23-05-10
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmRbUnQACgkQ1V2XiooU
 IOQNVw//eoCiid3J4TuNdzHaBHXUlZvln1n5Z6K5fIz5ytrY1T7oKC8Y9StkzzWR
 29ToFLOJt1iRAcgxsghWRIwUzNwpuUdqgd9cUZMHMQxT0BJItp6FXUql2+1LkF/I
 b/gnnb90zyE7lBS/VSRyOiqMiJlP+Som22d7Nn5k2KfTYEdXKwfzjsWAu3W3Sb0s
 Lv/MA9DE42qcwiZubmFmDtOtAunPJFZm3HgkcAVeXoNkBDrSfkvxLIMYG6VfFNhQ
 AkKMyzX293wpwVxfOuQfJr4QVlxAgOQUko+FqajoWMBtfA3yldZjJ8RC7c9Af9uI
 ciOP11vHBCG84KrTabC5kdqOcvadreDiM/oIvk57ztQhCr3e+po+vIz6Cv4p9I30
 m5GXfgbtMRl6hM2S5lrRc5fNRkYJHE4aNvesFTGaLpK3LogusH1E9mH5jdjnbU42
 TwkIe250qJJelNn9ZxS5Jt0BgyogNfeiA9lOXmaQBpYmwIahrjDf0g8z2I85nPhF
 PDukjSXsxi4uHwpSF5wFrlqkAPiEX4vC95uSUTVbbBgZrHhNJdGgu4FJSHRM4mVo
 6Awxk2O2bvcDpXEfTBdD4EJF71bZ/aH2i8ddU1oupIl88O01TWTtkO4SYHqMX+tC
 fPnpGzBN8KJvHgeFxO4p0v1oelv2RtiITv1gi7YJOnq/+Vd2yy0=
 =HmfP
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter updates for net

The following patchset contains Netfilter fixes for net:

1) Fix UAF when releasing netnamespace, from Florian Westphal.

2) Fix possible BUG_ON when nf_conntrack is enabled with enable_hooks,
   from Florian Westphal.

3) Fixes for nft_flowtable.sh selftest, from Boris Sukholitko.

4) Extend nft_flowtable.sh selftest to cover integration with
   ingress/egress hooks, from Florian Westphal.

* tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: nft_flowtable.sh: check ingress/egress chain too
  selftests: nft_flowtable.sh: monitor result file sizes
  selftests: nft_flowtable.sh: wait for specific nc pids
  selftests: nft_flowtable.sh: no need for ps -x option
  selftests: nft_flowtable.sh: use /proc for pid checking
  netfilter: conntrack: fix possible bug_on with enable_hooks=1
  netfilter: nf_tables: always release netdev hooks from notifier
====================

Link: https://lore.kernel.org/r/20230510083313.152961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:08:58 -07:00
Kuniyuki Iwashima e1d09c2c2f af_unix: Fix data races around sk->sk_shutdown.
KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
and unix_dgram_poll() read it locklessly.

We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().

BUG: KCSAN: data-race in unix_poll / unix_release_sock

write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
 unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
 unix_release+0x59/0x80 net/unix/af_unix.c:1042
 __sock_release+0x7d/0x170 net/socket.c:653
 sock_close+0x19/0x30 net/socket.c:1397
 __fput+0x179/0x5e0 fs/file_table.c:321
 ____fput+0x15/0x20 fs/file_table.c:349
 task_work_run+0x116/0x1a0 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
 syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
 unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
 sock_poll+0xcf/0x2b0 net/socket.c:1385
 vfs_poll include/linux/poll.h:88 [inline]
 ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
 ep_send_events fs/eventpoll.c:1694 [inline]
 ep_poll fs/eventpoll.c:1823 [inline]
 do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
 __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
 __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
 __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x00 -> 0x03

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 3c73419c09 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:53 -07:00
Kuniyuki Iwashima 679ed006d4 af_unix: Fix a data race of sk->sk_receive_queue->qlen.
KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
updates qlen under the queue lock and sendmsg() checks qlen under
unix_state_sock(), not the queue lock, so the reader side needs
READ_ONCE().

BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer

write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
 __skb_unlink include/linux/skbuff.h:2347 [inline]
 __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
 __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
 __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
 unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
 sock_recvmsg_nosec net/socket.c:1019 [inline]
 ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
 ___sys_recvmsg+0xc8/0x150 net/socket.c:2764
 do_recvmmsg+0x182/0x560 net/socket.c:2858
 __sys_recvmmsg net/socket.c:2937 [inline]
 __do_sys_recvmmsg net/socket.c:2960 [inline]
 __se_sys_recvmmsg net/socket.c:2953 [inline]
 __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
 skb_queue_len include/linux/skbuff.h:2127 [inline]
 unix_recvq_full net/unix/af_unix.c:229 [inline]
 unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
 unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg+0x148/0x160 net/socket.c:747
 ____sys_sendmsg+0x20e/0x620 net/socket.c:2503
 ___sys_sendmsg+0xc6/0x140 net/socket.c:2557
 __sys_sendmmsg+0x11d/0x370 net/socket.c:2643
 __do_sys_sendmmsg net/socket.c:2672 [inline]
 __se_sys_sendmmsg net/socket.c:2669 [inline]
 __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x0000000b -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:53 -07:00
Eric Dumazet 5bca1d081f net: datagram: fix data-races in datagram_poll()
datagram_poll() runs locklessly, we should add READ_ONCE()
annotations while reading sk->sk_err, sk->sk_shutdown and sk->sk_state.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230509173131.3263780-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:49 -07:00
Ilia.Gavrilov 059fa49202 sctp: fix a potential OOB access in sctp_sched_set_sched()
The 'sched' index value must be checked before accessing an element
of the 'sctp_sched_ops' array. Otherwise, it can lead to OOB access.

Note that it's harmless since the 'sched' parameter is checked before
calling 'sctp_sched_set_sched'.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 12:10:15 +01:00
Eric Dumazet e14cadfd80 tcp: add annotations around sk->sk_shutdown accesses
Now sk->sk_shutdown is no longer a bitfield, we can add
standard READ_ONCE()/WRITE_ONCE() annotations to silence
KCSAN reports like the following:

BUG: KCSAN: data-race in tcp_disconnect / tcp_poll

write to 0xffff88814588582c of 1 bytes by task 3404 on cpu 1:
tcp_disconnect+0x4d6/0xdb0 net/ipv4/tcp.c:3121
__inet_stream_connect+0x5dd/0x6e0 net/ipv4/af_inet.c:715
inet_stream_connect+0x48/0x70 net/ipv4/af_inet.c:727
__sys_connect_file net/socket.c:2001 [inline]
__sys_connect+0x19b/0x1b0 net/socket.c:2018
__do_sys_connect net/socket.c:2028 [inline]
__se_sys_connect net/socket.c:2025 [inline]
__x64_sys_connect+0x41/0x50 net/socket.c:2025
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88814588582c of 1 bytes by task 3374 on cpu 0:
tcp_poll+0x2e6/0x7d0 net/ipv4/tcp.c:562
sock_poll+0x253/0x270 net/socket.c:1383
vfs_poll include/linux/poll.h:88 [inline]
io_poll_check_events io_uring/poll.c:281 [inline]
io_poll_task_func+0x15a/0x820 io_uring/poll.c:333
handle_tw_list io_uring/io_uring.c:1184 [inline]
tctx_task_work+0x1fe/0x4d0 io_uring/io_uring.c:1246
task_work_run+0x123/0x160 kernel/task_work.c:179
get_signal+0xe64/0xff0 kernel/signal.c:2635
arch_do_signal_or_restart+0x89/0x2a0 arch/x86/kernel/signal.c:306
exit_to_user_mode_loop+0x6f/0xe0 kernel/entry/common.c:168
exit_to_user_mode_prepare+0x6c/0xb0 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x03 -> 0x00

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 10:27:31 +01:00
Eric Dumazet 4063384ef7 net: add vlan_get_protocol_and_depth() helper
Before blamed commit, pskb_may_pull() was used instead
of skb_header_pointer() in __vlan_get_protocol() and friends.

Few callers depended on skb->head being populated with MAC header,
syzbot caught one of them (skb_mac_gso_segment())

Add vlan_get_protocol_and_depth() to make the intent clearer
and use it where sensible.

This is a more generic fix than commit e9d3f80935
("net/af_packet: make sure to pull mac header") which was
dealing with a similar issue.

kernel BUG at include/linux/skbuff.h:2655 !
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline]
RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136
Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286
RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9
R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012
R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7
FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
[<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419
[<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline]
[<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725
[<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313
[<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029
[<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline]
[<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142
[<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline]
[<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline]
[<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139
[<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline]
[<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline]
[<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147
[<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
[<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 469aceddfa ("vlan: consolidate VLAN parsing code and limit max parsing depth")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 10:25:55 +01:00
Eric Dumazet d0ac89f6f9 net: deal with most data-races in sk_wait_event()
__condition is evaluated twice in sk_wait_event() macro.

First invocation is lockless, and reads can race with writes,
as spotted by syzbot.

BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect

write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1:
tcp_disconnect+0x2cd/0xdb0
inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911
__sys_shutdown_sock net/socket.c:2343 [inline]
__sys_shutdown net/socket.c:2355 [inline]
__do_sys_shutdown net/socket.c:2363 [inline]
__se_sys_shutdown+0xf8/0x140 net/socket.c:2361
__x64_sys_shutdown+0x31/0x40 net/socket.c:2361
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0:
sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75
tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
__sys_sendto+0x246/0x300 net/socket.c:2142
__do_sys_sendto net/socket.c:2154 [inline]
__se_sys_sendto net/socket.c:2150 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2150
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00000000 -> 0x00000068

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 10:03:32 +01:00
Eric Dumazet e05a5f510f net: annotate sk->sk_err write from do_recvmmsg()
do_recvmmsg() can write to sk->sk_err from multiple threads.

As said before, many other points reading or writing sk_err
need annotations.

Fixes: 34b88a68f2 ("net: Fix use after free in the recvmmsg exit path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 09:58:29 +01:00
Eric Dumazet a939d14919 netlink: annotate accesses to nlk->cb_running
Both netlink_recvmsg() and netlink_native_seq_show() read
nlk->cb_running locklessly. Use READ_ONCE() there.

Add corresponding WRITE_ONCE() to netlink_dump() and
__netlink_dump_start()

syzbot reported:
BUG: KCSAN: data-race in __netlink_dump_start / netlink_recvmsg

write to 0xffff88813ea4db59 of 1 bytes by task 28219 on cpu 0:
__netlink_dump_start+0x3af/0x4d0 net/netlink/af_netlink.c:2399
netlink_dump_start include/linux/netlink.h:308 [inline]
rtnetlink_rcv_msg+0x70f/0x8c0 net/core/rtnetlink.c:6130
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2577
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6192
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
sock_write_iter+0x1aa/0x230 net/socket.c:1138
call_write_iter include/linux/fs.h:1851 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x463/0x760 fs/read_write.c:584
ksys_write+0xeb/0x1a0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x42/0x50 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88813ea4db59 of 1 bytes by task 28222 on cpu 1:
netlink_recvmsg+0x3b4/0x730 net/netlink/af_netlink.c:2022
sock_recvmsg_nosec+0x4c/0x80 net/socket.c:1017
____sys_recvmsg+0x2db/0x310 net/socket.c:2718
___sys_recvmsg net/socket.c:2762 [inline]
do_recvmmsg+0x2e5/0x710 net/socket.c:2856
__sys_recvmmsg net/socket.c:2935 [inline]
__do_sys_recvmmsg net/socket.c:2958 [inline]
__se_sys_recvmmsg net/socket.c:2951 [inline]
__x64_sys_recvmmsg+0xe2/0x160 net/socket.c:2951
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00 -> 0x01

Fixes: 16b304f340 ("netlink: Eliminate kmalloc in netlink dump operation.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 09:28:38 +01:00
Martin Wetterwald 81ac2722fa net: ipconfig: Allow DNS to be overwritten by DHCPACK
Some DHCP server implementations only send the important requested DHCP
options in the final BOOTP reply (DHCPACK).
One example is systemd-networkd.
However, RFC2131, in section 4.3.1 states:

> The server MUST return to the client:
> [...]
> o Parameters requested by the client, according to the following
>   rules:
>
>      -- IF the server has been explicitly configured with a default
>         value for the parameter, the server MUST include that value
>         in an appropriate option in the 'option' field, ELSE

I've reported the issue here:
https://github.com/systemd/systemd/issues/27471

Linux PNP DHCP client implementation only takes into account the DNS
servers received in the first BOOTP reply (DHCPOFFER).
This usually isn't an issue as servers are required to put the same
values in the DHCPOFFER and DHCPACK.
However, RFC2131, in section 4.3.2 states:

> Any configuration parameters in the DHCPACK message SHOULD NOT
> conflict with those in the earlier DHCPOFFER message to which the
> client is responding.  The client SHOULD use the parameters in the
> DHCPACK message for configuration.

When making Linux PNP DHCP client (cmdline ip=dhcp) interact with
systemd-networkd DHCP server, an interesting "protocol misunderstanding"
happens:
Because DNS servers were only specified in the DHCPACK and not in the
DHCPOFFER, Linux will not catch the correct DNS servers: in the first
BOOTP reply (DHCPOFFER), it sees that there is no DNS, and sets as
fallback the IP of the DHCP server itself. When the second BOOTP reply
comes (DHCPACK), it's already too late: the kernel will not overwrite
the fallback setting it has set previously.

This patch makes the kernel overwrite its DNS fallback by DNS servers
specified in the DHCPACK if any.

Signed-off-by: Martin Wetterwald <martin@wetterwald.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 09:20:47 +01:00
Florian Westphal e72eeab542 netfilter: conntrack: fix possible bug_on with enable_hooks=1
I received a bug report (no reproducer so far) where we trip over

712         rcu_read_lock();
713         ct_hook = rcu_dereference(nf_ct_hook);
714         BUG_ON(ct_hook == NULL);  // here

In nf_conntrack_destroy().

First turn this BUG_ON into a WARN.  I think it was triggered
via enable_hooks=1 flag.

When this flag is turned on, the conntrack hooks are registered
before nf_ct_hook pointer gets assigned.
This opens a short window where packets enter the conntrack machinery,
can have skb->_nfct set up and a subsequent kfree_skb might occur
before nf_ct_hook is set.

Call nf_conntrack_init_end() to set nf_ct_hook before we register the
pernet ops.

Fixes: ba3fbe6636 ("netfilter: nf_conntrack: provide modparam to always register conntrack hooks")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-10 08:50:39 +02:00
Florian Westphal dc1c9fd4a8 netfilter: nf_tables: always release netdev hooks from notifier
This reverts "netfilter: nf_tables: skip netdev events generated on netns removal".

The problem is that when a veth device is released, the veth release
callback will also queue the peer netns device for removal.

Its possible that the peer netns is also slated for removal.  In this
case, the device memory is already released before the pre_exit hook of
the peer netns runs:

BUG: KASAN: slab-use-after-free in nf_hook_entry_head+0x1b8/0x1d0
Read of size 8 at addr ffff88812c0124f0 by task kworker/u8:1/45
Workqueue: netns cleanup_net
Call Trace:
 nf_hook_entry_head+0x1b8/0x1d0
 __nf_unregister_net_hook+0x76/0x510
 nft_netdev_unregister_hooks+0xa0/0x220
 __nft_release_hook+0x184/0x490
 nf_tables_pre_exit_net+0x12f/0x1b0
 ..

Order is:
1. First netns is released, veth_dellink() queues peer netns device
   for removal
2. peer netns is queued for removal
3. peer netns device is released, unreg event is triggered
4. unreg event is ignored because netns is going down
5. pre_exit hook calls nft_netdev_unregister_hooks but device memory
   might be free'd already.

Fixes: 68a3765c65 ("netfilter: nf_tables: skip netdev events generated on netns removal")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-10 08:50:18 +02:00
Benedict Wong 8680407b6f xfrm: Check if_id in inbound policy/secpath match
This change ensures that if configured in the policy, the if_id set in
the policy and secpath states match during the inbound policy check.
Without this, there is potential for ambiguity where entries in the
secpath differing by only the if_id could be mismatched.

Notably, this is checked in the outbound direction when resolving
templates to SAs, but not on the inbound path when matching SAs and
policies.

Test: Tested against Android kernel unit tests & CTS
Signed-off-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-05-10 07:56:05 +02:00
Tobias Brunner cf3128a7ac af_key: Reject optional tunnel/BEET mode templates in outbound policies
xfrm_state_find() uses `encap_family` of the current template with
the passed local and remote addresses to find a matching state.
If an optional tunnel or BEET mode template is skipped in a mixed-family
scenario, there could be a mismatch causing an out-of-bounds read as
the addresses were not replaced to match the family of the next template.

While there are theoretical use cases for optional templates in outbound
policies, the only practical one is to skip IPComp states in inbound
policies if uncompressed packets are received that are handled by an
implicitly created IPIP state instead.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-05-10 07:04:51 +02:00
Tobias Brunner 3d776e31c8 xfrm: Reject optional tunnel/BEET mode templates in outbound policies
xfrm_state_find() uses `encap_family` of the current template with
the passed local and remote addresses to find a matching state.
If an optional tunnel or BEET mode template is skipped in a mixed-family
scenario, there could be a mismatch causing an out-of-bounds read as
the addresses were not replaced to match the family of the next template.

While there are theoretical use cases for optional templates in outbound
policies, the only practical one is to skip IPComp states in inbound
policies if uncompressed packets are received that are handled by an
implicitly created IPIP state instead.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-05-10 07:04:50 +02:00
Lukas Bulwahn 559ae55cfc net: skbuff: remove special handling for SLOB
Commit c9929f0e34 ("mm/slob: remove CONFIG_SLOB") removes CONFIG_SLOB.
Now, we can also remove special handling for socket buffers with the SLOB
allocator. The code with HAVE_SKB_SMALL_HEAD_CACHE=1 is now the default
behavior for all allocators.

Remove an unnecessary distinction between SLOB and SLAB/SLUB allocator
after the SLOB allocator is gone.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509071207.28942-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-09 19:40:32 -07:00
Eric Dumazet 424f8416bb net: skb_partial_csum_set() fix against transport header magic value
skb->transport_header uses the special 0xFFFF value
to mark if the transport header was set or not.

We must prevent callers to accidentaly set skb->transport_header
to 0xFFFF. Note that only fuzzers can possibly do this today.

syzbot reported:

WARNING: CPU: 0 PID: 2340 at include/linux/skbuff.h:2847 skb_transport_offset include/linux/skbuff.h:2956 [inline]
WARNING: CPU: 0 PID: 2340 at include/linux/skbuff.h:2847 virtio_net_hdr_to_skb+0xbcc/0x10c0 include/linux/virtio_net.h:103
Modules linked in:
CPU: 0 PID: 2340 Comm: syz-executor.0 Not tainted 6.3.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
RIP: 0010:skb_transport_header include/linux/skbuff.h:2847 [inline]
RIP: 0010:skb_transport_offset include/linux/skbuff.h:2956 [inline]
RIP: 0010:virtio_net_hdr_to_skb+0xbcc/0x10c0 include/linux/virtio_net.h:103
Code: 41 39 df 0f 82 c3 04 00 00 48 8b 7c 24 10 44 89 e6 e8 08 6e 59 ff 48 85 c0 74 54 e8 ce 36 7e fc e9 37 f8 ff ff e8 c4 36 7e fc <0f> 0b e9 93 f8 ff ff 44 89 f7 44 89 e6 e8 32 38 7e fc 45 39 e6 0f
RSP: 0018:ffffc90004497880 EFLAGS: 00010293
RAX: ffffffff84fea55c RBX: 000000000000ffff RCX: ffff888120be2100
RDX: 0000000000000000 RSI: 000000000000ffff RDI: 000000000000ffff
RBP: ffffc90004497990 R08: ffffffff84fe9de5 R09: 0000000000000034
R10: ffffea00048ebd80 R11: 0000000000000034 R12: ffff88811dc2d9c8
R13: dffffc0000000000 R14: ffff88811dc2d9ae R15: 1ffff11023b85b35
FS: 00007f9211a59700(0000) GS:ffff8881f6c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200002c0 CR3: 00000001215a5000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
packet_snd net/packet/af_packet.c:3076 [inline]
packet_sendmsg+0x4590/0x61a0 net/packet/af_packet.c:3115
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
__sys_sendto+0x472/0x630 net/socket.c:2144
__do_sys_sendto net/socket.c:2156 [inline]
__se_sys_sendto net/socket.c:2152 [inline]
__x64_sys_sendto+0xe5/0x100 net/socket.c:2152
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f9210c8c169
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9211a59168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f9210dabf80 RCX: 00007f9210c8c169
RDX: 000000000000ffed RSI: 00000000200000c0 RDI: 0000000000000003
RBP: 00007f9210ce7ca1 R08: 0000000020000540 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe135d65cf R14: 00007f9211a59300 R15: 0000000000022000

Fixes: 66e4c8d950 ("net: warn if transport header was not set")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-07 14:55:37 +01:00
Linus Torvalds ed23734c23 Including fixes from netfilter.
Current release - regressions:
 
  - sched: act_pedit: free pedit keys on bail from offset check
 
 Current release - new code bugs:
 
  - pds_core:
   - Kconfig fixes (DEBUGFS and AUXILIARY_BUS)
   - fix mutex double unlock in error path
 
 Previous releases - regressions:
 
  - sched: cls_api: remove block_cb from driver_list before freeing
 
  - nf_tables: fix ct untracked match breakage
 
  - eth: mtk_eth_soc: drop generic vlan rx offload
 
  - sched: flower: fix error handler on replace
 
 Previous releases - always broken:
 
  - tcp: fix skb_copy_ubufs() vs BIG TCP
 
  - ipv6: fix skb hash for some RST packets
 
  - af_packet: don't send zero-byte data in packet_sendmsg_spkt()
 
  - rxrpc: timeout handling fixes after moving client call connection
    to the I/O thread
 
  - ixgbe: fix panic during XDP_TX with > 64 CPUs
 
  - igc: RMW the SRRCTL register to prevent losing timestamp config
 
  - dsa: mt7530: fix corrupt frames using TRGMII on 40 MHz XTAL MT7621
 
  - r8152:
    - fix flow control issue of RTL8156A
    - fix the poor throughput for 2.5G devices
    - move setting r8153b_rx_agg_chg_indicate() to fix coalescing
    - enable autosuspend
 
  - ncsi: clear Tx enable mode when handling a Config required AEN
 
  - octeontx2-pf: macsec: fixes for CN10KB ASIC rev
 
 Misc:
 
  - 9p: remove INET dependency
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmRVeUIACgkQMUZtbf5S
 IrtTug/9Hhg/L0PTSwrfuGh4W1/cjheMWppNLkwyWQUiKG7FcZQ9vu9PxceE3VRu
 2fTqHyvgDMZ8jACovXObeda8z1+g3s/tIPaXELephBIjVlF/h3kG2OaIzlU4jDb4
 A4vklwf8eLbfyVBG22QgKl/I70zVMtnmnOo6c6CPuIOTcMPzslndFO9tB0nCg99F
 DCgCM1BBP1tz+OUch2rLnSzYcqkWqS49BhRk6dhYSliawUFU/5+1tDGDjwWolkfm
 0jqP9DjBOSpZKO8m7SpsUNz7NFRIfYErWZ+YebWbggNxj/6TRJTP83MM0tGoK1rE
 /mz2xpuOki59frlwVOAD6gb/qefjHUp21P4NA7bnhizxFlQL5MHpCeGQ9yLHBSmY
 9Q4ArJkM4jXQ0oDA2nII/pz+cDZGEWFGQ14WW3kYUb7WFmISH4I9OiA9i0TBW6OL
 r1Y/rqzkUvtKWzh9RpiAF9lsdHAm3SX9ES5RfMxzv0x886VOZR4jaMmokRDdPRzq
 0r2Oyj75b62+X0r44Fe22Pl/kPS/uh3642xo9h85aAv/EvhT9JNzMvomJm9d6tkb
 966I085AVbwxPAy+rl5SWyAq60EWDExNTjZvPv0mSMlmSsQ9iK5//xOF2Saw2zai
 /44zQ27tVGkCC44Ou5KmfJN3u4OrKkhcuyxtcDr9QeoOdKZRkMg=
 =9xND
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - sched: act_pedit: free pedit keys on bail from offset check

  Current release - new code bugs:

   - pds_core:
      - Kconfig fixes (DEBUGFS and AUXILIARY_BUS)
      - fix mutex double unlock in error path

  Previous releases - regressions:

   - sched: cls_api: remove block_cb from driver_list before freeing

   - nf_tables: fix ct untracked match breakage

   - eth: mtk_eth_soc: drop generic vlan rx offload

   - sched: flower: fix error handler on replace

  Previous releases - always broken:

   - tcp: fix skb_copy_ubufs() vs BIG TCP

   - ipv6: fix skb hash for some RST packets

   - af_packet: don't send zero-byte data in packet_sendmsg_spkt()

   - rxrpc: timeout handling fixes after moving client call connection
     to the I/O thread

   - ixgbe: fix panic during XDP_TX with > 64 CPUs

   - igc: RMW the SRRCTL register to prevent losing timestamp config

   - dsa: mt7530: fix corrupt frames using TRGMII on 40 MHz XTAL MT7621

   - r8152:
      - fix flow control issue of RTL8156A
      - fix the poor throughput for 2.5G devices
      - move setting r8153b_rx_agg_chg_indicate() to fix coalescing
      - enable autosuspend

   - ncsi: clear Tx enable mode when handling a Config required AEN

   - octeontx2-pf: macsec: fixes for CN10KB ASIC rev

  Misc:

   - 9p: remove INET dependency"

* tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  net: bcmgenet: Remove phy_stop() from bcmgenet_netif_stop()
  pds_core: fix mutex double unlock in error path
  net/sched: flower: fix error handler on replace
  Revert "net/sched: flower: Fix wrong handle assignment during filter change"
  net/sched: flower: fix filter idr initialization
  net: fec: correct the counting of XDP sent frames
  bonding: add xdp_features support
  net: enetc: check the index of the SFI rather than the handle
  sfc: Add back mailing list
  virtio_net: suppress cpu stall when free_unused_bufs
  ice: block LAN in case of VF to VF offload
  net: dsa: mt7530: fix network connectivity with multiple CPU ports
  net: dsa: mt7530: fix corrupt frames using trgmii on 40 MHz XTAL MT7621
  9p: Remove INET dependency
  netfilter: nf_tables: fix ct untracked match breakage
  af_packet: Don't send zero-byte data in packet_sendmsg_spkt().
  igc: read before write to SRRCTL register
  pds_core: add AUXILIARY_BUS and NET_DEVLINK to Kconfig
  pds_core: remove CONFIG_DEBUG_FS from makefile
  ionic: catch failure from devlink_alloc
  ...
2023-05-05 19:12:01 -07:00
Chuck Lever 319050d430 SUNRPC: Fix error handling in svc_setup_socket()
Dan points out that sock_alloc_file() releases @sock on error, but
so do all of svc_setup_socket's callers, resulting in a double-
release if sock_alloc_file() returns an error.

Rather than allocating a struct file for all new sockets, allocate
one only for sockets created during a TCP accept. For the moment,
those are the only ones that will ever be used with RPC-with-TLS.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: ae0d77708a ("SUNRPC: Ensure server-side sockets have a sock->file")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-05 20:51:11 -04:00
Jakub Kicinski 644bca1d48 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
There's a fix which landed in net-next, pull it in along
with the couple of minor cleanups.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-05 13:35:45 -07:00
Vlad Buslov fd741f0d9f net/sched: flower: fix error handler on replace
When replacing a filter (i.e. 'fold' pointer is not NULL) the insertion of
new filter to idr is postponed until later in code since handle is already
provided by the user. However, the error handling code in fl_change()
always assumes that the new filter had been inserted into idr. If error
handler is reached when replacing existing filter it may remove it from idr
therefore making it unreachable for delete or dump afterwards. Fix the
issue by verifying that 'fold' argument wasn't provided by caller before
calling idr_remove().

Fixes: 08a0063df3 ("net/sched: flower: Move filter handle initialization earlier")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-05 10:01:31 +01:00
Vlad Buslov 5110f3ff6d Revert "net/sched: flower: Fix wrong handle assignment during filter change"
This reverts commit 32eff6bace.

Superseded by the following commit in this series.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-05 10:01:31 +01:00
Vlad Buslov dd4f6bbfa6 net/sched: flower: fix filter idr initialization
The cited commit moved idr initialization too early in fl_change() which
allows concurrent users to access the filter that is still being
initialized and is in inconsistent state, which, in turn, can cause NULL
pointer dereference [0]. Since there is no obvious way to fix the ordering
without reverting the whole cited commit, alternative approach taken to
first insert NULL pointer into idr in order to allocate the handle but
still cause fl_get() to return NULL and prevent concurrent users from
seeing the filter while providing miss-to-action infrastructure with valid
handle id early in fl_change().

[  152.434728] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN
[  152.436163] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[  152.437269] CPU: 4 PID: 3877 Comm: tc Not tainted 6.3.0-rc4+ #5
[  152.438110] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  152.439644] RIP: 0010:fl_dump_key+0x8b/0x1d10 [cls_flower]
[  152.440461] Code: 01 f2 02 f2 c7 40 08 04 f2 04 f2 c7 40 0c 04 f3 f3 f3 65 48 8b 04 25 28 00 00 00 48 89 84 24 00 01 00 00 48 89 c8 48 c1 e8 03 <0f> b6 04 10 84 c0 74 08 3c 03 0f 8e 98 19 00 00 8b 13 85 d2 74 57
[  152.442885] RSP: 0018:ffff88817a28f158 EFLAGS: 00010246
[  152.443851] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  152.444826] RDX: dffffc0000000000 RSI: ffffffff8500ae80 RDI: ffff88810a987900
[  152.445791] RBP: ffff888179d88240 R08: ffff888179d8845c R09: ffff888179d88240
[  152.446780] R10: ffffed102f451e48 R11: 00000000fffffff2 R12: ffff88810a987900
[  152.447741] R13: ffffffff8500ae80 R14: ffff88810a987900 R15: ffff888149b3c738
[  152.448756] FS:  00007f5eb2a34800(0000) GS:ffff88881ec00000(0000) knlGS:0000000000000000
[  152.449888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  152.450685] CR2: 000000000046ad19 CR3: 000000010b0bd006 CR4: 0000000000370ea0
[  152.451641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  152.452628] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  152.453588] Call Trace:
[  152.454032]  <TASK>
[  152.454447]  ? netlink_sendmsg+0x7a1/0xcb0
[  152.455109]  ? sock_sendmsg+0xc5/0x190
[  152.455689]  ? ____sys_sendmsg+0x535/0x6b0
[  152.456320]  ? ___sys_sendmsg+0xeb/0x170
[  152.456916]  ? do_syscall_64+0x3d/0x90
[  152.457529]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  152.458321]  ? ___sys_sendmsg+0xeb/0x170
[  152.458958]  ? __sys_sendmsg+0xb5/0x140
[  152.459564]  ? do_syscall_64+0x3d/0x90
[  152.460122]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  152.460852]  ? fl_dump_key_options.part.0+0xea0/0xea0 [cls_flower]
[  152.461710]  ? _raw_spin_lock+0x7a/0xd0
[  152.462299]  ? _raw_read_lock_irq+0x30/0x30
[  152.462924]  ? nla_put+0x15e/0x1c0
[  152.463480]  fl_dump+0x228/0x650 [cls_flower]
[  152.464112]  ? fl_tmplt_dump+0x210/0x210 [cls_flower]
[  152.464854]  ? __kmem_cache_alloc_node+0x1a7/0x330
[  152.465592]  ? nla_put+0x15e/0x1c0
[  152.466160]  tcf_fill_node+0x515/0x9a0
[  152.466766]  ? tc_setup_offload_action+0xf0/0xf0
[  152.467463]  ? __alloc_skb+0x13c/0x2a0
[  152.468067]  ? __build_skb_around+0x330/0x330
[  152.468814]  ? fl_get+0x107/0x1a0 [cls_flower]
[  152.469503]  tc_del_tfilter+0x718/0x1330
[  152.470115]  ? is_bpf_text_address+0xa/0x20
[  152.470765]  ? tc_ctl_chain+0xee0/0xee0
[  152.471335]  ? __kernel_text_address+0xe/0x30
[  152.471948]  ? unwind_get_return_address+0x56/0xa0
[  152.472639]  ? __thaw_task+0x150/0x150
[  152.473218]  ? arch_stack_walk+0x98/0xf0
[  152.473839]  ? __stack_depot_save+0x35/0x4c0
[  152.474501]  ? stack_trace_save+0x91/0xc0
[  152.475119]  ? security_capable+0x51/0x90
[  152.475741]  rtnetlink_rcv_msg+0x2c1/0x9d0
[  152.476387]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
[  152.477042]  ? __sys_sendmsg+0xb5/0x140
[  152.477664]  ? do_syscall_64+0x3d/0x90
[  152.478255]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  152.479010]  ? __stack_depot_save+0x35/0x4c0
[  152.479679]  ? __stack_depot_save+0x35/0x4c0
[  152.480346]  netlink_rcv_skb+0x12c/0x360
[  152.480929]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
[  152.481517]  ? do_syscall_64+0x3d/0x90
[  152.482061]  ? netlink_ack+0x1550/0x1550
[  152.482612]  ? rhashtable_walk_peek+0x170/0x170
[  152.483262]  ? kmem_cache_alloc_node+0x1af/0x390
[  152.483875]  ? _copy_from_iter+0x3d6/0xc70
[  152.484528]  netlink_unicast+0x553/0x790
[  152.485168]  ? netlink_attachskb+0x6a0/0x6a0
[  152.485848]  ? unwind_next_frame+0x11cc/0x1a10
[  152.486538]  ? arch_stack_walk+0x61/0xf0
[  152.487169]  netlink_sendmsg+0x7a1/0xcb0
[  152.487799]  ? netlink_unicast+0x790/0x790
[  152.488355]  ? iovec_from_user.part.0+0x4d/0x220
[  152.488990]  ? _raw_spin_lock+0x7a/0xd0
[  152.489598]  ? netlink_unicast+0x790/0x790
[  152.490236]  sock_sendmsg+0xc5/0x190
[  152.490796]  ____sys_sendmsg+0x535/0x6b0
[  152.491394]  ? import_iovec+0x7/0x10
[  152.491964]  ? kernel_sendmsg+0x30/0x30
[  152.492561]  ? __copy_msghdr+0x3c0/0x3c0
[  152.493160]  ? do_syscall_64+0x3d/0x90
[  152.493706]  ___sys_sendmsg+0xeb/0x170
[  152.494283]  ? may_open_dev+0xd0/0xd0
[  152.494858]  ? copy_msghdr_from_user+0x110/0x110
[  152.495541]  ? __handle_mm_fault+0x2678/0x4ad0
[  152.496205]  ? copy_page_range+0x2360/0x2360
[  152.496862]  ? __fget_light+0x57/0x520
[  152.497449]  ? mas_find+0x1c0/0x1c0
[  152.498026]  ? sockfd_lookup_light+0x1a/0x140
[  152.498703]  __sys_sendmsg+0xb5/0x140
[  152.499306]  ? __sys_sendmsg_sock+0x20/0x20
[  152.499951]  ? do_user_addr_fault+0x369/0xd80
[  152.500595]  do_syscall_64+0x3d/0x90
[  152.501185]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  152.501917] RIP: 0033:0x7f5eb294f887
[  152.502494] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[  152.505008] RSP: 002b:00007ffd2c708f78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  152.506152] RAX: ffffffffffffffda RBX: 00000000642d9472 RCX: 00007f5eb294f887
[  152.507134] RDX: 0000000000000000 RSI: 00007ffd2c708fe0 RDI: 0000000000000003
[  152.508113] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  152.509119] R10: 00007f5eb2808708 R11: 0000000000000246 R12: 0000000000000001
[  152.510068] R13: 0000000000000000 R14: 00007ffd2c70d1b8 R15: 0000000000485400
[  152.511031]  </TASK>
[  152.511444] Modules linked in: cls_flower sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay zram zsmalloc fuse [last unloaded: mlx5_core]
[  152.515720] ---[ end trace 0000000000000000 ]---

Fixes: 08a0063df3 ("net/sched: flower: Move filter handle initialization earlier")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-05 10:01:31 +01:00
Linus Torvalds 8e15605be8 9p patches for 6.4 merge window
This pull request includes a number of patches that didn't quite make
 the cut last merge window while we addressed some outstanding issues
 and review comments.  It includes some new caching modes for those that
 only want readahead caches and reworks how we do writeback caching so we
 are not keeping extra references around which both causes performance problems
 and uses lots of additional resources on the server.
 
 It also includes a new flag to force disabling of xattrs which can also
 cause major performance issues, particularly if the underlying filesystem
 on the server doesn't support them.
 
 Finally it adds a couple of additional mount options to better support
 directio and enabling caches when the server doesn't support qid.version.
 
 There was one late-breaking bug report that has also been included as its
 own patch where I forgot to propagate an embarassing bit-logic fix to the
 various variations of open.  Since that was only added to for-next a week
 ago, if you would like to not include it, I can include it in the first
 round of fixes for -rc2.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmRS/LEACgkQiP/V+0pf
 /5i84RAAqFtTYPYxg0VMLvjix0BCCE7CYR+vr7UsoGFeuVRyFsKh7G6fhmRXUhpG
 2kf+1CK+xzQd9PEKiQnVmGhib9SCdeWqb9EUpCgLZEmrSci0qUenzjf3Jg5Qgwhx
 u6xu9tipzHeHMFBD6n0f+j2fZEsDAv5IzgL14F9YfJQueVsL+HUOifLncruWnUEn
 rJAf7omhE1x+FfWsNaB22AksvYUXtQfoG3MgCPWql/XKlo/6xeW/8/eprutIO06M
 LUp3ie4RbSRZiE63SfPhxFMJCZ7g+R1JSKe1J8i9bbbwzYsVO8dovwdMCw0tP7ZQ
 QjVq+Ng4x0Rn5Bmekj18ua7zRsbl96DaoqcnYWKUOHkRVJleTG9t31MeZQaRC+gh
 F3321XkqDBHMXPVVLVQ1Gb1Pxt8dk9iFwmTMxktTrM4n5Zwv8ldMDKQgOhDIv9WA
 dDTjZ7mYpk2f9atxA0oys5oebQlT3D0CM/3p6n/PsXXpk30tat99kgcKlpN8SrL+
 Fs0UkXTQuu0Vin8zaMarQ2TW2UGQVM8qRD2gTooRnOItdqItx17czaE2ALOdIuVs
 LCbDxOXNPP/YbzakNUxh5ldI3Z3eahbilVfGa8ILfvprKnH7MaMmSo0rkP4XHj1h
 JDUiN7JOCOW4dvfaXa4knSIjs62Y9oTS0MWrO9Ajq8bg4ku6U7c=
 =qXIe
 -----END PGP SIGNATURE-----

Merge tag '9p-6.4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p updates from Eric Van Hensbergen:
 "This includes a number of patches that didn't quite make the cut last
  merge window while we addressed some outstanding issues and review
  comments. It includes some new caching modes for those that only want
  readahead caches and reworks how we do writeback caching so we are not
  keeping extra references around which both causes performance problems
  and uses lots of additional resources on the server.

  It also includes a new flag to force disabling of xattrs which can
  also cause major performance issues, particularly if the underlying
  filesystem on the server doesn't support them.

  Finally it adds a couple of additional mount options to better support
  directio and enabling caches when the server doesn't support
  qid.version.

  There was one late-breaking bug report that has also been included as
  its own patch where I forgot to propagate an embarassing bit-logic fix
  to the various variations of open"

* tag '9p-6.4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Fix bit operation logic error
  fs/9p: Rework cache modes and add new options to Documentation
  fs/9p: remove writeback fid and fix per-file modes
  fs/9p: Add new mount modes
  9p: Add additional debug flags and open modes
  fs/9p: allow disable of xattr support on mount
  fs/9p: Remove unnecessary superblock flags
  fs/9p: Consolidate file operations and add readahead and writeback
2023-05-04 14:37:53 -07:00
Jason Andryuk d7385ba137 9p: Remove INET dependency
9pfs can run over assorted transports, so it doesn't have an INET
dependency.  Drop it and remove the includes of linux/inet.h.

NET_9P_FD/trans_fd.o builds without INET or UNIX and is usable over
plain file descriptors.  However, tcp and unix functionality is still
built and would generate runtime failures if used.  Add imply INET and
UNIX to NET_9P_FD, so functionality is enabled by default but can still
be explicitly disabled.

This allows configuring 9pfs over Xen with INET and UNIX disabled.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-04 21:46:57 +01:00
Chuck Lever 29cd2927fb SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
Jiri Slaby says:
> I bisected to this ... as it breaks nfs3-only servers in 6.3.
> I.e. /etc/nfs.conf containing:
> [nfsd]
> vers4=no
>
> The client sees:
>  mount("10.0.2.15:/tmp", "/mnt", "nfs", 0, "vers=4.2,addr=10.0.2.15,clientad"...) = -1 EIO (Input/output error)
>  write(2, "mount.nfs: mount system call fai"..., 45
>  mount.nfs: mount system call failed for /mnt
>
> And the kernel says:
>  nfs4_discover_server_trunking unhandled error -5. Exiting with error EIO

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1210995
Fixes: 4bcf0343e8 ("SUNRPC: Set rq_accept_statp inside ->accept methods")
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-03 09:35:51 -04:00
Florian Westphal f057b63bc1 netfilter: nf_tables: fix ct untracked match breakage
"ct untracked" no longer works properly due to erroneous NFT_BREAK.
We have to check ctinfo enum first.

Fixes: d9e7891476 ("netfilter: nf_tables: avoid retpoline overhead for some ct expression calls")
Reported-by: Rvfg <i@rvf6.com>
Link: https://marc.info/?l=netfilter&m=168294996212038&w=2
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-03 13:49:08 +02:00
Kuniyuki Iwashima 6a341729fb af_packet: Don't send zero-byte data in packet_sendmsg_spkt().
syzkaller reported a warning below [0].

We can reproduce it by sending 0-byte data from the (AF_PACKET,
SOCK_PACKET) socket via some devices whose dev->hard_header_len
is 0.

    struct sockaddr_pkt addr = {
        .spkt_family = AF_PACKET,
        .spkt_device = "tun0",
    };
    int fd;

    fd = socket(AF_PACKET, SOCK_PACKET, 0);
    sendto(fd, NULL, 0, 0, (struct sockaddr *)&addr, sizeof(addr));

We have a similar fix for the (AF_PACKET, SOCK_RAW) socket as
commit dc633700f0 ("net/af_packet: check len when min_header_len
equals to 0").

Let's add the same test for the SOCK_PACKET socket.

[0]:
skb_assert_len
WARNING: CPU: 1 PID: 19945 at include/linux/skbuff.h:2552 skb_assert_len include/linux/skbuff.h:2552 [inline]
WARNING: CPU: 1 PID: 19945 at include/linux/skbuff.h:2552 __dev_queue_xmit+0x1f26/0x31d0 net/core/dev.c:4159
Modules linked in:
CPU: 1 PID: 19945 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:skb_assert_len include/linux/skbuff.h:2552 [inline]
RIP: 0010:__dev_queue_xmit+0x1f26/0x31d0 net/core/dev.c:4159
Code: 89 de e8 1d a2 85 fd 84 db 75 21 e8 64 a9 85 fd 48 c7 c6 80 2a 1f 86 48 c7 c7 c0 06 1f 86 c6 05 23 cf 27 04 01 e8 fa ee 56 fd <0f> 0b e8 43 a9 85 fd 0f b6 1d 0f cf 27 04 31 ff 89 de e8 e3 a1 85
RSP: 0018:ffff8880217af6e0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc90001133000
RDX: 0000000000040000 RSI: ffffffff81186922 RDI: 0000000000000001
RBP: ffff8880217af8b0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff888030045640
R13: ffff8880300456b0 R14: ffff888030045650 R15: ffff888030045718
FS:  00007fc5864da640(0000) GS:ffff88806cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020005740 CR3: 000000003f856003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 dev_queue_xmit include/linux/netdevice.h:3085 [inline]
 packet_sendmsg_spkt+0xc4b/0x1230 net/packet/af_packet.c:2066
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg+0x1b4/0x200 net/socket.c:747
 ____sys_sendmsg+0x331/0x970 net/socket.c:2503
 ___sys_sendmsg+0x11d/0x1c0 net/socket.c:2557
 __sys_sendmmsg+0x18c/0x430 net/socket.c:2643
 __do_sys_sendmmsg net/socket.c:2672 [inline]
 __se_sys_sendmmsg net/socket.c:2669 [inline]
 __x64_sys_sendmmsg+0x9c/0x100 net/socket.c:2669
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fc58791de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007fc5864d9cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007fc58791de5d
RDX: 0000000000000001 RSI: 0000000020005740 RDI: 0000000000000004
RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fc58797e530 R15: 0000000000000000
 </TASK>
---[ end trace 0000000000000000 ]---
skb len=0 headroom=16 headlen=0 tailroom=304
mac=(16,0) net=(16,-1) trans=-1
shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
csum(0x0 ip_summed=0 complete_sw=0 valid=0 level=0)
hash(0x0 sw=0 l4=0) proto=0x0000 pkttype=0 iif=0
dev name=sit0 feat=0x00000006401d7869
sk family=17 type=10 proto=0

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-03 09:20:18 +01:00
Ido Schimmel 9ad685dbfe ethtool: Fix uninitialized number of lanes
It is not possible to set the number of lanes when setting link modes
using the legacy IOCTL ethtool interface. Since 'struct
ethtool_link_ksettings' is not initialized in this path, drivers receive
an uninitialized number of lanes in 'struct
ethtool_link_ksettings::lanes'.

When this information is later queried from drivers, it results in the
ethtool code making decisions based on uninitialized memory, leading to
the following KMSAN splat [1]. In practice, this most likely only
happens with the tun driver that simply returns whatever it got in the
set operation.

As far as I can tell, this uninitialized memory is not leaked to user
space thanks to the 'ethtool_ops->cap_link_lanes_supported' check in
linkmodes_prepare_data().

Fix by initializing the structure in the IOCTL path. Did not find any
more call sites that pass an uninitialized structure when calling
'ethtool_ops::set_link_ksettings()'.

[1]
BUG: KMSAN: uninit-value in ethnl_update_linkmodes net/ethtool/linkmodes.c:273 [inline]
BUG: KMSAN: uninit-value in ethnl_set_linkmodes+0x190b/0x19d0 net/ethtool/linkmodes.c:333
 ethnl_update_linkmodes net/ethtool/linkmodes.c:273 [inline]
 ethnl_set_linkmodes+0x190b/0x19d0 net/ethtool/linkmodes.c:333
 ethnl_default_set_doit+0x88d/0xde0 net/ethtool/netlink.c:640
 genl_family_rcv_msg_doit net/netlink/genetlink.c:968 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:1048 [inline]
 genl_rcv_msg+0x141a/0x14c0 net/netlink/genetlink.c:1065
 netlink_rcv_skb+0x3f8/0x750 net/netlink/af_netlink.c:2577
 genl_rcv+0x40/0x60 net/netlink/genetlink.c:1076
 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
 netlink_unicast+0xf41/0x1270 net/netlink/af_netlink.c:1365
 netlink_sendmsg+0x127d/0x1430 net/netlink/af_netlink.c:1942
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg net/socket.c:747 [inline]
 ____sys_sendmsg+0xa24/0xe40 net/socket.c:2501
 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2555
 __sys_sendmsg net/socket.c:2584 [inline]
 __do_sys_sendmsg net/socket.c:2593 [inline]
 __se_sys_sendmsg net/socket.c:2591 [inline]
 __x64_sys_sendmsg+0x36b/0x540 net/socket.c:2591
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Uninit was stored to memory at:
 tun_get_link_ksettings+0x37/0x60 drivers/net/tun.c:3544
 __ethtool_get_link_ksettings+0x17b/0x260 net/ethtool/ioctl.c:441
 ethnl_set_linkmodes+0xee/0x19d0 net/ethtool/linkmodes.c:327
 ethnl_default_set_doit+0x88d/0xde0 net/ethtool/netlink.c:640
 genl_family_rcv_msg_doit net/netlink/genetlink.c:968 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:1048 [inline]
 genl_rcv_msg+0x141a/0x14c0 net/netlink/genetlink.c:1065
 netlink_rcv_skb+0x3f8/0x750 net/netlink/af_netlink.c:2577
 genl_rcv+0x40/0x60 net/netlink/genetlink.c:1076
 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
 netlink_unicast+0xf41/0x1270 net/netlink/af_netlink.c:1365
 netlink_sendmsg+0x127d/0x1430 net/netlink/af_netlink.c:1942
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg net/socket.c:747 [inline]
 ____sys_sendmsg+0xa24/0xe40 net/socket.c:2501
 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2555
 __sys_sendmsg net/socket.c:2584 [inline]
 __do_sys_sendmsg net/socket.c:2593 [inline]
 __se_sys_sendmsg net/socket.c:2591 [inline]
 __x64_sys_sendmsg+0x36b/0x540 net/socket.c:2591
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Uninit was stored to memory at:
 tun_set_link_ksettings+0x37/0x60 drivers/net/tun.c:3553
 ethtool_set_link_ksettings+0x600/0x690 net/ethtool/ioctl.c:609
 __dev_ethtool net/ethtool/ioctl.c:3024 [inline]
 dev_ethtool+0x1db9/0x2a70 net/ethtool/ioctl.c:3078
 dev_ioctl+0xb07/0x1270 net/core/dev_ioctl.c:524
 sock_do_ioctl+0x295/0x540 net/socket.c:1213
 sock_ioctl+0x729/0xd90 net/socket.c:1316
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:870 [inline]
 __se_sys_ioctl+0x222/0x400 fs/ioctl.c:856
 __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:856
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Local variable link_ksettings created at:
 ethtool_set_link_ksettings+0x54/0x690 net/ethtool/ioctl.c:577
 __dev_ethtool net/ethtool/ioctl.c:3024 [inline]
 dev_ethtool+0x1db9/0x2a70 net/ethtool/ioctl.c:3078

Fixes: 012ce4dd31 ("ethtool: Extend link modes settings uAPI with lanes")
Reported-and-tested-by: syzbot+ef6edd9f1baaa54d6235@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000004bb41105fa70f361@google.com/
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-03 09:13:20 +01:00
Pablo Neira Ayuso c1592a8994 netfilter: nf_tables: deactivate anonymous set from preparation phase
Toggle deleted anonymous sets as inactive in the next generation, so
users cannot perform any update on it. Clear the generation bitmask
in case the transaction is aborted.

The following KASAN splat shows a set element deletion for a bound
anonymous set that has been already removed in the same transaction.

[   64.921510] ==================================================================
[   64.923123] BUG: KASAN: wild-memory-access in nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.924745] Write of size 8 at addr dead000000000122 by task test/890
[   64.927903] CPU: 3 PID: 890 Comm: test Not tainted 6.3.0+ #253
[   64.931120] Call Trace:
[   64.932699]  <TASK>
[   64.934292]  dump_stack_lvl+0x33/0x50
[   64.935908]  ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.937551]  kasan_report+0xda/0x120
[   64.939186]  ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.940814]  nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.942452]  ? __kasan_slab_alloc+0x2d/0x60
[   64.944070]  ? nf_tables_setelem_notify+0x190/0x190 [nf_tables]
[   64.945710]  ? kasan_set_track+0x21/0x30
[   64.947323]  nfnetlink_rcv_batch+0x709/0xd90 [nfnetlink]
[   64.948898]  ? nfnetlink_rcv_msg+0x480/0x480 [nfnetlink]

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-03 08:24:32 +02:00
Pablo Neira Ayuso 8509f62b0b netfilter: nf_tables: hit ENOENT on unexisting chain/flowtable update with missing attributes
If user does not specify hook number and priority, then assume this is
a chain/flowtable update. Therefore, report ENOENT which provides a
better hint than EINVAL. Set on extended netlink error report to refer
to the chain name.

Fixes: 5b6743fb2c ("netfilter: nf_tables: skip flowtable hooknum and priority on device updates")
Fixes: 5efe72698a97 ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-03 08:24:31 +02:00
Yonghong Song bf6882aebd bpf: Emit struct bpf_tcp_sock type in vmlinux BTF
In one of our internal testing, we found a case where
  - uapi struct bpf_tcp_sock is in vmlinux.h where vmlinux.h is not
    generated from the testing kernel
  - struct bpf_tcp_sock is not in vmlinux BTF

The above combination caused bpf load failure as the following
memory access
  struct bpf_tcp_sock *tcp_sock = ...;
  ... tcp_sock->snd_cwnd ...
needs CORE relocation but the relocation cannot be resolved since
the kernel BTF does not have corresponding type.

Similar to other previous cases (nf_conn___init, tcp6_sock, mctcp_sock, etc.),
add the type to vmlinux BTF with BTF_EMIT_TYPE macro.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230502180543.1832140-1-yhs@fb.com
2023-05-02 16:19:06 -07:00
Ard Biesheuvel af97b7dfb0 SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
Scott reports SUNRPC self-test failures regarding the output IV on arm64
when using the SIMD accelerated implementation of AES in CBC mode with
ciphertext stealing ("cts(cbc(aes))" in crypto API speak).

These failures are the result of the fact that, while RFC 3962 does
specify what the output IV should be and includes test vectors for it,
the general concept of an output IV is poorly defined, and generally,
not specified by the various algorithms implemented by the crypto API.
Only algorithms that support transparent chaining (e.g., CBC mode on a
block boundary) have requirements on the output IV, but ciphertext
stealing (CTS) is fundamentally about how to encapsulate CBC in a way
where the length of the entire message may not be an integral multiple
of the cipher block size, and the concept of an output IV does not exist
here because it has no defined purpose past the end of the message.

The generic CTS template takes advantage of this chaining capability of
the CBC implementations, and as a result, happens to return an output
IV, simply because it passes its IV buffer directly to the encapsulated
CBC implementation, which operates on full blocks only, and always
returns an IV. This output IV happens to match how RFC 3962 defines it,
even though the CTS template itself does not contain any output IV logic
whatsoever, and, for this reason, lacks any test vectors that exercise
this accidental output IV generation.

The arm64 SIMD implementation of cts(cbc(aes)) does not use the generic
CTS template at all, but instead, implements the CBC mode and ciphertext
stealing directly, and therefore does not encapsule a CBC implementation
that returns an output IV in the same way. The arm64 SIMD implementation
complies with the specification and passes all internal tests, but when
invoked by the SUNRPC code, fails to produce the expected output IV and
causes its selftests to fail.

Given that the output IV is defined as the penultimate block (where the
final block may smaller than the block size), we can quite easily derive
it in the caller by copying the appropriate slice of ciphertext after
encryption.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Reported-by: Scott Mayhew <smayhew@redhat.com>
Tested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-02 11:35:04 -04:00
David Howells db099c625b rxrpc: Fix timeout of a call that hasn't yet been granted a channel
afs_make_call() calls rxrpc_kernel_begin_call() to begin a call (which may
get stalled in the background waiting for a connection to become
available); it then calls rxrpc_kernel_set_max_life() to set the timeouts -
but that starts the call timer so the call timer might then expire before
we get a connection assigned - leading to the following oops if the call
stalled:

	BUG: kernel NULL pointer dereference, address: 0000000000000000
	...
	CPU: 1 PID: 5111 Comm: krxrpcio/0 Not tainted 6.3.0-rc7-build3+ #701
	RIP: 0010:rxrpc_alloc_txbuf+0xc0/0x157
	...
	Call Trace:
	 <TASK>
	 rxrpc_send_ACK+0x50/0x13b
	 rxrpc_input_call_event+0x16a/0x67d
	 rxrpc_io_thread+0x1b6/0x45f
	 ? _raw_spin_unlock_irqrestore+0x1f/0x35
	 ? rxrpc_input_packet+0x519/0x519
	 kthread+0xe7/0xef
	 ? kthread_complete_and_exit+0x1b/0x1b
	 ret_from_fork+0x22/0x30

Fix this by noting the timeouts in struct rxrpc_call when the call is
created.  The timer will be started when the first packet is transmitted.

It shouldn't be possible to trigger this directly from userspace through
AF_RXRPC as sendmsg() will return EBUSY if the call is in the
waiting-for-conn state if it dropped out of the wait due to a signal.

Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-01 07:43:19 +01:00
David Howells 0eb362d254 rxrpc: Make it so that a waiting process can be aborted
When sendmsg() creates an rxrpc call, it queues it to wait for a connection
and channel to be assigned and then waits before it can start shovelling
data as the encrypted DATA packet content includes a summary of the
connection parameters.

However, sendmsg() may get interrupted before a connection gets assigned
and further sendmsg() calls will fail with EBUSY until an assignment is
made.

Fix this so that the call can at least be aborted without failing on
EBUSY.  We have to be careful here as sendmsg() mustn't be allowed to start
the call timer if the call doesn't yet have a connection assigned as an
oops may follow shortly thereafter.

Fixes: 540b1c48c3 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-01 07:43:19 +01:00
David Howells 0d098d83c5 rxrpc: Fix hard call timeout units
The hard call timeout is specified in the RXRPC_SET_CALL_TIMEOUT cmsg in
seconds, so fix the point at which sendmsg() applies it to the call to
convert to jiffies from seconds, not milliseconds.

Fixes: a158bdd324 ("rxrpc: Fix timeout of a call that hasn't yet been granted a channel")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-01 07:43:19 +01:00
Victor Nogueira 526f28bd0f net/sched: act_mirred: Add carrier check
There are cases where the device is adminstratively UP, but operationally
down. For example, we have a physical device (Nvidia ConnectX-6 Dx, 25Gbps)
who's cable was pulled out, here is its ip link output:

5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether b8:ce:f6:4b:68:35 brd ff:ff:ff:ff:ff:ff
    altname enp179s0f1np1

As you can see, it's administratively UP but operationally down.
In this case, sending a packet to this port caused a nasty kernel hang (so
nasty that we were unable to capture it). Aborting a transmit based on
operational status (in addition to administrative status) fixes the issue.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
v1->v2: Add fixes tag
v2->v3: Remove blank line between tags + add change log, suggested by Leon
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-01 07:26:10 +01:00
Linus Torvalds 4e1c80ae5c NFSD 6.4 Release Notes
The big ticket item for this release is support for RPC-with-TLS
 [RFC 9289] has been added to the Linux NFS server. The goal is to
 provide a simple-to-deploy, low-overhead in-transit confidentiality
 and peer authentication mechanism. It can supplement NFS Kerberos
 and it can protect the use of legacy non-cryptographic user
 authentication flavors such as AUTH_SYS. The TLS Record protocol is
 handled entirely by kTLS, meaning it can use either software
 encryption or offload encryption to smart NICs.
 
 Work continues on improving NFSD's open file cache. Among the many
 clean-ups in that area is a patch to convert the rhashtable to use
 the list-hashing version of that data structure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
 f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
 06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
 /LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
 Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
 /qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
 J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
 Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
 1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
 R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
 C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
 ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
 =lA2+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The big ticket item for this release is that support for RPC-with-TLS
  [RFC 9289] has been added to the Linux NFS server.

  The goal is to provide a simple-to-deploy, low-overhead in-transit
  confidentiality and peer authentication mechanism. It can supplement
  NFS Kerberos and it can protect the use of legacy non-cryptographic
  user authentication flavors such as AUTH_SYS. The TLS Record protocol
  is handled entirely by kTLS, meaning it can use either software
  encryption or offload encryption to smart NICs.

  Aside from that, work continues on improving NFSD's open file cache.
  Among the many clean-ups in that area is a patch to convert the
  rhashtable to use the list-hashing version of that data structure"

* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  NFSD: Handle new xprtsec= export option
  SUNRPC: Support TLS handshake in the server-side TCP socket code
  NFSD: Clean up xattr memory allocation flags
  NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
  SUNRPC: Clear rq_xid when receiving a new RPC Call
  SUNRPC: Recognize control messages in server-side TCP socket code
  SUNRPC: Be even lazier about releasing pages
  SUNRPC: Convert svc_xprt_release() to the release_pages() API
  SUNRPC: Relocate svc_free_res_pages()
  nfsd: simplify the delayed disposal list code
  SUNRPC: Ignore return value of ->xpo_sendto
  SUNRPC: Ensure server-side sockets have a sock->file
  NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
  sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
  SUNRPC: return proper error from get_expiry()
  lockd: add some client-side tracepoints
  nfs: move nfs_fhandle_hash to common include file
  lockd: server should unlock lock if client rejects the grant
  lockd: fix races in client GRANTED_MSG wait logic
  lockd: move struct nlm_wait to lockd.h
  ...
2023-04-29 11:04:14 -07:00
Linus Torvalds 0127f25b5d NFS Client Updates for Linux 6.4
New Features:
   * Convert the readdir path to use folios
   * Convert the NFS fscache code to use netfs
 
 Bugfixes and Cleanups:
   * Always send a RECLAIM_COMPLETE after establishing a lease
   * Simplify sysctl registrations and other cleanups
   * Handle out-of-order write replies on NFS v3
   * Have sunrpc call_bind_status use standard hard/soft task semantics
   * Other minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRMI04ACgkQ18tUv7Cl
 QOuCNQ//SkQm8aOM4DkYFeDIObye6xMzgtWrB25grYNG4a/DcYqb5kNcbmI5l1tE
 Tus8KMZAWSpwa0m8ALctzp+pZQWQkY/svsqqHrKIGUHBI8F0OinVCqc2MzNN75WX
 m/1wELW6ek9RBL5BoJtAPt+Qu8/jP6KD64Zot7snBeUrzreaZDcz0HM+EcQhi7X7
 qd5XS0/cA2eLEBBQcQdFpRhHvgW12BMYM/zp3/ER5H52L2iAlZunGWw+Nqs8ueOR
 D7K2+CF1sV1k6hYbLWNoaF2J6PZr5dRpc6gSq4fLP4WUKjqQwmQp8cm9iLpf6jGa
 a+Y7t8aj7vup8jVCVGWYWZA2G2gi6jWmxxWudkJwfAa1E45t1B4/C0udwlxR20OO
 XI2Bhe5YwTURgSOvOS9QTZJpQN4qfpEL0NoAmAT5fAHBQ2CXDrMlSIxPS7U6LO9q
 YqwIHcAHvYVnbD45IUh2Zjbp65mRb1VkU6WzOyK1/sNHEyYpubIWXB/yLaA3oGge
 V3xUgvlTzLVzzyQfwiRfzAD1P5/USaXE/B36c4itfCr5rJnAfsiBP3gk0o9yq18J
 3Yb6olrmc9CzeA7PN88uEus4VZHbaE9OktRFIjJ22jlLQEY4xougdE5asY1XX8F+
 OKLLLeeCrsbvrANB9XcLVsLqdMYvsd0VaCX9HtN3UP+7Lod5T10=
 =gpBC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the readdir path to use folios

   - Convert the NFS fscache code to use netfs

  Bugfixes and Cleanups:

   - Always send a RECLAIM_COMPLETE after establishing a lease

   - Simplify sysctl registrations and other cleanups

   - Handle out-of-order write replies on NFS v3

   - Have sunrpc call_bind_status use standard hard/soft task semantics

   - Other minor cleanups"

* tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Rework scratch handling for READ_PLUS
  NFS: Cleanup unused rpc_clnt variable
  NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
  SUNRPC: remove the maximum number of retries in call_bind_status
  NFS: Convert readdir page array functions to use a folio
  NFS: Convert the readdir array-of-pages into an array-of-folios
  NFSv3: handle out-of-order write replies.
  NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
  NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
  NFS: Convert buffered read paths to use netfs when fscache is enabled
  NFS: Configure support for netfs when NFS fscache is configured
  NFS: Rename readpage_async_filler to nfs_read_add_folio
  sunrpc: simplify one-level sysctl registration for debug_table
  sunrpc: move sunrpc_table and proc routines above
  sunrpc: simplify one-level sysctl registration for xs_tunables_table
  sunrpc: simplify one-level sysctl registration for xr_tunables_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
2023-04-29 10:58:44 -07:00
Antoine Tenart dc6456e938 net: ipv6: fix skb hash for some RST packets
The skb hash comes from sk->sk_txhash when using TCP, except for some
IPv6 RST packets. This is because in tcp_v6_send_reset when not in
TIME_WAIT the hash is taken from sk->sk_hash, while it should come from
sk->sk_txhash as those two hashes are not computed the same way.

Packetdrill script to test the above,

   0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
  +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

  +0 > (flowlabel 0x1) S 0:0(0) <...>

  // Wrong ack seq, trigger a rst.
  +0 < S. 0:0(0) ack 0 win 4000

  // Check the flowlabel matches prior one from SYN.
  +0 > (flowlabel 0x1) R 0:0(0) <...>

Fixes: 9258b8b1be ("ipv6: tcp: send consistent autoflowlabel in RST packets")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-28 09:53:43 +01:00
Cong Wang c88f8d5cd9 sit: update dev->needed_headroom in ipip6_tunnel_bind_dev()
When a tunnel device is bound with the underlying device, its
dev->needed_headroom needs to be updated properly. IPv4 tunnels
already do the same in ip_tunnel_bind_dev(). Otherwise we may
not have enough header room for skb, especially after commit
b17f709a24 ("gue: TX support for using remote checksum offload option").

Fixes: 32b8a8e59c ("sit: add IPv4 over IPv4 support")
Reported-by: Palash Oswal <oswalpalash@gmail.com>
Link: https://lore.kernel.org/netdev/CAGyP=7fDcSPKu6nttbGwt7RXzE3uyYxLjCSE97J64pRxJP8jPA@mail.gmail.com/
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-28 09:48:14 +01:00
Vlad Buslov da94a7781f net/sched: cls_api: remove block_cb from driver_list before freeing
Error handler of tcf_block_bind() frees the whole bo->cb_list on error.
However, by that time the flow_block_cb instances are already in the driver
list because driver ndo_setup_tc() callback is called before that up the
call chain in tcf_block_offload_cmd(). This leaves dangling pointers to
freed objects in the list and causes use-after-free[0]. Fix it by also
removing flow_block_cb instances from driver_list before deallocating them.

[0]:
[  279.868433] ==================================================================
[  279.869964] BUG: KASAN: slab-use-after-free in flow_block_cb_setup_simple+0x631/0x7c0
[  279.871527] Read of size 8 at addr ffff888147e2bf20 by task tc/2963

[  279.873151] CPU: 6 PID: 2963 Comm: tc Not tainted 6.3.0-rc6+ #4
[  279.874273] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  279.876295] Call Trace:
[  279.876882]  <TASK>
[  279.877413]  dump_stack_lvl+0x33/0x50
[  279.878198]  print_report+0xc2/0x610
[  279.878987]  ? flow_block_cb_setup_simple+0x631/0x7c0
[  279.879994]  kasan_report+0xae/0xe0
[  279.880750]  ? flow_block_cb_setup_simple+0x631/0x7c0
[  279.881744]  ? mlx5e_tc_reoffload_flows_work+0x240/0x240 [mlx5_core]
[  279.883047]  flow_block_cb_setup_simple+0x631/0x7c0
[  279.884027]  tcf_block_offload_cmd.isra.0+0x189/0x2d0
[  279.885037]  ? tcf_block_setup+0x6b0/0x6b0
[  279.885901]  ? mutex_lock+0x7d/0xd0
[  279.886669]  ? __mutex_unlock_slowpath.constprop.0+0x2d0/0x2d0
[  279.887844]  ? ingress_init+0x1c0/0x1c0 [sch_ingress]
[  279.888846]  tcf_block_get_ext+0x61c/0x1200
[  279.889711]  ingress_init+0x112/0x1c0 [sch_ingress]
[  279.890682]  ? clsact_init+0x2b0/0x2b0 [sch_ingress]
[  279.891701]  qdisc_create+0x401/0xea0
[  279.892485]  ? qdisc_tree_reduce_backlog+0x470/0x470
[  279.893473]  tc_modify_qdisc+0x6f7/0x16d0
[  279.894344]  ? tc_get_qdisc+0xac0/0xac0
[  279.895213]  ? mutex_lock+0x7d/0xd0
[  279.896005]  ? __mutex_lock_slowpath+0x10/0x10
[  279.896910]  rtnetlink_rcv_msg+0x5fe/0x9d0
[  279.897770]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
[  279.898672]  ? __sys_sendmsg+0xb5/0x140
[  279.899494]  ? do_syscall_64+0x3d/0x90
[  279.900302]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  279.901337]  ? kasan_save_stack+0x2e/0x40
[  279.902177]  ? kasan_save_stack+0x1e/0x40
[  279.903058]  ? kasan_set_track+0x21/0x30
[  279.903913]  ? kasan_save_free_info+0x2a/0x40
[  279.904836]  ? ____kasan_slab_free+0x11a/0x1b0
[  279.905741]  ? kmem_cache_free+0x179/0x400
[  279.906599]  netlink_rcv_skb+0x12c/0x360
[  279.907450]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
[  279.908360]  ? netlink_ack+0x1550/0x1550
[  279.909192]  ? rhashtable_walk_peek+0x170/0x170
[  279.910135]  ? kmem_cache_alloc_node+0x1af/0x390
[  279.911086]  ? _copy_from_iter+0x3d6/0xc70
[  279.912031]  netlink_unicast+0x553/0x790
[  279.912864]  ? netlink_attachskb+0x6a0/0x6a0
[  279.913763]  ? netlink_recvmsg+0x416/0xb50
[  279.914627]  netlink_sendmsg+0x7a1/0xcb0
[  279.915473]  ? netlink_unicast+0x790/0x790
[  279.916334]  ? iovec_from_user.part.0+0x4d/0x220
[  279.917293]  ? netlink_unicast+0x790/0x790
[  279.918159]  sock_sendmsg+0xc5/0x190
[  279.918938]  ____sys_sendmsg+0x535/0x6b0
[  279.919813]  ? import_iovec+0x7/0x10
[  279.920601]  ? kernel_sendmsg+0x30/0x30
[  279.921423]  ? __copy_msghdr+0x3c0/0x3c0
[  279.922254]  ? import_iovec+0x7/0x10
[  279.923041]  ___sys_sendmsg+0xeb/0x170
[  279.923854]  ? copy_msghdr_from_user+0x110/0x110
[  279.924797]  ? ___sys_recvmsg+0xd9/0x130
[  279.925630]  ? __perf_event_task_sched_in+0x183/0x470
[  279.926656]  ? ___sys_sendmsg+0x170/0x170
[  279.927529]  ? ctx_sched_in+0x530/0x530
[  279.928369]  ? update_curr+0x283/0x4f0
[  279.929185]  ? perf_event_update_userpage+0x570/0x570
[  279.930201]  ? __fget_light+0x57/0x520
[  279.931023]  ? __switch_to+0x53d/0xe70
[  279.931846]  ? sockfd_lookup_light+0x1a/0x140
[  279.932761]  __sys_sendmsg+0xb5/0x140
[  279.933560]  ? __sys_sendmsg_sock+0x20/0x20
[  279.934436]  ? fpregs_assert_state_consistent+0x1d/0xa0
[  279.935490]  do_syscall_64+0x3d/0x90
[  279.936300]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  279.937311] RIP: 0033:0x7f21c814f887
[  279.938085] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[  279.941448] RSP: 002b:00007fff11efd478 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  279.942964] RAX: ffffffffffffffda RBX: 0000000064401979 RCX: 00007f21c814f887
[  279.944337] RDX: 0000000000000000 RSI: 00007fff11efd4e0 RDI: 0000000000000003
[  279.945660] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  279.947003] R10: 00007f21c8008708 R11: 0000000000000246 R12: 0000000000000001
[  279.948345] R13: 0000000000409980 R14: 000000000047e538 R15: 0000000000485400
[  279.949690]  </TASK>

[  279.950706] Allocated by task 2960:
[  279.951471]  kasan_save_stack+0x1e/0x40
[  279.952338]  kasan_set_track+0x21/0x30
[  279.953165]  __kasan_kmalloc+0x77/0x90
[  279.954006]  flow_block_cb_setup_simple+0x3dd/0x7c0
[  279.955001]  tcf_block_offload_cmd.isra.0+0x189/0x2d0
[  279.956020]  tcf_block_get_ext+0x61c/0x1200
[  279.956881]  ingress_init+0x112/0x1c0 [sch_ingress]
[  279.957873]  qdisc_create+0x401/0xea0
[  279.958656]  tc_modify_qdisc+0x6f7/0x16d0
[  279.959506]  rtnetlink_rcv_msg+0x5fe/0x9d0
[  279.960392]  netlink_rcv_skb+0x12c/0x360
[  279.961216]  netlink_unicast+0x553/0x790
[  279.962044]  netlink_sendmsg+0x7a1/0xcb0
[  279.962906]  sock_sendmsg+0xc5/0x190
[  279.963702]  ____sys_sendmsg+0x535/0x6b0
[  279.964534]  ___sys_sendmsg+0xeb/0x170
[  279.965343]  __sys_sendmsg+0xb5/0x140
[  279.966132]  do_syscall_64+0x3d/0x90
[  279.966908]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  279.968407] Freed by task 2960:
[  279.969114]  kasan_save_stack+0x1e/0x40
[  279.969929]  kasan_set_track+0x21/0x30
[  279.970729]  kasan_save_free_info+0x2a/0x40
[  279.971603]  ____kasan_slab_free+0x11a/0x1b0
[  279.972483]  __kmem_cache_free+0x14d/0x280
[  279.973337]  tcf_block_setup+0x29d/0x6b0
[  279.974173]  tcf_block_offload_cmd.isra.0+0x226/0x2d0
[  279.975186]  tcf_block_get_ext+0x61c/0x1200
[  279.976080]  ingress_init+0x112/0x1c0 [sch_ingress]
[  279.977065]  qdisc_create+0x401/0xea0
[  279.977857]  tc_modify_qdisc+0x6f7/0x16d0
[  279.978695]  rtnetlink_rcv_msg+0x5fe/0x9d0
[  279.979562]  netlink_rcv_skb+0x12c/0x360
[  279.980388]  netlink_unicast+0x553/0x790
[  279.981214]  netlink_sendmsg+0x7a1/0xcb0
[  279.982043]  sock_sendmsg+0xc5/0x190
[  279.982827]  ____sys_sendmsg+0x535/0x6b0
[  279.983703]  ___sys_sendmsg+0xeb/0x170
[  279.984510]  __sys_sendmsg+0xb5/0x140
[  279.985298]  do_syscall_64+0x3d/0x90
[  279.986076]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  279.987532] The buggy address belongs to the object at ffff888147e2bf00
                which belongs to the cache kmalloc-192 of size 192
[  279.989747] The buggy address is located 32 bytes inside of
                freed 192-byte region [ffff888147e2bf00, ffff888147e2bfc0)

[  279.992367] The buggy address belongs to the physical page:
[  279.993430] page:00000000550f405c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x147e2a
[  279.995182] head:00000000550f405c order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  279.996713] anon flags: 0x200000000010200(slab|head|node=0|zone=2)
[  279.997878] raw: 0200000000010200 ffff888100042a00 0000000000000000 dead000000000001
[  279.999384] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[  280.000894] page dumped because: kasan: bad access detected

[  280.002386] Memory state around the buggy address:
[  280.003338]  ffff888147e2be00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  280.004781]  ffff888147e2be80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  280.006224] >ffff888147e2bf00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  280.007700]                                ^
[  280.008592]  ffff888147e2bf80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  280.010035]  ffff888147e2c000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  280.011564] ==================================================================

Fixes: 59094b1e50 ("net: sched: use flow block API")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-28 09:44:19 +01:00
Eric Dumazet 7e692df393 tcp: fix skb_copy_ubufs() vs BIG TCP
David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
using hugepages, and skb length bigger than ~68 KB.

skb_copy_ubufs() assumed it could copy all payload using up to
MAX_SKB_FRAGS order-0 pages.

This assumption broke when BIG TCP was able to put up to 512 KB per skb.

We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
and limit gso_max_size to 180000.

A solution is to use higher order pages if needed.

v2: add missing __GFP_COMP, or we leak memory.

Fixes: 7c4e983c4f ("net: allow gso_max_size to exceed 65536")
Reported-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-28 09:40:38 +01:00
Cosmo Chou 6f75cd166a net/ncsi: clear Tx enable mode when handling a Config required AEN
ncsi_channel_is_tx() determines whether a given channel should be
used for Tx or not. However, when reconfiguring the channel by
handling a Configuration Required AEN, there is a misjudgment that
the channel Tx has already been enabled, which results in the Enable
Channel Network Tx command not being sent.

Clear the channel Tx enable flag before reconfiguring the channel to
avoid the misjudgment.

Fixes: 8d951a75d0 ("net/ncsi: Configure multi-package, multi-channel modes with failover")
Signed-off-by: Cosmo Chou <chou.cosmo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-28 09:35:33 +01:00
Linus Torvalds 7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds b6a7828502 modules-6.4-rc1
The summary of the changes for this pull requests is:
 
  * Song Liu's new struct module_memory replacement
  * Nick Alcock's MODULE_LICENSE() removal for non-modules
  * My cleanups and enhancements to reduce the areas where we vmalloc
    module memory for duplicates, and the respective debug code which
    proves the remaining vmalloc pressure comes from userspace.
 
 Most of the changes have been in linux-next for quite some time except
 the minor fixes I made to check if a module was already loaded
 prior to allocating the final module memory with vmalloc and the
 respective debug code it introduces to help clarify the issue. Although
 the functional change is small it is rather safe as it can only *help*
 reduce vmalloc space for duplicates and is confirmed to fix a bootup
 issue with over 400 CPUs with KASAN enabled. I don't expect stable
 kernels to pick up that fix as the cleanups would have also had to have
 been picked up. Folks on larger CPU systems with modules will want to
 just upgrade if vmalloc space has been an issue on bootup.
 
 Given the size of this request, here's some more elaborate details
 on this pull request.
 
 The functional change change in this pull request is the very first
 patch from Song Liu which replaces the struct module_layout with a new
 struct module memory. The old data structure tried to put together all
 types of supported module memory types in one data structure, the new
 one abstracts the differences in memory types in a module to allow each
 one to provide their own set of details. This paves the way in the
 future so we can deal with them in a cleaner way. If you look at changes
 they also provide a nice cleanup of how we handle these different memory
 areas in a module. This change has been in linux-next since before the
 merge window opened for v6.3 so to provide more than a full kernel cycle
 of testing. It's a good thing as quite a bit of fixes have been found
 for it.
 
 Jason Baron then made dynamic debug a first class citizen module user by
 using module notifier callbacks to allocate / remove module specific
 dynamic debug information.
 
 Nick Alcock has done quite a bit of work cross-tree to remove module
 license tags from things which cannot possibly be module at my request
 so to:
 
   a) help him with his longer term tooling goals which require a
      deterministic evaluation if a piece a symbol code could ever be
      part of a module or not. But quite recently it is has been made
      clear that tooling is not the only one that would benefit.
      Disambiguating symbols also helps efforts such as live patching,
      kprobes and BPF, but for other reasons and R&D on this area
      is active with no clear solution in sight.
 
   b) help us inch closer to the now generally accepted long term goal
      of automating all the MODULE_LICENSE() tags from SPDX license tags
 
 In so far as a) is concerned, although module license tags are a no-op
 for non-modules, tools which would want create a mapping of possible
 modules can only rely on the module license tag after the commit
 8b41fc4454 ("kbuild: create modules.builtin without Makefile.modbuiltin
 or tristate.conf").  Nick has been working on this *for years* and
 AFAICT I was the only one to suggest two alternatives to this approach
 for tooling. The complexity in one of my suggested approaches lies in
 that we'd need a possible-obj-m and a could-be-module which would check
 if the object being built is part of any kconfig build which could ever
 lead to it being part of a module, and if so define a new define
 -DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've
 suggested would be to have a tristate in kconfig imply the same new
 -DPOSSIBLE_MODULE as well but that means getting kconfig symbol names
 mapping to modules always, and I don't think that's the case today. I am
 not aware of Nick or anyone exploring either of these options. Quite
 recently Josh Poimboeuf has pointed out that live patching, kprobes and
 BPF would benefit from resolving some part of the disambiguation as
 well but for other reasons. The function granularity KASLR (fgkaslr)
 patches were mentioned but Joe Lawrence has clarified this effort has
 been dropped with no clear solution in sight [1].
 
 In the meantime removing module license tags from code which could never
 be modules is welcomed for both objectives mentioned above. Some
 developers have also welcomed these changes as it has helped clarify
 when a module was never possible and they forgot to clean this up,
 and so you'll see quite a bit of Nick's patches in other pull
 requests for this merge window. I just picked up the stragglers after
 rc3. LWN has good coverage on the motivation behind this work [2] and
 the typical cross-tree issues he ran into along the way. The only
 concrete blocker issue he ran into was that we should not remove the
 MODULE_LICENSE() tags from files which have no SPDX tags yet, even if
 they can never be modules. Nick ended up giving up on his efforts due
 to having to do this vetting and backlash he ran into from folks who
 really did *not understand* the core of the issue nor were providing
 any alternative / guidance. I've gone through his changes and dropped
 the patches which dropped the module license tags where an SPDX
 license tag was missing, it only consisted of 11 drivers.  To see
 if a pull request deals with a file which lacks SPDX tags you
 can just use:
 
   ./scripts/spdxcheck.py -f \
 	$(git diff --name-only commid-id | xargs echo)
 
 You'll see a core module file in this pull request for the above,
 but that's not related to his changes. WE just need to add the SPDX
 license tag for the kernel/module/kmod.c file in the future but
 it demonstrates the effectiveness of the script.
 
 Most of Nick's changes were spread out through different trees,
 and I just picked up the slack after rc3 for the last kernel was out.
 Those changes have been in linux-next for over two weeks.
 
 The cleanups, debug code I added and final fix I added for modules
 were motivated by David Hildenbrand's report of boot failing on
 a systems with over 400 CPUs when KASAN was enabled due to running
 out of virtual memory space. Although the functional change only
 consists of 3 lines in the patch "module: avoid allocation if module is
 already present and ready", proving that this was the best we can
 do on the modules side took quite a bit of effort and new debug code.
 
 The initial cleanups I did on the modules side of things has been
 in linux-next since around rc3 of the last kernel, the actual final
 fix for and debug code however have only been in linux-next for about a
 week or so but I think it is worth getting that code in for this merge
 window as it does help fix / prove / evaluate the issues reported
 with larger number of CPUs. Userspace is not yet fixed as it is taking
 a bit of time for folks to understand the crux of the issue and find a
 proper resolution. Worst come to worst, I have a kludge-of-concept [3]
 of how to make kernel_read*() calls for modules unique / converge them,
 but I'm currently inclined to just see if userspace can fix this
 instead.
 
 [0] https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/
 [1] https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com
 [2] https://lwn.net/Articles/927569/
 [3] https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRG4m0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinQ2oP/0xlvKwJg6Ey8fHZF0qv8VOskE80zoLF
 hMazU3xfqLA+1TQvouW1YBxt3jwS3t1Ehs+NrV+nY9Yzcm0MzRX/n3fASJVe7nRr
 oqWWQU+voYl5Pw1xsfdp6C8IXpBQorpYby3Vp0MAMoZyl2W2YrNo36NV488wM9KC
 jD4HF5Z6xpnPSZTRR7AgW9mo7FdAtxPeKJ76Bch7lH8U6omT7n36WqTw+5B1eAYU
 YTOvrjRs294oqmWE+LeebyiOOXhH/yEYx4JNQgCwPdxwnRiGJWKsk5va0hRApqF/
 WW8dIqdEnjsa84lCuxnmWgbcPK8cgmlO0rT0DyneACCldNlldCW1LJ0HOwLk9pea
 p3JFAsBL7TKue4Tos6I7/4rx1ufyBGGIigqw9/VX5g0Iif+3BhWnqKRfz+p9wiMa
 Fl7cU6u7yC68CHu1HBSisK16cYMCPeOnTSd89upHj8JU/t74O6k/ARvjrQ9qmNUt
 c5U+OY+WpNJ1nXQydhY/yIDhFdYg8SSpNuIO90r4L8/8jRQYXNG80FDd1UtvVDuy
 eq0r2yZ8C0XHSlOT9QHaua/tWV/aaKtyC/c0hDRrigfUrq8UOlGujMXbUnrmrWJI
 tLJLAc7ePWAAoZXGSHrt0U27l029GzLwRdKqJ6kkDANVnTeOdV+mmBg9zGh3/Mp6
 agiwdHUMVN7X
 =56WK
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull module updates from Luis Chamberlain:
 "The summary of the changes for this pull requests is:

   - Song Liu's new struct module_memory replacement

   - Nick Alcock's MODULE_LICENSE() removal for non-modules

   - My cleanups and enhancements to reduce the areas where we vmalloc
     module memory for duplicates, and the respective debug code which
     proves the remaining vmalloc pressure comes from userspace.

  Most of the changes have been in linux-next for quite some time except
  the minor fixes I made to check if a module was already loaded prior
  to allocating the final module memory with vmalloc and the respective
  debug code it introduces to help clarify the issue. Although the
  functional change is small it is rather safe as it can only *help*
  reduce vmalloc space for duplicates and is confirmed to fix a bootup
  issue with over 400 CPUs with KASAN enabled. I don't expect stable
  kernels to pick up that fix as the cleanups would have also had to
  have been picked up. Folks on larger CPU systems with modules will
  want to just upgrade if vmalloc space has been an issue on bootup.

  Given the size of this request, here's some more elaborate details:

  The functional change change in this pull request is the very first
  patch from Song Liu which replaces the 'struct module_layout' with a
  new 'struct module_memory'. The old data structure tried to put
  together all types of supported module memory types in one data
  structure, the new one abstracts the differences in memory types in a
  module to allow each one to provide their own set of details. This
  paves the way in the future so we can deal with them in a cleaner way.
  If you look at changes they also provide a nice cleanup of how we
  handle these different memory areas in a module. This change has been
  in linux-next since before the merge window opened for v6.3 so to
  provide more than a full kernel cycle of testing. It's a good thing as
  quite a bit of fixes have been found for it.

  Jason Baron then made dynamic debug a first class citizen module user
  by using module notifier callbacks to allocate / remove module
  specific dynamic debug information.

  Nick Alcock has done quite a bit of work cross-tree to remove module
  license tags from things which cannot possibly be module at my request
  so to:

   a) help him with his longer term tooling goals which require a
      deterministic evaluation if a piece a symbol code could ever be
      part of a module or not. But quite recently it is has been made
      clear that tooling is not the only one that would benefit.
      Disambiguating symbols also helps efforts such as live patching,
      kprobes and BPF, but for other reasons and R&D on this area is
      active with no clear solution in sight.

   b) help us inch closer to the now generally accepted long term goal
      of automating all the MODULE_LICENSE() tags from SPDX license tags

  In so far as a) is concerned, although module license tags are a no-op
  for non-modules, tools which would want create a mapping of possible
  modules can only rely on the module license tag after the commit
  8b41fc4454 ("kbuild: create modules.builtin without
  Makefile.modbuiltin or tristate.conf").

  Nick has been working on this *for years* and AFAICT I was the only
  one to suggest two alternatives to this approach for tooling. The
  complexity in one of my suggested approaches lies in that we'd need a
  possible-obj-m and a could-be-module which would check if the object
  being built is part of any kconfig build which could ever lead to it
  being part of a module, and if so define a new define
  -DPOSSIBLE_MODULE [0].

  A more obvious yet theoretical approach I've suggested would be to
  have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as
  well but that means getting kconfig symbol names mapping to modules
  always, and I don't think that's the case today. I am not aware of
  Nick or anyone exploring either of these options. Quite recently Josh
  Poimboeuf has pointed out that live patching, kprobes and BPF would
  benefit from resolving some part of the disambiguation as well but for
  other reasons. The function granularity KASLR (fgkaslr) patches were
  mentioned but Joe Lawrence has clarified this effort has been dropped
  with no clear solution in sight [1].

  In the meantime removing module license tags from code which could
  never be modules is welcomed for both objectives mentioned above. Some
  developers have also welcomed these changes as it has helped clarify
  when a module was never possible and they forgot to clean this up, and
  so you'll see quite a bit of Nick's patches in other pull requests for
  this merge window. I just picked up the stragglers after rc3. LWN has
  good coverage on the motivation behind this work [2] and the typical
  cross-tree issues he ran into along the way. The only concrete blocker
  issue he ran into was that we should not remove the MODULE_LICENSE()
  tags from files which have no SPDX tags yet, even if they can never be
  modules. Nick ended up giving up on his efforts due to having to do
  this vetting and backlash he ran into from folks who really did *not
  understand* the core of the issue nor were providing any alternative /
  guidance. I've gone through his changes and dropped the patches which
  dropped the module license tags where an SPDX license tag was missing,
  it only consisted of 11 drivers. To see if a pull request deals with a
  file which lacks SPDX tags you can just use:

    ./scripts/spdxcheck.py -f \
	$(git diff --name-only commid-id | xargs echo)

  You'll see a core module file in this pull request for the above, but
  that's not related to his changes. WE just need to add the SPDX
  license tag for the kernel/module/kmod.c file in the future but it
  demonstrates the effectiveness of the script.

  Most of Nick's changes were spread out through different trees, and I
  just picked up the slack after rc3 for the last kernel was out. Those
  changes have been in linux-next for over two weeks.

  The cleanups, debug code I added and final fix I added for modules
  were motivated by David Hildenbrand's report of boot failing on a
  systems with over 400 CPUs when KASAN was enabled due to running out
  of virtual memory space. Although the functional change only consists
  of 3 lines in the patch "module: avoid allocation if module is already
  present and ready", proving that this was the best we can do on the
  modules side took quite a bit of effort and new debug code.

  The initial cleanups I did on the modules side of things has been in
  linux-next since around rc3 of the last kernel, the actual final fix
  for and debug code however have only been in linux-next for about a
  week or so but I think it is worth getting that code in for this merge
  window as it does help fix / prove / evaluate the issues reported with
  larger number of CPUs. Userspace is not yet fixed as it is taking a
  bit of time for folks to understand the crux of the issue and find a
  proper resolution. Worst come to worst, I have a kludge-of-concept [3]
  of how to make kernel_read*() calls for modules unique / converge
  them, but I'm currently inclined to just see if userspace can fix this
  instead"

Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0]
Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1]
Link: https://lwn.net/Articles/927569/ [2]
Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3]

* tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits)
  module: add debugging auto-load duplicate module support
  module: stats: fix invalid_mod_bytes typo
  module: remove use of uninitialized variable len
  module: fix building stats for 32-bit targets
  module: stats: include uapi/linux/module.h
  module: avoid allocation if module is already present and ready
  module: add debug stats to help identify memory pressure
  module: extract patient module check into helper
  modules/kmod: replace implementation with a semaphore
  Change DEFINE_SEMAPHORE() to take a number argument
  module: fix kmemleak annotations for non init ELF sections
  module: Ignore L0 and rename is_arm_mapping_symbol()
  module: Move is_arm_mapping_symbol() to module_symbol.h
  module: Sync code of is_arm_mapping_symbol()
  scripts/gdb: use mem instead of core_layout to get the module address
  interconnect: remove module-related code
  interconnect: remove MODULE_LICENSE in non-modules
  zswap: remove MODULE_LICENSE in non-modules
  zpool: remove MODULE_LICENSE in non-modules
  x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
  ...
2023-04-27 16:36:55 -07:00
Chuck Lever b3cbf98e2f SUNRPC: Support TLS handshake in the server-side TCP socket code
This patch adds opportunitistic RPC-with-TLS to the Linux in-kernel
NFS server. If the client requests RPC-with-TLS and the user space
handshake agent is running, the server will set up a TLS session.

There are no policy settings yet. For example, the server cannot
yet require the use of RPC-with-TLS to access its data.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever 695bc1f32c SUNRPC: Clear rq_xid when receiving a new RPC Call
This is an eye-catcher for tracepoints that record the XID: it means
svc_rqst() has not received a full RPC Call with an XID yet.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever 5e052dda12 SUNRPC: Recognize control messages in server-side TCP socket code
To support kTLS, the server-side TCP socket receive path needs to
watch for CMSGs.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever 6a0cdf56bf SUNRPC: Be even lazier about releasing pages
A single RPC transaction that touches only a couple of pages means
rq_pvec will not be even close to full in svc_xpt_release(). This is
a common case.

Instead, just leave the pages in rq_pvec until it is completely
full. This improves the efficiency of the batch release mechanism
on workloads that involve small RPC messages.

The rq_pvec is also fully emptied just before thread exit.

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:23 -04:00
Kal Conley 6ec7be9a2d xsk: Use pool->dma_pages to check for DMA
Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
active DMA mapping. pool->dma_pages needs to be read anyway to access
the map so this compiles to more efficient code.

Signed-off-by: Kal Conley <kal.conley@dectris.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230423180157.93559-1-kal.conley@dectris.com
2023-04-27 22:24:51 +02:00
Linus Torvalds 556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firwmare loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
 LEGadNS38k5fs+73UaxV
 =7K4B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Pedro Tammela 1b483d9f58 net/sched: act_pedit: free pedit keys on bail from offset check
Ido Schimmel reports a memleak on a syzkaller instance:
   BUG: memory leak
   unreferenced object 0xffff88803d45e400 (size 1024):
     comm "syz-executor292", pid 563, jiffies 4295025223 (age 51.781s)
     hex dump (first 32 bytes):
       28 bd 70 00 fb db df 25 02 00 14 1f ff 02 00 02  (.p....%........
       00 32 00 00 1f 00 00 00 ac 14 14 3e 08 00 07 00  .2.........>....
     backtrace:
       [<ffffffff81bd0f2c>] kmemleak_alloc_recursive include/linux/kmemleak.h:42 [inline]
       [<ffffffff81bd0f2c>] slab_post_alloc_hook mm/slab.h:772 [inline]
       [<ffffffff81bd0f2c>] slab_alloc_node mm/slub.c:3452 [inline]
       [<ffffffff81bd0f2c>] __kmem_cache_alloc_node+0x25c/0x320 mm/slub.c:3491
       [<ffffffff81a865d9>] __do_kmalloc_node mm/slab_common.c:966 [inline]
       [<ffffffff81a865d9>] __kmalloc+0x59/0x1a0 mm/slab_common.c:980
       [<ffffffff83aa85c3>] kmalloc include/linux/slab.h:584 [inline]
       [<ffffffff83aa85c3>] tcf_pedit_init+0x793/0x1ae0 net/sched/act_pedit.c:245
       [<ffffffff83a90623>] tcf_action_init_1+0x453/0x6e0 net/sched/act_api.c:1394
       [<ffffffff83a90e58>] tcf_action_init+0x5a8/0x950 net/sched/act_api.c:1459
       [<ffffffff83a96258>] tcf_action_add+0x118/0x4e0 net/sched/act_api.c:1985
       [<ffffffff83a96997>] tc_ctl_action+0x377/0x490 net/sched/act_api.c:2044
       [<ffffffff83920a8d>] rtnetlink_rcv_msg+0x46d/0xd70 net/core/rtnetlink.c:6395
       [<ffffffff83b24305>] netlink_rcv_skb+0x185/0x490 net/netlink/af_netlink.c:2575
       [<ffffffff83901806>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:6413
       [<ffffffff83b21cae>] netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
       [<ffffffff83b21cae>] netlink_unicast+0x5be/0x8a0 net/netlink/af_netlink.c:1365
       [<ffffffff83b2293f>] netlink_sendmsg+0x9af/0xed0 net/netlink/af_netlink.c:1942
       [<ffffffff8380c39f>] sock_sendmsg_nosec net/socket.c:724 [inline]
       [<ffffffff8380c39f>] sock_sendmsg net/socket.c:747 [inline]
       [<ffffffff8380c39f>] ____sys_sendmsg+0x3ef/0xaa0 net/socket.c:2503
       [<ffffffff838156d2>] ___sys_sendmsg+0x122/0x1c0 net/socket.c:2557
       [<ffffffff8381594f>] __sys_sendmsg+0x11f/0x200 net/socket.c:2586
       [<ffffffff83815ab0>] __do_sys_sendmsg net/socket.c:2595 [inline]
       [<ffffffff83815ab0>] __se_sys_sendmsg net/socket.c:2593 [inline]
       [<ffffffff83815ab0>] __x64_sys_sendmsg+0x80/0xc0 net/socket.c:2593

The recently added static offset check missed a free to the key buffer when
bailing out on error.

Fixes: e1201bc781 ("net/sched: act_pedit: check static offsets a priori")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20230425144725.669262-1-pctammela@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-27 11:43:29 +02:00
Ivan Vecera 32eff6bace net/sched: flower: Fix wrong handle assignment during filter change
Commit 08a0063df3 ("net/sched: flower: Move filter handle initialization
earlier") moved filter handle initialization but an assignment of
the handle to fnew->handle is done regardless of fold value. This is wrong
because if fold != NULL (so fold->handle == handle) no new handle is
allocated and passed handle is assigned to fnew->handle. Then if any
subsequent action in fl_change() fails then the handle value is
removed from IDR that is incorrect as we will have still valid old filter
instance with handle that is not present in IDR.
Fix this issue by moving the assignment so it is done only when passed
fold == NULL.

Prior the patch:
[root@machine tc-testing]# ./tdc.py -d enp1s0f0np0 -e 14be
Test 14be: Concurrently replace same range of 100k flower filters from 10 tc instances
exit: 123
exit: 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Command failed tmp/replace_6:1885

All test results:

1..1
not ok 1 14be - Concurrently replace same range of 100k flower filters from 10 tc instances
        Command exited with 123, expected 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Command failed tmp/replace_6:1885

After the patch:
[root@machine tc-testing]# ./tdc.py -d enp1s0f0np0 -e 14be
Test 14be: Concurrently replace same range of 100k flower filters from 10 tc instances

All test results:

1..1
ok 1 14be - Concurrently replace same range of 100k flower filters from 10 tc instances

Fixes: 08a0063df3 ("net/sched: flower: Move filter handle initialization earlier")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230425140604.169881-1-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-27 10:31:58 +02:00
David Howells 2b5fdc0f5c rxrpc: Fix potential data race in rxrpc_wait_to_be_connected()
Inside the loop in rxrpc_wait_to_be_connected() it checks call->error to
see if it should exit the loop without first checking the call state.  This
is probably safe as if call->error is set, the call is dead anyway, but we
should probably wait for the call state to have been set to completion
first, lest it cause surprise on the way out.

Fix this by only accessing call->error if the call is complete.  We don't
actually need to access the error inside the loop as we'll do that after.

This caused the following report:

    BUG: KCSAN: data-race in rxrpc_send_data / rxrpc_set_call_completion

    write to 0xffff888159cf3c50 of 4 bytes by task 25673 on cpu 1:
     rxrpc_set_call_completion+0x71/0x1c0 net/rxrpc/call_state.c:22
     rxrpc_send_data_packet+0xba9/0x1650 net/rxrpc/output.c:479
     rxrpc_transmit_one+0x1e/0x130 net/rxrpc/output.c:714
     rxrpc_decant_prepared_tx net/rxrpc/call_event.c:326 [inline]
     rxrpc_transmit_some_data+0x496/0x600 net/rxrpc/call_event.c:350
     rxrpc_input_call_event+0x564/0x1220 net/rxrpc/call_event.c:464
     rxrpc_io_thread+0x307/0x1d80 net/rxrpc/io_thread.c:461
     kthread+0x1ac/0x1e0 kernel/kthread.c:376
     ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

    read to 0xffff888159cf3c50 of 4 bytes by task 25672 on cpu 0:
     rxrpc_send_data+0x29e/0x1950 net/rxrpc/sendmsg.c:296
     rxrpc_do_sendmsg+0xb7a/0xc20 net/rxrpc/sendmsg.c:726
     rxrpc_sendmsg+0x413/0x520 net/rxrpc/af_rxrpc.c:565
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg net/socket.c:747 [inline]
     ____sys_sendmsg+0x375/0x4c0 net/socket.c:2501
     ___sys_sendmsg net/socket.c:2555 [inline]
     __sys_sendmmsg+0x263/0x500 net/socket.c:2641
     __do_sys_sendmmsg net/socket.c:2670 [inline]
     __se_sys_sendmmsg net/socket.c:2667 [inline]
     __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2667
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    value changed: 0x00000000 -> 0xffffffea

Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: syzbot+ebc945fdb4acd72cba78@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000e7c6d205fa10a3cd@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Dmitry Vyukov <dvyukov@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/508133.1682427395@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-27 09:17:41 +02:00
Linus Torvalds 6e98b09da9 Networking changes for 6.4.
Core
 ----
 
  - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
    default value allows for better BIG TCP performances.
 
  - Reduce compound page head access for zero-copy data transfers.
 
  - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
 
  - Threaded NAPI improvements, adding defer skb free support and unneeded
    softirq avoidance.
 
  - Address dst_entry reference count scalability issues, via false
    sharing avoidance and optimize refcount tracking.
 
  - Add lockless accesses annotation to sk_err[_soft].
 
  - Optimize again the skb struct layout.
 
  - Extends the skb drop reasons to make it usable by multiple
    subsystems.
 
  - Better const qualifier awareness for socket casts.
 
 BPF
 ---
 
  - Add skb and XDP typed dynptrs which allow BPF programs for more
    ergonomic and less brittle iteration through data and variable-sized
    accesses.
 
  - Add a new BPF netfilter program type and minimal support to hook
    BPF programs to netfilter hooks such as prerouting or forward.
 
  - Add more precise memory usage reporting for all BPF map types.
 
  - Adds support for using {FOU,GUE} encap with an ipip device operating
    in collect_md mode and add a set of BPF kfuncs for controlling encap
    params.
 
  - Allow BPF programs to detect at load time whether a particular kfunc
    exists or not, and also add support for this in light skeleton.
 
  - Bigger batch of BPF verifier improvements to prepare for upcoming BPF
    open-coded iterators allowing for less restrictive looping capabilities.
 
  - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
    programs to NULL-check before passing such pointers into kfunc.
 
  - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
    local storage maps.
 
  - Enable RCU semantics for task BPF kptrs and allow referenced kptr
    tasks to be stored in BPF maps.
 
  - Add support for refcounted local kptrs to the verifier for allowing
    shared ownership, useful for adding a node to both the BPF list and
    rbtree.
 
  - Add BPF verifier support for ST instructions in convert_ctx_access()
    which will help new -mcpu=v4 clang flag to start emitting them.
 
  - Add ARM32 USDT support to libbpf.
 
  - Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations.
 
 Protocols
 ---------
 
  - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
    indicates the provenance of the IP address.
 
  - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
 
  - Add the handshake upcall mechanism, allowing the user-space
    to implement generic TLS handshake on kernel's behalf.
 
  - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
    resilience to nodes failures.
 
  - SCTP: add support for Fair Capacity and Weighted Fair Queueing
    schedulers.
 
  - MPTCP: delay first subflow allocation up to its first usage. This
    will allow for later better LSM interaction.
 
  - xfrm: Remove inner/outer modes from input/output path. These are
    not needed anymore.
 
  - WiFi:
    - reduced neighbor report (RNR) handling for AP mode
    - HW timestamping support
    - support for randomized auth/deauth TA for PASN privacy
    - per-link debugfs for multi-link
    - TC offload support for mac80211 drivers
    - mac80211 mesh fast-xmit and fast-rx support
    - enable Wi-Fi 7 (EHT) mesh support
 
 Netfilter
 ---------
 
  - Add nf_tables 'brouting' support, to force a packet to be routed
    instead of being bridged.
 
  - Update bridge netfilter and ovs conntrack helpers to handle
    IPv6 Jumbo packets properly, i.e. fetch the packet length
    from hop-by-hop extension header. This is needed for BIT TCP
    support.
 
  - The iptables 32bit compat interface isn't compiled in by default
    anymore.
 
  - Move ip(6)tables builtin icmp matches to the udptcp one.
    This has the advantage that icmp/icmpv6 match doesn't load the
    iptables/ip6tables modules anymore when iptables-nft is used.
 
  - Extended netlink error report for netdevice in flowtables and
    netdev/chains. Allow for incrementally add/delete devices to netdev
    basechain. Allow to create netdev chain without device.
 
 Driver API
 ----------
 
  - Remove redundant Device Control Error Reporting Enable, as PCI core
    has already error reporting enabled at enumeration time.
 
  - Move Multicast DB netlink handlers to core, allowing devices other
    then bridge to use them.
 
  - Allow the page_pool to directly recycle the pages from safely
    localized NAPI.
 
  - Implement lockless TX queue stop/wake combo macros, allowing for
    further code de-duplication and sanitization.
 
  - Add YNL support for user headers and struct attrs.
 
  - Add partial YNL specification for devlink.
 
  - Add partial YNL specification for ethtool.
 
  - Add tc-mqprio and tc-taprio support for preemptible traffic classes.
 
  - Add tx push buf len param to ethtool, specifies the maximum number
    of bytes of a transmitted packet a driver can push directly to the
    underlying device.
 
  - Add basic LED support for switch/phy.
 
  - Add NAPI documentation, stop relaying on external links.
 
  - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
    work to make the hardware timestamping layer selectable by user
    space.
 
  - Add transceiver support and improve the error messages for CAN-FD
    controllers.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - AMD/Pensando core device support
    - MediaTek MT7981 SoC
    - MediaTek MT7988 SoC
    - Broadcom BCM53134 embedded switch
    - Texas Instruments CPSW9G ethernet switch
    - Qualcomm EMAC3 DWMAC ethernet
    - StarFive JH7110 SoC
    - NXP CBTX ethernet PHY
 
  - WiFi:
    - Apple M1 Pro/Max devices
    - RealTek rtl8710bu/rtl8188gu
    - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
 
  - Bluetooth:
    - Realtek RTL8821CS, RTL8851B, RTL8852BS
    - Mediatek MT7663, MT7922
    - NXP w8997
    - Actions Semi ATS2851
    - QTI WCN6855
    - Marvell 88W8997
 
  - Can:
    - STMicroelectronics bxcan stm32f429
 
 Drivers
 -------
  - Ethernet NICs:
    - Intel (1G, icg):
      - add tracking and reporting of QBV config errors.
      - add support for configuring max SDU for each Tx queue.
    - Intel (100G, ice):
      - refactor mailbox overflow detection to support Scalable IOV
      - GNSS interface optimization
    - Intel (i40e):
      - support XDP multi-buffer
    - nVidia/Mellanox:
      - add the support for linux bridge multicast offload
      - enable TC offload for egress and engress MACVLAN over bond
      - add support for VxLAN GBP encap/decap flows offload
      - extend packet offload to fully support libreswan
      - support tunnel mode in mlx5 IPsec packet offload
      - extend XDP multi-buffer support
      - support MACsec VLAN offload
      - add support for dynamic msix vectors allocation
      - drop RX page_cache and fully use page_pool
      - implement thermal zone to report NIC temperature
    - Netronome/Corigine:
      - add support for multi-zone conntrack offload
    - Solarflare/Xilinx:
      - support offloading TC VLAN push/pop actions to the MAE
      - support TC decap rules
      - support unicast PTP
 
  - Other NICs:
    - Broadcom (bnxt): enforce software based freq adjustments only
 		on shared PHC NIC
    - RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
    - Micrel (lan8841): add support for PTP_PF_PEROUT
    - Cadence (macb): enable PTP unicast
    - Engleder (tsnep): add XDP socket zero-copy support
    - virtio-net: implement exact header length guest feature
    - veth: add page_pool support for page recycling
    - vxlan: add MDB data path support
    - gve: add XDP support for GQI-QPL format
    - geneve: accept every ethertype
    - macvlan: allow some packets to bypass broadcast queue
    - mana: add support for jumbo frame
 
  - Ethernet high-speed switches:
    - Microchip (sparx5): Add support for TC flower templates.
 
  - Ethernet embedded switches:
    - Broadcom (b54):
      - configure 6318 and 63268 RGMII ports
    - Marvell (mv88e6xxx):
      - faster C45 bus scan
    - Microchip:
      - lan966x:
        - add support for IS1 VCAP
        - better TX/RX from/to CPU performances
      - ksz9477: add ETS Qdisc support
      - ksz8: enhance static MAC table operations and error handling
      - sama7g5: add PTP capability
    - NXP (ocelot):
      - add support for external ports
      - add support for preemptible traffic classes
    - Texas Instruments:
      - add CPSWxG SGMII support for J7200 and J721E
 
  - Intel WiFi (iwlwifi):
    - preparation for Wi-Fi 7 EHT and multi-link support
    - EHT (Wi-Fi 7) sniffer support
    - hardware timestamping support for some devices/firwmares
    - TX beacon protection on newer hardware
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - MU-MIMO parameters support
    - ack signal support for management packets
 
  - RealTek WiFi (rtw88):
    - SDIO bus support
    - better support for some SDIO devices
      (e.g. MAC address from efuse)
 
  - RealTek WiFi (rtw89):
    - HW scan support for 8852b
    - better support for 6 GHz scanning
    - support for various newer firmware APIs
    - framework firmware backwards compatibility
 
  - MediaTek WiFi (mt76):
    - P2P support
    - mesh A-MSDU support
    - EHT (Wi-Fi 7) support
    - coredump support
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
 Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
 Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
 HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
 eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
 pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
 F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
 vP7ugFnttneg
 =ITVa
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performances

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimize refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extends the skb drop reasons to make it usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Adds support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to nodes failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIT TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow for incrementally add/delete devices to netdev
     basechain. Allow to create netdev chain without device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     then bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relaying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is a
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, icg):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and engress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to addess ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b54):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performances
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firwmares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
2023-04-26 16:07:23 -07:00
Chuck Lever 647a2a6428 SUNRPC: Convert svc_xprt_release() to the release_pages() API
Instead of invoking put_page() one-at-a-time, pass the "response"
portion of rq_pages directly to release_pages() to reduce the number
of times each nfsd thread invokes a page allocator API.

Since svc_xprt_release() is not invoked while a client is waiting
for an RPC Reply, this is not expected to directly impact mean
request latencies on a lightly or moderately loaded server. However
as workload intensity increases, I expect somewhat better
scalability: the same number of server threads should be able to
handle more work.

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:02 -04:00
Chuck Lever b20cb39def SUNRPC: Relocate svc_free_res_pages()
Clean-up: There doesn't seem to be a reason why this function is
stuck in a header. One thing it prevents is the convenient addition
of tracing. Moving it to a source file also makes the rq_respages
clean-up logic easier to find.

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:01 -04:00
Chuck Lever 55fcc7d915 SUNRPC: Ignore return value of ->xpo_sendto
Clean up: All callers of svc_process() ignore its return value, so
svc_process() can safely be converted to return void. Ditto for
svc_send().

The return value of ->xpo_sendto() is now used only as part of a
trace event.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:01 -04:00
Chuck Lever ae0d77708a SUNRPC: Ensure server-side sockets have a sock->file
The TLS handshake upcall mechanism requires a non-NULL sock->file on
the socket it hands to user space. svc_sock_free() already releases
sock->file properly if one exists.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:01 -04:00
Chuck Lever 0f5162480b NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
There have been several bugs over the years where the NFSD splice
actor has attempted to write outside the rq_pages array.

This is a "should never happen" condition, but if for some reason
the pipe splice actor should attempt to walk past the end of
rq_pages, it needs to terminate the READ operation to prevent
corruption of the pointer addresses in the fields just beyond the
array.

A server crash is thus prevented. Since the code is not behaving,
the READ operation returns -EIO to the client. None of the READ
payload data can be trusted if the splice actor isn't operating as
expected.

Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2023-04-26 09:05:01 -04:00
Luis Chamberlain 376bcd9b37 sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:01 -04:00
NeilBrown cf64b9bce9 SUNRPC: return proper error from get_expiry()
The get_expiry() function currently returns a timestamp, and uses the
special return value of 0 to indicate an error.

Unfortunately this causes a problem when 0 is the correct return value.

On a system with no RTC it is possible that the boot time will be seen
to be "3".  When exportfs probes to see if a particular filesystem
supports NFS export it tries to cache information with an expiry time of
"3".  The intention is for this to be "long in the past".  Even with no
RTC it will not be far in the future (at most a second or two) so this
is harmless.
But if the boot time happens to have been calculated to be "3", then
get_expiry will fail incorrectly as it converts the number to "seconds
since bootime" - 0.

To avoid this problem we change get_expiry() to report the error quite
separately from the expiry time.  The error is now the return value.
The expiry time is reported through a by-reference parameter.

Reported-by: Jerry Zhang <jerry@skydio.com>
Tested-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Paolo Abeni c248b27cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26 10:17:46 +02:00
Linus Torvalds 793582ff47 ACPI updates for 6.4-rc1
- Update the ACPICA code in the kernel to upstream revision 20230331
    including the following changes:
    * Delete bogus node_array array of pointers from AEST table (Jessica
      Clarke).
    * Add support for trace buffer extension in GICC to the ACPI MADT
      parser (Xiongfeng Wang).
    * Add missing macro ACPI_FUNCTION_TRACE() for acpi_ns_repair_HID()
      (Xiongfeng Wang).
    * Add missing tables to astable (Pedro Falcato).
    * Add support for 64 bit loong_arch compilation to ACPICA (Huacai
      Chen).
    * Add support for ASPT table in disassembler to ACPICA (Jeremi
      Piotrowski).
    * Add support for Arm's MPAM ACPI table version 2 (Hesham Almatary).
    * Update all copyrights/signons in ACPICA to 2023 (Bob Moore).
    * Add support for ClockInput resource (v6.5) (Niyas Sait).
    * Add RISC-V INTC interrupt controller definition to the list of
      supported interrupt controllers for MADT (Sunil V L).
    * Add structure definitions for the RISC-V RHCT ACPI table (Sunil V L).
    * Address several cases in which the ACPICA code might lead to
      undefined behavior (Tamir Duberstein).
    * Make ACPICA code support flexible arrays properly (Kees Cook).
    * Check null return of ACPI_ALLOCATE_ZEROED in
      acpi_db_display_objects() (void0red).
    * Add os specific support for Zephyr RTOS to ACPICA (Najumon).
    * Update version to 20230331 (Bob Moore).
 
  - Fix evaluating the _PDC ACPI control method when running as Xen
    dom0 (Roger Pau Monne).
 
  - Use platform devices to load ACPI PPC and PCC drivers (Petr Pavlu).
 
  - Check for null return of devm_kzalloc() in fch_misc_setup() (Kang
    Chen).
 
  - Log a message if enable_irq_wake() fails for the ACPI SCI (Simon
    Gaiser).
 
  - Initialize the correct IOMMU fwspec while parsing ACPI VIOT
    (Jean-Philippe Brucker).
 
  - Amend indentation and prefix error messages with FW_BUG in the ACPI
    SPCR parsing code (Andy Shevchenko).
 
  - Enable ACPI sysfs support for CCEL records (Kuppuswamy
    Sathyanarayanan).
 
  - Make the APEI error injection code warn on invalid arguments when
    explicitly indicated by platform (Shuai Xue).
 
  - Add CXL error types to the error injection code in APEI (Tony Luck).
 
  - Refactor acpi_data_prop_read_single() (Andy Shevchenko).
 
  - Fix two issues in the ACPI SBS driver (Armin Wolf).
 
  - Replace ternary operator with min_t() in the generic ACPI thermal
    zone driver (Jiangshan Yi).
 
  - Ensure that ACPI notify handlers are not running after removal and
    clean up code in acpi_sb_notify() (Rafael Wysocki).
 
  - Remove register_backlight_delay module option and code and remove
    quirks for false-positive backlight control support advertised on
    desktop boards (Hans de Goede).
 
  - Replace irqdomain.h include with struct declarations in ACPI headers
    and update several pieces of code previously including of.h
    implicitly through those headers (Rob Herring).
 
  - Fix acpi_evaluate_dsm_typed() redefinition error (Kiran K).
 
  - Update the pm_profile sysfs attribute documentation (Rafael Wysocki).
 
  - Add 80862289 ACPI _HID for second PWM controller on Cherry Trail to
    the ACPI driver for Intel SoCs (Hans de Goede).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmRGvLQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxoV4P/jxWGAdldtgXORR58lKGbSs6lx/0Y+SF
 iI7qK88NcbcbWS+a3PqRrisNkjN17rjzajfp28Ue2CXFxzwTViyw6KYELbPJ6N/h
 /3prem++jKgf7qiueDJG/AyO8N2+Z+yciubhxdMiK1+c1dZM2ycwSyBzJgYocpXn
 fH+YFPhxE7c8Z8doBrTOZjRuU4SIEKCmxo3c5BbCuyVZkbqCRdQMIDCiBJgLTmbo
 z4pu9OFhAamB8Cth2QFfRbZWqmuY71Gt54+c4ITPPV2ALlLUYODyHZoSISBJULp3
 k0lU/hMCD+i1WRwv+Bb6of7pJPM4Lqp+wOirAtiiibjE9LRxVTNyOUAHLXbx+t2V
 PN8JKVJVCLaZO6TRELgFIL4nh4aBdOtr4BuaLnClZho9bG68jEkc8grnOZYhFYtM
 66BuJBW30rwwGY4N5VSZGzFFR7l2qaHIOSHdq681bxQ3e6erFEeIc5jQVEOKgCqd
 XWdELVkqf3CnCX0lgonj+AgoeCqOpYdrNcWqMsJ+6OyQRoFhLFltDSPeJm9gHGO7
 X+qCQru4ZgEDKexWKpGgH9x8AllDKbh/ApyyumXgsQOsRocVdoNaf+yCBlaaDyqu
 UYif6hgFYnIxF2Fg1r/POgHDXFobE4iUTHcUU1V2QhuByc4PkN9ljKsHeC2FgVUz
 JityWRiMABNv
 =O61K
 -----END PGP SIGNATURE-----

Merge tag 'acpi-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
 "These update the ACPICA code in the kernel to upstream revision
  20230331, fix the ACPI SBS driver and the evaluation of the _PDC
  method on Xen dom0 in the ACPI processor driver, update the ACPI
  driver for Intel SoCs and clean up code in multiple places.

  Specifics:

   - Update the ACPICA code in the kernel to upstream revision 20230331
     including the following changes:
       * Delete bogus node_array array of pointers from AEST table
         (Jessica Clarke)
       * Add support for trace buffer extension in GICC to the ACPI MADT
         parser (Xiongfeng Wang)
       * Add missing macro ACPI_FUNCTION_TRACE() for
         acpi_ns_repair_HID() (Xiongfeng Wang)
       * Add missing tables to astable (Pedro Falcato)
       * Add support for 64 bit loong_arch compilation to ACPICA (Huacai
         Chen)
       * Add support for ASPT table in disassembler to ACPICA (Jeremi
         Piotrowski)
       * Add support for Arm's MPAM ACPI table version 2 (Hesham
         Almatary)
       * Update all copyrights/signons in ACPICA to 2023 (Bob Moore)
       * Add support for ClockInput resource (v6.5) (Niyas Sait)
       * Add RISC-V INTC interrupt controller definition to the list of
         supported interrupt controllers for MADT (Sunil V L)
       * Add structure definitions for the RISC-V RHCT ACPI table (Sunil
         V L)
       * Address several cases in which the ACPICA code might lead to
         undefined behavior (Tamir Duberstein)
       * Make ACPICA code support flexible arrays properly (Kees Cook)
       * Check null return of ACPI_ALLOCATE_ZEROED in
         acpi_db_display_objects() (void0red)
       * Add os specific support for Zephyr RTOS to ACPICA (Najumon)
       * Update version to 20230331 (Bob Moore)

   - Fix evaluating the _PDC ACPI control method when running as Xen
     dom0 (Roger Pau Monne)

   - Use platform devices to load ACPI PPC and PCC drivers (Petr Pavlu)

   - Check for null return of devm_kzalloc() in fch_misc_setup() (Kang
     Chen)

   - Log a message if enable_irq_wake() fails for the ACPI SCI (Simon
     Gaiser)

   - Initialize the correct IOMMU fwspec while parsing ACPI VIOT
     (Jean-Philippe Brucker)

   - Amend indentation and prefix error messages with FW_BUG in the ACPI
     SPCR parsing code (Andy Shevchenko)

   - Enable ACPI sysfs support for CCEL records (Kuppuswamy
     Sathyanarayanan)

   - Make the APEI error injection code warn on invalid arguments when
     explicitly indicated by platform (Shuai Xue)

   - Add CXL error types to the error injection code in APEI (Tony Luck)

   - Refactor acpi_data_prop_read_single() (Andy Shevchenko)

   - Fix two issues in the ACPI SBS driver (Armin Wolf)

   - Replace ternary operator with min_t() in the generic ACPI thermal
     zone driver (Jiangshan Yi)

   - Ensure that ACPI notify handlers are not running after removal and
     clean up code in acpi_sb_notify() (Rafael Wysocki)

   - Remove register_backlight_delay module option and code and remove
     quirks for false-positive backlight control support advertised on
     desktop boards (Hans de Goede)

   - Replace irqdomain.h include with struct declarations in ACPI
     headers and update several pieces of code previously including of.h
     implicitly through those headers (Rob Herring)

   - Fix acpi_evaluate_dsm_typed() redefinition error (Kiran K)

   - Update the pm_profile sysfs attribute documentation (Rafael
     Wysocki)

   - Add 80862289 ACPI _HID for second PWM controller on Cherry Trail to
     the ACPI driver for Intel SoCs (Hans de Goede)"

* tag 'acpi-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits)
  ACPI: LPSS: Add 80862289 ACPI _HID for second PWM controller on Cherry Trail
  ACPI: bus: Ensure that notify handlers are not running after removal
  ACPI: bus: Add missing braces to acpi_sb_notify()
  ACPI: video: Remove desktops without backlight DMI quirks
  ACPI: video: Remove register_backlight_delay module option and code
  ACPI: Replace irqdomain.h include with struct declarations
  fpga: lattice-sysconfig-spi: Add explicit include for of.h
  tpm: atmel: Add explicit include for of.h
  virtio-mmio: Add explicit include for of.h
  pata: ixp4xx: Add explicit include for of.h
  ata: pata_macio: Add explicit include of irqdomain.h
  serial: 8250_tegra: Add explicit include for of.h
  net: rfkill-gpio: Add explicit include for of.h
  staging: iio: resolver: ad2s1210: Add explicit include for of.h
  iio: adc: ad7292: Add explicit include for of.h
  ACPICA: Update version to 20230331
  ACPICA: add os specific support for Zephyr RTOS
  ACPICA: ACPICA: check null return of ACPI_ALLOCATE_ZEROED in acpi_db_display_objects
  ACPICA: acpi_resource_irq: Replace 1-element arrays with flexible array
  ACPICA: acpi_madt_oem_data: Fix flexible array member definition
  ...
2023-04-25 18:37:41 -07:00
Linus Torvalds 53b5e72b9d asm-generic updates for 6.4
These are various cleanups, fixing a number of uapi header files to no
 longer reference CONFIG_* symbols, and one patch that introduces the
 new CONFIG_HAS_IOPORT symbol for architectures that provide working
 inb()/outb() macros, as a preparation for adding driver dependencies
 on those in the following release.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmRG8IkACgkQYKtH/8kJ
 Uid15Q/9E/neIIEqEk6IvtyhUicrJiIZUM0rGoYtWXiz75ggk6Kx9+3I+j8zIQ/E
 kf2TzAG7q9Md7nfTDFLr4FSr0IcNDj+VG4nYxUyDHdKGcARO+g9Kpdvscxip3lgU
 Rw5w74Gyd30u4iUKGS39OYuxcCgl9LaFjMA9Gh402Oiaoh+OYLmgQS9h/goUD5KN
 Nd+AoFvkdbnHl0/SpxthLRyL5rFEATBmAY7apYViPyMvfjS3gfDJwXJR9jkKgi6X
 Qs4t8Op8BA3h84dCuo6VcFqgAJs2Wiq3nyTSUnkF8NxJ2RFTpeiVgfsLOzXHeDgz
 SKDB4Lp14o3mlyZyj00MWq1uMJRRetUgNiVb6iHOoKQ/E4demBdh+mhIFRybjM5B
 XNTWFcg9PWFCMa4W9jnLfZBc881X4+7T+qUF8I0W/1AbRJUmyGj8HO6jLceC4yGD
 UYLn5oFPM6OWXHp6DqJrCr9Yw8h6fuviQZFEbl/ARlgVGt+J4KbYweJYk8DzfX6t
 PZIj8LskOqyIpRuC2oDA1PHxkaJ1/z+N5oRBHq1uicSh4fxY5HW7HnyzgF08+R3k
 cf+fjAhC3TfGusHkBwQKQJvpxrxZjPuvYXDZ0GxTvNKJRB8eMeiTm1n41E5oTVwQ
 swSblSCjZj/fMVVPXLcjxEW4SBNWRxa9Lz3tIPXb3RheU10Lfy8=
 =H3k4
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "These are various cleanups, fixing a number of uapi header files to no
  longer reference CONFIG_* symbols, and one patch that introduces the
  new CONFIG_HAS_IOPORT symbol for architectures that provide working
  inb()/outb() macros, as a preparation for adding driver dependencies
  on those in the following release"

* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  Kconfig: introduce HAS_IOPORT option and select it as necessary
  scripts: Update the CONFIG_* ignore list in headers_install.sh
  pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
  Move bp_type_idx to include/linux/hw_breakpoint.h
  Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
  Move COMPAT_ATM_ADDPARTY to net/atm/svc.c
2023-04-25 12:22:11 -07:00
Kuniyuki Iwashima 50749f2dd6 tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
skbs.  We can reproduce the problem with these sequences:

  sk = socket(AF_INET, SOCK_DGRAM, 0)
  sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
  sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
  sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
  sk.close()

sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
skb is sent, __skb_tstamp_tx() clones it and puts the clone into
the socket's error queue with the TX timestamp.

When the original skb is received locally, skb_copy_ubufs() calls
skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
This additional count is decremented while freeing the skb, but struct
ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
not called.

The last refcnt is not released unless we retrieve the TX timestamped
skb by recvmsg().  Since we clear the error queue in inet_sock_destruct()
after the socket's refcnt reaches 0, there is a circular dependency.
If we close() the socket holding such skbs, we never call sock_put()
and leak the count, sk, and skb.

TCP has the same problem, and commit e0c8bccd40 ("net: stream:
purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
by calling skb_queue_purge() during close().  However, there is a
small chance that skb queued in a qdisc or device could be put
into the error queue after the skb_queue_purge() call.

In __skb_tstamp_tx(), the cloned skb should not have a reference
to the ubuf to remove the circular dependency, but skb_clone() does
not call skb_copy_ubufs() for zerocopy skb.  So, we need to call
skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().

[0]:
BUG: memory leak
unreferenced object 0xffff88800c6d2d00 (size 1152):
  comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00  ................
    02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
    [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
    [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
    [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
    [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
    [<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
    [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
    [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
    [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
    [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
    [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
    [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
    [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

BUG: memory leak
unreferenced object 0xffff888017633a00 (size 240):
  comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff  .........-m.....
  backtrace:
    [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
    [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
    [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
    [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
    [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
    [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
    [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
    [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
    [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
    [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
    [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
    [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
    [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
    [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
    [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
    [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Fixes: b5947e5d1e ("udp: msg_zerocopy")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-25 09:42:35 +01:00
Martin Willi 5fc46f9421 Revert "Fix XFRM-I support for nested ESP tunnels"
This reverts commit b0355dbbf1.

The reverted commit clears the secpath on packets received via xfrm interfaces
to support nested IPsec tunnels. This breaks Netfilter policy matching using
xt_policy in the FORWARD chain, as the secpath is missing during forwarding.
Additionally, Benedict Wong reports that it breaks Transport-in-Tunnel mode.

Fix this regression by reverting the commit until we have a better approach
for nested IPsec tunnels.

Fixes: b0355dbbf1 ("Fix XFRM-I support for nested ESP tunnels")
Link: https://lore.kernel.org/netdev/20230412085615.124791-1-martin@strongswan.org/
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-04-25 09:50:34 +02:00
Linus Torvalds ef36b9afc2 fget() to fdget() conversions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
 64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
 27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
 =GA9W
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fget updates from Al Viro:
 "fget() to fdget() conversions"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse_dev_ioctl(): switch to fdget()
  cgroup_get_from_fd(): switch to fdget_raw()
  bpf: switch to fdget_raw()
  build_mount_idmapped(): switch to fdget()
  kill the last remaining user of proc_ns_fget()
  SVM-SEV: convert the rest of fget() uses to fdget() in there
  convert sgx_set_attribute() to fdget()/fdput()
  convert setns(2) to fdget()/fdput()
2023-04-24 19:14:20 -07:00
Jakub Kicinski 00d0f31a1e net: ethtool: coalesce: try to make user settings stick twice
SET_COALESCE may change operation mode and parameters in one call.
Changing operation mode may cause the driver to reset the parameter
values to what is a reasonable default for new operation mode.

Since driver does not know which parameters come from user and which
are echoed back from ->get, driver may ignore the parameters when
switching operation modes.

This used to be inevitable for ioctl() but in netlink we know which
parameters are actually specified by the user.

We could inform which parameters were set by the user but this would
lead to a lot of code duplication in the drivers. Instead try to call
the drivers twice if both mode and params are changed. The set method
already checks if any params need updating so in case the driver did
the right thing the first time around - there will be no second call
to it's ->set method (only an extra call to ->get()).

For mlx5 for example before this patch we'd see:

 # ethtool -C eth0 adaptive-rx on  adaptive-tx on
 # ethtool -C eth0 adaptive-rx off adaptive-tx off \
		   tx-usecs 123 rx-usecs 123
 Adaptive RX: off  TX: off
 rx-usecs: 3
 rx-frames: 32
 tx-usecs: 16
 tx-frames: 32
 [...]

After the change:

 # ethtool -C eth0 adaptive-rx on  adaptive-tx on
 # ethtool -C eth0 adaptive-rx off adaptive-tx off \
		   tx-usecs 123 rx-usecs 123
 Adaptive RX: off  TX: off
 rx-usecs: 123
 rx-frames: 32
 tx-usecs: 123
 tx-frames: 32
 [...]

This only works for netlink, so it's a small discrepancy between
netlink and ioctl(). Since we anticipate most users to move to
netlink I believe it's worth making their lives easier.

Link: https://lore.kernel.org/r/20230420233302.944382-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24 18:09:49 -07:00
Kuniyuki Iwashima d913d32cc2 netlink: Use copy_to_user() for optval in netlink_getsockopt().
Brad Spencer provided a detailed report [0] that when calling getsockopt()
for AF_NETLINK, some SOL_NETLINK options set only 1 byte even though such
options require at least sizeof(int) as length.

The options return a flag value that fits into 1 byte, but such behaviour
confuses users who do not initialise the variable before calling
getsockopt() and do not strictly check the returned value as char.

Currently, netlink_getsockopt() uses put_user() to copy data to optlen and
optval, but put_user() casts the data based on the pointer, char *optval.
As a result, only 1 byte is set to optval.

To avoid this behaviour, we need to use copy_to_user() or cast optval for
put_user().

Note that this changes the behaviour on big-endian systems, but we document
that the size of optval is int in the man page.

  $ man 7 netlink
  ...
  Socket options
       To set or get a netlink socket option, call getsockopt(2) to read
       or setsockopt(2) to write the option with the option level argument
       set to SOL_NETLINK.  Unless otherwise noted, optval is a pointer to
       an int.

Fixes: 9a4595bc7e ("[NETLINK]: Add set/getsockopt options to support more than 32 groups")
Fixes: be0c22a46c ("netlink: add NETLINK_BROADCAST_ERROR socket option")
Fixes: 38938bfe34 ("netlink: add NETLINK_NO_ENOBUFS socket flag")
Fixes: 0a6a3a23ea ("netlink: add NETLINK_CAP_ACK socket option")
Fixes: 2d4bc93368 ("netlink: extended ACK reporting")
Fixes: 89d35528d1 ("netlink: Add new socket option to enable strict checking on dumps")
Reported-by: Brad Spencer <bspencer@blackberry.com>
Link: https://lore.kernel.org/netdev/ZD7VkNWFfp22kTDt@datsun.rim.net/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/20230421185255.94606-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24 18:05:36 -07:00
Jakub Kicinski ffcddcaed6 netfilter pull request 23-04-22
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmRDIGwACgkQ1V2XiooU
 IOTR2xAAgV2g59s0pPO0VnAs9WTKiMgm+9JZz2uLwOT1/nvVzdz8P7wXTJK47m57
 qziX//P0S37D6e5UkdSHpGBSfeIAMwciDd5WVJgsZYhpZ/QMS3y4GEUXVybMhV+p
 oqX86Kc1u91+kLbgFF1CdbMDvw+oNW43QrIYR1Ok2MqlnZ3dlDajkw5KLYWFvb6t
 9BCmkqZ1xheuN1l8CPZ0La3a9fwqTuC0P7TgSlO1BZF5nhADajcGt8lF8Td98zEZ
 Mrb4VyOrhT6UZgGd+UW1ZlNyduO2Cb6Tj7jx9Es5gX8fDjmnZyAJf1qW9Iinoo46
 4Yr+61oSNNGdE63m1lLm2ZQQJGDlUsuGCjUMW4hyVe8YIrqQg+cOXaTkCa40Xvos
 qvObjrsnAd9xHtBchk4IaQ84J56ifalXULAJs8IZGM5K2XUNwnF7kIOaVtV8Rq+K
 BSSprrM9Z7Lz5W5ucRLlBmDYSDd+ESUnhgJgIuf1CZ04mRBupF4IkqCU5INmVhJR
 472cSc9DKPHmsnFFdszidxBwM6mg00qQ9M4Qvl+dQIOs0a28Pr8nHPOSStgMaecd
 NAPGGEMdRLNaH5KYN1VbseUbbFhXQpXcuNCF1q8fdzpQYyo/+I/lu618sFEhy08r
 KN6+JIDdrcAdMyYgL3rQy57u58xYlwU2NLIE0SLHLqI4grtTxnw=
 =T5eS
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-23-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Reduce jumpstack footprint: Stash chain in last rule marker in blob for
   tracing. Remove last rule and chain from jumpstack. From Florian Westphal.

2) nf_tables validates all tables before committing the new rules.
   Unfortunately, this has two drawbacks:

   - Since addition of the transaction mutex pernet state gets written to
     outside of the locked section from the cleanup callback, this is
     wrong so do this cleanup directly after table has passed all checks.

   - Revalidate tables that saw no changes. This can be avoided by
     keeping the validation state per table, not per netns.

   From Florian Westphal.

3) Get rid of a few redundant pointers in the traceinfo structure.
   The three removed pointers are used in the expression evaluation loop,
   so gcc keeps them in registers. Passing them to the (inlined) helpers
   thus doesn't increase nft_do_chain text size, while stack is reduced
   by another 24 bytes on 64bit arches. From Florian Westphal.

4) IPVS cleanups in several ways without implementing any functional
   changes, aside from removing some debugging output:

   - Update width of source for ip_vs_sync_conn_options
     The operation is safe, use an annotation to describe it properly.

   - Consistently use array_size() in ip_vs_conn_init()
     It seems better to use helpers consistently.

   - Remove {Enter,Leave}Function. These seem to be well past their
     use-by date.

   - Correct spelling in comments.

   From Simon Horman.

5) Extended netlink error report for netdevice in flowtables and
   netdev/chains. Allow for incrementally add/delete devices to netdev
   basechain. Allow to create netdev chain without device.

* tag 'nf-next-23-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: allow to create netdev chain without device
  netfilter: nf_tables: support for deleting devices in an existing netdev chain
  netfilter: nf_tables: support for adding new devices to an existing netdev chain
  netfilter: nf_tables: rename function to destroy hook list
  netfilter: nf_tables: do not send complete notification of deletions
  netfilter: nf_tables: extended netlink error reporting for netdevice
  ipvs: Correct spelling in comments
  ipvs: Remove {Enter,Leave}Function
  ipvs: Consistently use array_size() in ip_vs_conn_init()
  ipvs: Update width of source for ip_vs_sync_conn_options
  netfilter: nf_tables: do not store rule in traceinfo structure
  netfilter: nf_tables: do not store verdict in traceinfo structure
  netfilter: nf_tables: do not store pktinfo in traceinfo structure
  netfilter: nf_tables: remove unneeded conditional
  netfilter: nf_tables: make validation state per table
  netfilter: nf_tables: don't write table validation state without mutex
  netfilter: nf_tables: don't store chain address on jump
  netfilter: nf_tables: don't store address of last rule on jump
  netfilter: nf_tables: merge nft_rules_old structure and end of ruleblob marker
====================

Link: https://lore.kernel.org/r/20230421235021.216950-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24 15:37:36 -07:00
Linus Torvalds 5dfb75e842 RCU Changes for 6.4:
o  MAINTAINERS files additions and changes.
  o  Fix hotplug warning in nohz code.
  o  Tick dependency changes by Zqiang.
  o  Lazy-RCU shrinker fixes by Zqiang.
  o  rcu-tasks stall reporting improvements by Neeraj.
  o  Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
     name for robustness.
  o  Documentation Updates:
  o  Significant changes to srcu_struct size.
  o  Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
  o  rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
  o  Other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
 5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
 96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
 eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
 yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
 XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
 +Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
 cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
 CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
 93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
 4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
 ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
 =Rgbj
 -----END PGP SIGNATURE-----

Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

Pull RCU updates from Joel Fernandes:

 - Updates and additions to MAINTAINERS files, with Boqun being added to
   the RCU entry and Zqiang being added as an RCU reviewer.

   I have also transitioned from reviewer to maintainer; however, Paul
   will be taking over sending RCU pull-requests for the next merge
   window.

 - Resolution of hotplug warning in nohz code, achieved by fixing
   cpu_is_hotpluggable() through interaction with the nohz subsystem.

   Tick dependency modifications by Zqiang, focusing on fixing usage of
   the TICK_DEP_BIT_RCU_EXP bitmask.

 - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
   kernels, fixed by Zqiang.

 - Improvements to rcu-tasks stall reporting by Neeraj.

 - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
   increased robustness, affecting several components like mac802154,
   drbd, vmw_vmci, tracing, and more.

   A report by Eric Dumazet showed that the API could be unknowingly
   used in an atomic context, so we'd rather make sure they know what
   they're asking for by being explicit:

      https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/

 - Documentation updates, including corrections to spelling,
   clarifications in comments, and improvements to the srcu_size_state
   comments.

 - Better srcu_struct cache locality for readers, by adjusting the size
   of srcu_struct in support of SRCU usage by Christoph Hellwig.

 - Teach lockdep to detect deadlocks between srcu_read_lock() vs
   synchronize_srcu() contributed by Boqun.

   Previously lockdep could not detect such deadlocks, now it can.

 - Integration of rcutorture and rcu-related tools, targeted for v6.4
   from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
   module parameter, and more

 - Miscellaneous changes, various code cleanups and comment improvements

* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
  checkpatch: Error out if deprecated RCU API used
  mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
  rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
  ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
  rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
  rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
  rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
  rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
  rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
  rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
  rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
  rcu/trace: use strscpy() to instead of strncpy()
  tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
  ...
2023-04-24 12:16:14 -07:00
Rafael J. Wysocki 21fc506cf5 Merge branches 'acpi-bus', 'acpi-video' and 'acpi-misc'
Merge ACPI bus type driver changes, ACPI backlight driver updates and a
series of cleanups related to of.h for 6.4-rc1:

 - Ensure that ACPI notify handlers are not running after removal and
   clean up code in acpi_sb_notify() (Rafael Wysocki).

 - Remove register_backlight_delay module option and code and remove
   quirks for false-positive backlight control support advertised on
   desktop boards (Hans de Goede).

 - Replace irqdomain.h include with struct declarations in ACPI headers
   and update several pieces of code previously including of.h
   implicitly through those headers (Rob Herring).

* acpi-bus:
  ACPI: bus: Ensure that notify handlers are not running after removal
  ACPI: bus: Add missing braces to acpi_sb_notify()

* acpi-video:
  ACPI: video: Remove desktops without backlight DMI quirks
  ACPI: video: Remove register_backlight_delay module option and code

* acpi-misc:
  ACPI: Replace irqdomain.h include with struct declarations
  fpga: lattice-sysconfig-spi: Add explicit include for of.h
  tpm: atmel: Add explicit include for of.h
  virtio-mmio: Add explicit include for of.h
  pata: ixp4xx: Add explicit include for of.h
  ata: pata_macio: Add explicit include of irqdomain.h
  serial: 8250_tegra: Add explicit include for of.h
  net: rfkill-gpio: Add explicit include for of.h
  staging: iio: resolver: ad2s1210: Add explicit include for of.h
  iio: adc: ad7292: Add explicit include for of.h
2023-04-24 18:01:57 +02:00
Luiz Augusto von Dentz d883a4669a Bluetooth: hci_sync: Only allow hci_cmd_sync_queue if running
This makes sure hci_cmd_sync_queue only queue new work if HCI_RUNNING
has been set otherwise there is a risk of commands being sent while
turning off.

Because hci_cmd_sync_queue can no longer queue work while HCI_RUNNING is
not set it cannot be used to power on adapters so instead
hci_cmd_sync_submit is introduced which bypass the HCI_RUNNING check, so
it behaves like the old implementation.

Link: https://lore.kernel.org/all/CAB4PzUpDMvdc8j2MdeSAy1KkAE-D3woprCwAdYWeOc-3v3c9Sw@mail.gmail.com/
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:07:43 -07:00
Archie Pusaka f419863588 Bluetooth: Cancel sync command before suspend and power off
Some of the sync commands might take a long time to complete, e.g.
LE Create Connection when the peer device isn't responding might take
20 seconds before it times out. If suspend command is issued during
this time, it will need to wait for completion since both commands are
using the same sync lock.

This patch cancel any running sync commands before attempting to
suspend or adapter power off.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:06:55 -07:00
Zijun Hu 0ab905c3be Bluetooth: Devcoredump: Fix storing u32 without specifying byte order issue
API hci_devcd_init() stores its u32 type parameter @dump_size into
skb, but it does not specify which byte order is used to store the
integer, let us take little endian to store and parse the integer.

Fixes: f5cc609d09d4 ("Bluetooth: Add support for hci devcoredump")
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:05:53 -07:00
Ruihan Li 25c150ac10 bluetooth: Perform careful capability checks in hci_sock_ioctl()
Previously, capability was checked using capable(), which verified that the
caller of the ioctl system call had the required capability. In addition,
the result of the check would be stored in the HCI_SOCK_TRUSTED flag,
making it persistent for the socket.

However, malicious programs can abuse this approach by deliberately sharing
an HCI socket with a privileged task. The HCI socket will be marked as
trusted when the privileged task occasionally makes an ioctl call.

This problem can be solved by using sk_capable() to check capability, which
ensures that not only the current task but also the socket opener has the
specified capability, thus reducing the risk of privilege escalation
through the previously identified vulnerability.

Cc: stable@vger.kernel.org
Fixes: f81f5b2db8 ("Bluetooth: Send control open and close messages for HCI raw sockets")
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:05:39 -07:00
Min Li 25e97f7b18 Bluetooth: L2CAP: fix "bad unlock balance" in l2cap_disconnect_rsp
conn->chan_lock isn't acquired before l2cap_get_chan_by_scid,
if l2cap_get_chan_by_scid returns NULL, then 'bad unlock balance'
is triggered.

Reported-by: syzbot+9519d6b5b79cf7787cf3@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000894f5f05f95e9f4d@google.com/
Signed-off-by: Min Li <lm0963hack@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:05:23 -07:00
Ruihan Li 000c2fa2c1 bluetooth: Add cmd validity checks at the start of hci_sock_ioctl()
Previously, channel open messages were always sent to monitors on the first
ioctl() call for unbound HCI sockets, even if the command and arguments
were completely invalid. This can leave an exploitable hole with the abuse
of invalid ioctl calls.

This commit hardens the ioctl processing logic by first checking if the
command is valid, and immediately returning with an ENOIOCTLCMD error code
if it is not. This ensures that ioctl calls with invalid commands are free
of side effects, and increases the difficulty of further exploitation by
forcing exploitation to find a way to pass a valid command first.

Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Co-developed-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:05:06 -07:00
Raul Cheleguini 91b6d02ddc Bluetooth: Add new quirk for broken set random RPA timeout for ATS2851
The ATS2851 based controller advertises support for command "LE Set Random
Private Address Timeout" but does not actually implement it, impeding the
controller initialization.

Add the quirk HCI_QUIRK_BROKEN_SET_RPA_TIMEOUT to unblock the controller
initialization.

< HCI Command: LE Set Resolvable Private... (0x08|0x002e) plen 2
        Timeout: 900 seconds
> HCI Event: Command Status (0x0f) plen 4
      LE Set Resolvable Private Address Timeout (0x08|0x002e) ncmd 1
        Status: Unknown HCI Command (0x01)

Co-developed-by: imoc <wzj9912@gmail.com>
Signed-off-by: imoc <wzj9912@gmail.com>
Signed-off-by: Raul Cheleguini <raul.cheleguini@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:04:26 -07:00
Luiz Augusto von Dentz c09b80be6f Bluetooth: hci_conn: Fix not waiting for HCI_EVT_LE_CIS_ESTABLISHED
When submitting HCI_OP_LE_CREATE_CIS the code shall wait for
HCI_EVT_LE_CIS_ESTABLISHED thus enforcing the serialization of
HCI_OP_LE_CREATE_CIS as the Core spec does not allow to send them in
parallel:

  BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 4, Part E page 2566:

  If the Host issues this command before all the HCI_LE_CIS_Established
  events from the previous use of the command have been generated, the
  Controller shall return the error code Command Disallowed (0x0C).

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:04:01 -07:00
Luiz Augusto von Dentz c14516faed Bluetooth: hci_conn: Fix not matching by CIS ID
This fixes only matching CIS by address which prevents creating new hcon
if upper layer is requesting a specific CIS ID.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:03:37 -07:00
Luiz Augusto von Dentz 06149746e7 Bluetooth: hci_conn: Add support for linking multiple hcon
Since it is required for some configurations to have multiple CIS with
the same peer which is now covered by iso-tester in the following test
cases:

    ISO AC 6(i) - Success
    ISO AC 7(i) - Success
    ISO AC 8(i) - Success
    ISO AC 9(i) - Success
    ISO AC 11(i) - Success

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:03:13 -07:00
Iulia Tanasescu 9e3c2ea67a Bluetooth: hci_conn: remove extra line in hci_le_big_create_sync
Remove extra line setting the broadcast code parameter of the
hci_cp_le_create_big struct to 0. The broadcast code is copied
from the QoS struct.

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:02:28 -07:00
Lanzhe Li 3c690a0d64 Bluetooth: fix inconsistent indenting
Fixed a wrong indentation before "return".This line uses a 7 space
indent instead of a tab.

Signed-off-by: Lanzhe Li <u202212060@hust.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 22:02:14 -07:00
Luiz Augusto von Dentz 288c90224e Bluetooth: Enable all supported LE PHY by default
This enables 2M and Coded PHY by default if they are marked as supported
in the LE features bits.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:59:41 -07:00
Iulia Tanasescu 0fe8c8d071 Bluetooth: Split bt_iso_qos into dedicated structures
Split bt_iso_qos into dedicated unicast and broadcast
structures and add additional broadcast parameters.

Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:59:17 -07:00
Abhishek Pandit-Subedi 9695ef876f Bluetooth: Add support for hci devcoredump
Add devcoredump APIs to hci core so that drivers only have to provide
the dump skbs instead of managing the synchronization and timeouts.

The devcoredump APIs should be used in the following manner:
 - hci_devcoredump_init is called to allocate the dump.
 - hci_devcoredump_append is called to append any skbs with dump data
   OR hci_devcoredump_append_pattern is called to insert a pattern.
 - hci_devcoredump_complete is called when all dump packets have been
   sent OR hci_devcoredump_abort is called to indicate an error and
   cancel an ongoing dump collection.

The high level APIs just prepare some skbs with the appropriate data and
queue it for the dump to process. Packets part of the crashdump can be
intercepted in the driver in interrupt context and forwarded directly to
the devcoredump APIs.

Internally, there are 5 states for the dump: idle, active, complete,
abort and timeout. A devcoredump will only be in active state after it
has been initialized. Once active, it accepts data to be appended,
patterns to be inserted (i.e. memset) and a completion event or an abort
event to generate a devcoredump. The timeout is initialized at the same
time the dump is initialized (defaulting to 10s) and will be cleared
either when the timeout occurs or the dump is complete or aborted.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:57:59 -07:00
Vasily Khoruzhick 8194f1ef5a Bluetooth: Add new quirk for broken local ext features page 2
Some adapters (e.g. RTL8723CS) advertise that they have more than
2 pages for local ext features, but they don't support any features
declared in these pages. RTL8723CS reports max_page = 2 and declares
support for sync train and secure connection, but it responds with
either garbage or with error in status on corresponding commands.

Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Bastian Germann <bage@debian.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:50:01 -07:00
Luiz Augusto von Dentz b8b23001b8 Bluetooth: L2CAP: Delay identity address updates
This delays the identity address updates to give time for userspace to
process the new address otherwise there is a risk that userspace
creates a duplicated device if the MGMT event is delayed for some
reason.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:48:44 -07:00
Inga Stotland 9f5ae8fcae Bluetooth: hci_sync: Remove duplicate statement
This removes the following duplicate statement in
hci_le_ext_directed_advertising_sync():
cp.own_addr_type = own_addr_type;

Signed-off-by: Inga Stotland <inga.stotland@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:48:28 -07:00
Brian Gix 493ea69970 Bluetooth: Convert MSFT filter HCI cmd to hci_sync
The msft_set_filter_enable() command was using the deprecated
hci_request mechanism rather than hci_sync. This caused the warning error:
hci0: HCI_REQ-0xfcf0

Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:48:13 -07:00
Archie Pusaka f5d13029a4 Bluetooth: hci_sync: Don't wait peer's reply when powering off
Currently, when we initiate disconnection, we will wait for the peer's
reply unless when we are suspending, where we fire and forget the
disconnect request.

A similar case is when adapter is powering off. However, we still wait
for the peer's reply in this case. Therefore, if the peer is
unresponsive, the command will time out and the power off sequence
will fail, causing "bluetooth powered on by itself" to users.

This patch makes the host doesn't wait for the peer's reply when the
disconnection reason is powering off.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:47:57 -07:00
Luiz Augusto von Dentz ae051b044e Bluetooth: hci_sync: Fix smatch warning
This fixes the following new warning:

net/bluetooth/hci_sync.c:2403 hci_pause_addr_resolution() warn: missing
error code? 'err'

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/r/202302251952.xryXOegd-lkp@intel.com/
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23 21:46:25 -07:00
Pedro Tammela 25369891fc net/sched: sch_qfq: refactor parsing of netlink parameters
Two parameters can be transformed into netlink policies and
validated while parsing the netlink message.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:47:09 +01:00
Pedro Tammela c69a9b023f net/sched: sch_qfq: use extack on errors messages
Some error messages are still being printed to dmesg.
Since extack is available, provide error messages there.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:47:09 +01:00
Pedro Tammela 807cfded92 net/sched: sch_htb: use extack on errors messages
Some error messages are still being printed to dmesg.
Since extack is available, provide error messages there.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:47:09 +01:00
Pedro Tammela e3c9673e2f net/sched: act_pedit: rate limit datapath messages
Unbounded info messages in the pedit datapath can flood the printk
ring buffer quite easily depending on the action created.
As these messages are informational, usually printing some, not all,
is enough to bring attention to the real issue.

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:35:27 +01:00
Pedro Tammela 577140180b net/sched: act_pedit: remove extra check for key type
The netlink parsing already validates the key 'htype'.
Remove the datapath check as it's redundant.

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:35:27 +01:00
Pedro Tammela e1201bc781 net/sched: act_pedit: check static offsets a priori
Static key offsets should always be on 32 bit boundaries. Validate them on
create/update time for static offsets and move the datapath validation
for runtime offsets only.

iproute2 already errors out if a given offset and data size cannot be
packed to a 32 bit boundary. This change will make sure users which
create/update pedit instances directly via netlink also error out,
instead of finding out when packets are traversing.

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:35:27 +01:00
Pedro Tammela 0c83c5210e net/sched: act_pedit: use extack in 'ex' parsing errors
We have extack available when parsing 'ex' keys, so pass it to
tcf_pedit_keys_ex_parse and add more detailed error messages.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:35:27 +01:00
Pedro Tammela 5036034572 net/sched: act_pedit: use NLA_POLICY for parsing 'ex' keys
Transform two checks in the 'ex' key parsing into netlink policies
removing extra if checks.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 18:35:26 +01:00
Yajun Deng 2f0f9465ad net: sched: Print msecs when transmit queue time out
The kernel will print several warnings in a short period of time
when it stalls. Like this:

First warning:
[ 7100.097547] ------------[ cut here ]------------
[ 7100.097550] NETDEV WATCHDOG: eno2 (xxx): transmit queue 8 timed out
[ 7100.097571] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:467
                       dev_watchdog+0x260/0x270
...

Second warning:
[ 7147.756952] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 7147.756958] rcu:   24-....: (59999 ticks this GP) idle=546/1/0x400000000000000
                      softirq=367      3137/3673146 fqs=13844
[ 7147.756960]        (t=60001 jiffies g=4322709 q=133381)
[ 7147.756962] NMI backtrace for cpu 24
...

We calculate that the transmit queue start stall should occur before
7095s according to watchdog_timeo, the rcu start stall at 7087s.
These two times are close together, it is difficult to confirm which
happened first.

To let users know the exact time the stall started, print msecs when
the transmit queue time out.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:18:05 +01:00
Vladimir Oltean 0bcf2e4aca net: dsa: tag_ocelot: call only the relevant portion of __skb_vlan_pop() on TX
ocelot_xmit_get_vlan_info() calls __skb_vlan_pop() as the most
appropriate helper I could find which strips away a VLAN header.
That's all I need it to do, but __skb_vlan_pop() has more logic, which
will become incompatible with the future revert of commit 6d1ccff627
("net: reset mac header in dev_start_xmit()").

Namely, it performs a sanity check on skb_mac_header(), which will stop
being set after the above revert, so it will return an error instead of
removing the VLAN tag.

ocelot_xmit_get_vlan_info() gets called in 2 circumstances:

(1) the port is under a VLAN-aware bridge and the bridge sends
    VLAN-tagged packets

(2) the port is under a VLAN-aware bridge and somebody else (an 8021q
    upper) sends VLAN-tagged packets (using a VID that isn't in the
    bridge vlan tables)

In case (1), there is actually no bug to defend against, because
br_dev_xmit() calls skb_reset_mac_header() and things continue to work.

However, in case (2), illustrated using the commands below, it can be
seen that our intervention is needed, since __skb_vlan_pop() complains:

$ ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
$ ip link set $eth master br0 && ip link set $eth up
$ ip link add link $eth name $eth.100 type vlan id 100 && ip link set $eth.100 up
$ ip addr add 192.168.100.1/24 dev $eth.100

I could fend off the checks in __skb_vlan_pop() with some
skb_mac_header_was_set() calls, but seeing how few callers of
__skb_vlan_pop() there are from TX paths, that seems rather
unproductive.

As an alternative solution, extract the bare minimum logic to strip a
VLAN header, and move it to a new helper named vlan_remove_tag(), close
to the definition of vlan_insert_tag(). Document it appropriately and
make ocelot_xmit_get_vlan_info() call this smaller helper instead.

Seeing that it doesn't appear illegal to test skb->protocol in the TX
path, I guess it would be a good for vlan_remove_tag() to also absorb
the vlan_set_encap_proto() function call.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:45 +01:00
Vladimir Oltean f0a9d56306 net: dsa: update TX path comments to not mention skb_mac_header()
Once commit 6d1ccff627 ("net: reset mac header in dev_start_xmit()")
will be reverted, it will no longer be true that skb->data points at
skb_mac_header(skb) - since the skb->mac_header will not be set - so
stop saying that, and just say that it points to the MAC header.

I've reviewed vlan_insert_tag() and it does not *actually* depend on
skb_mac_header(), so reword that to avoid the confusion.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:45 +01:00
Vladimir Oltean b5653b157e net: dsa: tag_sja1105: replace skb_mac_header() with vlan_eth_hdr()
This is a cosmetic patch which consolidates the code to use the helper
function offered by if_vlan.h.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:44 +01:00
Vladimir Oltean f9346f00b5 net: dsa: tag_sja1105: don't rely on skb_mac_header() in TX paths
skb_mac_header() will no longer be available in the TX path when
reverting commit 6d1ccff627 ("net: reset mac header in
dev_start_xmit()"). As preparation for that, let's use
skb_vlan_eth_hdr() to get to the VLAN header instead, which assumes it's
located at skb->data (assumption which holds true here).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:44 +01:00
Vladimir Oltean 499b2491d5 net: dsa: tag_ksz: do not rely on skb_mac_header() in TX paths
skb_mac_header() will no longer be available in the TX path when
reverting commit 6d1ccff627 ("net: reset mac header in
dev_start_xmit()"). As preparation for that, let's use skb_eth_hdr() to
get to the Ethernet header's MAC DA instead, helper which assumes this
header is located at skb->data (assumption which holds true here).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:44 +01:00
Vladimir Oltean eabb1494c9 net: dsa: tag_ocelot: do not rely on skb_mac_header() for VLAN xmit
skb_mac_header() will no longer be available in the TX path when
reverting commit 6d1ccff627 ("net: reset mac header in
dev_start_xmit()"). As preparation for that, let's use
skb_vlan_eth_hdr() to get to the VLAN header instead, which assumes it's
located at skb->data (assumption which holds true here).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:44 +01:00
Vladimir Oltean 1f5020acb3 net: vlan: introduce skb_vlan_eth_hdr()
Similar to skb_eth_hdr() introduced in commit 96cc4b6958 ("macvlan: do
not assume mac_header is set in macvlan_broadcast()"), let's introduce a
skb_vlan_eth_hdr() helper which can be used in TX-only code paths to get
to the VLAN header based on skb->data rather than based on the
skb_mac_header(skb).

We also consolidate the drivers that dereference skb->data to go through
this helper.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 14:16:44 +01:00
Marc Dionne fadfc57cc8 rxrpc: Fix error when reading rxrpc tokens
When converting from ASSERTCMP to WARN_ON, the tested condition must
be inverted, which was missed for this case.

This would cause an EIO error when trying to read an rxrpc token, for
instance when trying to display tokens with AuriStor's "tokens" command.

Fixes: 84924aac08 ("rxrpc: Fix checker warning")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:38:28 +01:00
Gustavo A. R. Silva 788352191c rxrpc: Replace fake flex-array with flexible-array member
Zero-length arrays as fake flexible arrays are deprecated and we are
moving towards adopting C99 flexible-array members instead.

Transform zero-length array into flexible-array member in struct
rxrpc_ackpacket.

Address the following warnings found with GCC-13 and
-fstrict-flex-arrays=3 enabled:
net/rxrpc/call_event.c:149:38: warning: array subscript i is outside array bounds of ‘uint8_t[0]’ {aka ‘unsigned char[]’} [-Warray-bounds=]

This helps with the ongoing efforts to tighten the FORTIFY_SOURCE
routines on memcpy() and help us make progress towards globally
enabling -fstrict-flex-arrays=3 [1].

Link: https://github.com/KSPP/linux/issues/21
Link: https://github.com/KSPP/linux/issues/263
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/ZAZT11n4q5bBttW0@work/
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:36:05 +01:00
Eric Dumazet 87eff2ec57 net: optimize napi_threaded_poll() vs RPS/RFS
We use napi_threaded_poll() in order to reduce our softirq dependency.

We can add a followup of 821eba962d ("net: optimize napi_schedule_rps()")
to further remove the need of firing NET_RX_SOFTIRQ whenever
RPS/RFS are used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:35:07 +01:00
Eric Dumazet a1aaee7f8f net: make napi_threaded_poll() aware of sd->defer_list
If we call skb_defer_free_flush() from napi_threaded_poll(),
we can avoid to raise IPI from skb_attempt_defer_free()
when the list becomes too big.

This allows napi_threaded_poll() to rely less on softirqs,
and lowers latency caused by a too big list.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:35:07 +01:00
Eric Dumazet e6f50edfef net: move skb_defer_free_flush() up
We plan using skb_defer_free_flush() from napi_threaded_poll()
in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:35:07 +01:00
Eric Dumazet 931e93bdf8 net: do not provide hard irq safety for sd->defer_lock
kfree_skb() can be called from hard irq handlers,
but skb_attempt_defer_free() is meant to be used
from process or BH contexts, and skb_defer_free_flush()
is meant to be called from BH contexts.

Not having to mask hard irq can save some cycles.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:35:07 +01:00
Eric Dumazet e8e1ce8454 net: add debugging checks in skb_attempt_defer_free()
Make sure skbs that are stored in softnet_data.defer_list
do not have a dst attached.

Also make sure the the skb was orphaned.

Link: https://lore.kernel.org/netdev/CANn89iJuEVe72bPmEftyEJHLzzN=QNR2yueFjTxYXCEpS5S8HQ@mail.gmail.com/T/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:35:07 +01:00
David Howells e0416e7d33 rxrpc: Fix potential race in error handling in afs_make_call()
If the rxrpc call set up by afs_make_call() receives an error whilst it is
transmitting the request, there's the possibility that it may get to the
point the rxrpc call is ended (after the error_kill_call label) just as the
call is queued for async processing.

This could manifest itself as call->rxcall being seen as NULL in
afs_deliver_to_call() when it tries to lock the call.

Fix this by splitting rxrpc_kernel_end_call() into a function to shut down
an rxrpc call and a function to release the caller's reference and calling
the latter only when we get to afs_put_call().

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing+fedora36_64checkkafs-build-306@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-22 15:16:39 +01:00
Ziyang Xuan 99e5acae19 ipv4: Fix potential uninit variable access bug in __ip_make_skb()
Like commit ea30388bae ("ipv6: Fix an uninit variable access bug in
__ip6_make_skb()"). icmphdr does not in skb linear region under the
scenario of SOCK_RAW socket. Access icmp_hdr(skb)->type directly will
trigger the uninit variable access bug.

Use a local variable icmp_type to carry the correct value in different
scenarios.

Fixes: 96793b4825 ("[IPV4]: Add ICMPMsgStats MIB (RFC 4293)")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-22 14:10:39 +01:00
Jakub Kicinski 9a82cdc28f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZELn8wAKCRDbK58LschI
 g1khAQC1nmXPuKjM4EAfFK8Ysb3KoF8ADmpE97n+/HEDydCagwD/bX0+NABR75Nh
 ueGcoU1TcfcbshDzrH0s+C95owZDZw4=
 =BeZM
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-21

We've added 71 non-merge commits during the last 8 day(s) which contain
a total of 116 files changed, 13397 insertions(+), 8896 deletions(-).

The main changes are:

1) Add a new BPF netfilter program type and minimal support to hook
   BPF programs to netfilter hooks such as prerouting or forward,
   from Florian Westphal.

2) Fix race between btf_put and btf_idr walk which caused a deadlock,
   from Alexei Starovoitov.

3) Second big batch to migrate test_verifier unit tests into test_progs
   for ease of readability and debugging, from Eduard Zingerman.

4) Add support for refcounted local kptrs to the verifier for allowing
   shared ownership, useful for adding a node to both the BPF list and
   rbtree, from Dave Marchevsky.

5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF
  selftests into libbpf-provided bpf_helpers.h header and improve
  kfunc handling, from Andrii Nakryiko.

6) Support 64-bit pointers to kfuncs needed for archs like s390x,
   from Ilya Leoshkevich.

7) Support BPF progs under getsockopt with a NULL optval,
   from Stanislav Fomichev.

8) Improve verifier u32 scalar equality checking in order to enable
   LLVM transformations which earlier had to be disabled specifically
   for BPF backend, from Yonghong Song.

9) Extend bpftool's struct_ops object loading to support links,
   from Kui-Feng Lee.

10) Add xsk selftest follow-up fixes for hugepage allocated umem,
    from Magnus Karlsson.

11) Support BPF redirects from tc BPF to ifb devices,
    from Daniel Borkmann.

12) Add BPF support for integer type when accessing variable length
    arrays, from Feng Zhou.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits)
  selftests/bpf: verifier/value_ptr_arith converted to inline assembly
  selftests/bpf: verifier/value_illegal_alu converted to inline assembly
  selftests/bpf: verifier/unpriv converted to inline assembly
  selftests/bpf: verifier/subreg converted to inline assembly
  selftests/bpf: verifier/spin_lock converted to inline assembly
  selftests/bpf: verifier/sock converted to inline assembly
  selftests/bpf: verifier/search_pruning converted to inline assembly
  selftests/bpf: verifier/runtime_jit converted to inline assembly
  selftests/bpf: verifier/regalloc converted to inline assembly
  selftests/bpf: verifier/ref_tracking converted to inline assembly
  selftests/bpf: verifier/map_ptr_mixing converted to inline assembly
  selftests/bpf: verifier/map_in_map converted to inline assembly
  selftests/bpf: verifier/lwt converted to inline assembly
  selftests/bpf: verifier/loops1 converted to inline assembly
  selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly
  selftests/bpf: verifier/direct_packet_access converted to inline assembly
  selftests/bpf: verifier/d_path converted to inline assembly
  selftests/bpf: verifier/ctx converted to inline assembly
  selftests/bpf: verifier/btf_ctx_access converted to inline assembly
  selftests/bpf: verifier/bpf_get_stack converted to inline assembly
  ...
====================

Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:32:37 -07:00
Maxime Bizon 418a73074d net: dst: fix missing initialization of rt_uncached
xfrm_alloc_dst() followed by xfrm4_dst_destroy(), without a
xfrm4_fill_dst() call in between, causes the following BUG:

 BUG: spinlock bad magic on CPU#0, fbxhostapd/732
  lock: 0x890b7668, .magic: 890b7668, .owner: <none>/-1, .owner_cpu: 0
 CPU: 0 PID: 732 Comm: fbxhostapd Not tainted 6.3.0-rc6-next-20230414-00613-ge8de66369925-dirty #9
 Hardware name: Marvell Kirkwood (Flattened Device Tree)
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x28/0x30
  dump_stack_lvl from do_raw_spin_lock+0x20/0x80
  do_raw_spin_lock from rt_del_uncached_list+0x30/0x64
  rt_del_uncached_list from xfrm4_dst_destroy+0x3c/0xbc
  xfrm4_dst_destroy from dst_destroy+0x5c/0xb0
  dst_destroy from rcu_process_callbacks+0xc4/0xec
  rcu_process_callbacks from __do_softirq+0xb4/0x22c
  __do_softirq from call_with_stack+0x1c/0x24
  call_with_stack from do_softirq+0x60/0x6c
  do_softirq from __local_bh_enable_ip+0xa0/0xcc

Patch "net: dst: Prevent false sharing vs. dst_entry:: __refcnt" moved
rt_uncached and rt_uncached_list fields from rtable struct to dst
struct, so they are more zeroed by memset_after(xdst, 0, u.dst) in
xfrm_alloc_dst().

Note that rt_uncached (list_head) was never properly initialized at
alloc time, but xfrm[46]_dst_destroy() is written in such a way that
it was not an issue thanks to the memset:

	if (xdst->u.rt.dst.rt_uncached_list)
		rt_del_uncached_list(&xdst->u.rt);

The route code does it the other way around: rt_uncached_list is
assumed to be valid IIF rt_uncached list_head is not empty:

void rt_del_uncached_list(struct rtable *rt)
{
        if (!list_empty(&rt->dst.rt_uncached)) {
                struct uncached_list *ul = rt->dst.rt_uncached_list;

                spin_lock_bh(&ul->lock);
                list_del_init(&rt->dst.rt_uncached);
                spin_unlock_bh(&ul->lock);
        }
}

This patch adds mandatory rt_uncached list_head initialization in
generic dst_init(), and adapt xfrm[46]_dst_destroy logic to match the
rest of the code.

Fixes: d288a162dd ("net: dst: Prevent false sharing vs. dst_entry:: __refcnt")
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304162125.18b7bcdd-oliver.sang@intel.com
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
CC: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Link: https://lore.kernel.org/r/20230420182508.2417582-1-mbizon@freebox.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:26:56 -07:00
Ivan Vecera 2cc8a008d6 net/sched: cls_api: Initialize miss_cookie_node when action miss is not used
Function tcf_exts_init_ex() sets exts->miss_cookie_node ptr only
when use_action_miss is true so it assumes in other case that
the field is set to NULL by the caller. If not then the field
contains garbage and subsequent tcf_exts_destroy() call results
in a crash.
Ensure that the field .miss_cookie_node pointer is NULL when
use_action_miss parameter is false to avoid this potential scenario.

Fixes: 80cd22c35c ("net/sched: cls_api: Support hardware miss to tc action")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230420183634.1139391-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:25:57 -07:00
Geert Uytterhoeven 6aa445e396 net/handshake: Fix section mismatch in handshake_exit
If CONFIG_NET_NS=n (e.g. m68k/defconfig):

    WARNING: modpost: vmlinux.o: section mismatch in reference: handshake_exit (section: .exit.text) -> handshake_genl_net_ops (section: .init.data)
    ERROR: modpost: Section mismatches detected.

Fix this by dropping the __net_initdata tag from handshake_genl_net_ops.

Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Reported-by: noreply@ellerman.id.au
Closes: http://kisskb.ellerman.id.au/kisskb/buildresult/14912987
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230420173723.3773434-1-geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:24:57 -07:00
Davide Caratti 7041101ff6 net/sched: sch_fq: fix integer overflow of "credit"
if sch_fq is configured with "initial quantum" having values greater than
INT_MAX, the first assignment of "credit" does signed integer overflow to
a very negative value.
In this situation, the syzkaller script provided by Cristoph triggers the
CPU soft-lockup warning even with few sockets. It's not an infinite loop,
but "credit" wasn't probably meant to be minus 2Gb for each new flow.
Capping "initial quantum" to INT_MAX proved to fix the issue.

v2: validation of "initial quantum" is done in fq_policy, instead of open
    coding in fq_change() _ suggested by Jakub Kicinski

Reported-by: Christoph Paasch <cpaasch@apple.com>
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/377
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/7b3a3c7e36d03068707a021760a194a8eb5ad41a.1682002300.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:24:29 -07:00
Pablo Neira Ayuso 207296f1a0 netfilter: nf_tables: allow to create netdev chain without device
Relax netdev chain creation to allow for loading the ruleset, then
adding/deleting devices at a later stage. Hardware offload does not
support for this feature yet.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:42 +02:00
Pablo Neira Ayuso 7d937b1071 netfilter: nf_tables: support for deleting devices in an existing netdev chain
This patch allows for deleting devices in an existing netdev chain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:42 +02:00
Pablo Neira Ayuso b9703ed44f netfilter: nf_tables: support for adding new devices to an existing netdev chain
This patch allows users to add devices to an existing netdev chain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:42 +02:00
Pablo Neira Ayuso cdc3254663 netfilter: nf_tables: rename function to destroy hook list
Rename nft_flowtable_hooks_destroy() by nft_hooks_destroy() to prepare
for netdev chain device updates.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:42 +02:00
Pablo Neira Ayuso 28339b21a3 netfilter: nf_tables: do not send complete notification of deletions
In most cases, table, name and handle is sufficient for userspace to
identify an object that has been deleted. Skipping unneeded fields in
the netlink attributes in the message saves bandwidth (ie. less chances
of hitting ENOBUFS).

Rules are an exception: the existing userspace monitor code relies on
the rule definition. This exception can be removed by implementing a
rule cache in userspace, this is already supported by the tracing
infrastructure.

Regarding flowtables, incremental deletion of devices is possible.
Skipping a full notification allows userspace to differentiate between
flowtable removal and incremental removal of devices.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Pablo Neira Ayuso c3c060adc0 netfilter: nf_tables: extended netlink error reporting for netdevice
Flowtable and netdev chains are bound to one or several netdevice,
extend netlink error reporting to specify the the netdevice that
triggers the error.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Simon Horman 210ffe4a74 ipvs: Remove {Enter,Leave}Function
Remove EnterFunction and LeaveFunction.

These debugging macros seem well past their use-by date.  And seem to
have little value these days. Removing them allows some trivial cleanup
of some exit paths for some functions. These are also included in this
patch. There is likely scope for further cleanup of both debugging and
unwind paths. But let's leave that for another day.

Only intended to change debug output, and only when CONFIG_IP_VS_DEBUG
is enabled. Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Simon Horman 280654932e ipvs: Consistently use array_size() in ip_vs_conn_init()
Consistently use array_size() to calculate the size of ip_vs_conn_tab
in bytes.

Flagged by Coccinelle:
 WARNING: array_size is already used (line 1498) to compute the same size

No functional change intended.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Simon Horman e3478c68f6 ipvs: Update width of source for ip_vs_sync_conn_options
In ip_vs_sync_conn_v0() copy is made to struct ip_vs_sync_conn_options.
That structure looks like this:

struct ip_vs_sync_conn_options {
        struct ip_vs_seq        in_seq;
        struct ip_vs_seq        out_seq;
};

The source of the copy is the in_seq field of struct ip_vs_conn.  Whose
type is struct ip_vs_seq. Thus we can see that the source - is not as
wide as the amount of data copied, which is the width of struct
ip_vs_sync_conn_option.

The copy is safe because the next field in is another struct ip_vs_seq.
Make use of struct_group() to annotate this.

Flagged by gcc-13 as:

 In file included from ./include/linux/string.h:254,
                  from ./include/linux/bitmap.h:11,
                  from ./include/linux/cpumask.h:12,
                  from ./arch/x86/include/asm/paravirt.h:17,
                  from ./arch/x86/include/asm/cpuid.h:62,
                  from ./arch/x86/include/asm/processor.h:19,
                  from ./arch/x86/include/asm/timex.h:5,
                  from ./include/linux/timex.h:67,
                  from ./include/linux/time32.h:13,
                  from ./include/linux/time.h:60,
                  from ./include/linux/stat.h:19,
                  from ./include/linux/module.h:13,
                  from net/netfilter/ipvs/ip_vs_sync.c:38:
 In function 'fortify_memcpy_chk',
     inlined from 'ip_vs_sync_conn_v0' at net/netfilter/ipvs/ip_vs_sync.c:606:3:
 ./include/linux/fortify-string.h:529:25: error: call to '__read_overflow2_field' declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
   529 |                         __read_overflow2_field(q_size_field, size);
       |

Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Florian Westphal 46df417544 netfilter: nf_tables: do not store rule in traceinfo structure
pass it as argument instead.  This reduces size of traceinfo to
16 bytes.  Total stack usage:

 nf_tables_core.c:252 nft_do_chain    304     static

While its possible to also pass basechain as argument, doing so
increases nft_do_chaininfo function size.

Unlike pktinfo/verdict/rule the basechain info isn't used in
the expression evaluation path. gcc places it on the stack, which
results in extra push/pop when it gets passed to the trace helpers
as argument rather than as part of the traceinfo structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Florian Westphal 0a202145d5 netfilter: nf_tables: do not store verdict in traceinfo structure
Just pass it as argument to nft_trace_notify. Stack is reduced by 8 bytes:

nf_tables_core.c:256 nft_do_chain    312     static

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Florian Westphal 698bb828a6 netfilter: nf_tables: do not store pktinfo in traceinfo structure
pass it as argument.  No change in object size.

stack usage decreases by 8 byte:
 nf_tables_core.c:254  nft_do_chain       320     static

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Florian Westphal 2a1d6abd7e netfilter: nf_tables: remove unneeded conditional
This helper is inlined, so keep it as small as possible.

If the static key is true, there is only a very small chance
that info->trace is false:

1. tracing was enabled at this very moment, the static key was
   updated to active right after nft_do_table was called.

2. tracing was disabled at this very moment.
   trace->info is already false, the static key is about to
   be patched to false soon.

In both cases, no event will be sent because info->trace
is false (checked in noinline slowpath). info->nf_trace is irrelevant.

The nf_trace update is redunant in this case, but this will only
happen for short duration, when static key flips.

       text  data   bss   dec   hex filename
old:   2980   192    32  3204   c84 nf_tables_core.o
new:   2964   192    32  3188   c74i nf_tables_core.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:41 +02:00
Florian Westphal 00c320f9b7 netfilter: nf_tables: make validation state per table
We only need to validate tables that saw changes in the current
transaction.

The existing code revalidates all tables, but this isn't needed as
cross-table jumps are not allowed (chains have table scope).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:40 +02:00
Florian Westphal 9a32e98506 netfilter: nf_tables: don't write table validation state without mutex
The ->cleanup callback needs to be removed, this doesn't work anymore as
the transaction mutex is already released in the ->abort function.

Just do it after a successful validation pass, this either happens
from commit or abort phases where transaction mutex is held.

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:40 +02:00
Florian Westphal 63e9bbbcca netfilter: nf_tables: don't store chain address on jump
Now that the rule trailer/end marker and the rcu head reside in the
same structure, we no longer need to save/restore the chain pointer
when performing/returning from a jump.

We can simply let the trace infra walk the evaluated rule until it
hits the end marker and then fetch the chain pointer from there.

When the rule is NULL (policy tracing), then chain and basechain
pointers were already identical, so just use the basechain.

This cuts size of jumpstack in half, from 256 to 128 bytes in 64bit,
scripts/stackusage says:

nf_tables_core.c:251 nft_do_chain    328     static

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:40 +02:00
Florian Westphal d4d89e6546 netfilter: nf_tables: don't store address of last rule on jump
Walk the rule headers until the trailer one (last_bit flag set) instead
of stopping at last_rule address.

This avoids the need to store the address when jumping to another chain.

This cuts size of jumpstack array by one third, on 64bit from
384 to 256 bytes.  Still, stack usage is still quite large:

scripts/stackusage:
nf_tables_core.c:258 nft_do_chain    496     static

Next patch will also remove chain pointer.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:40 +02:00
Florian Westphal e38fbfa972 netfilter: nf_tables: merge nft_rules_old structure and end of ruleblob marker
In order to free the rules in a chain via call_rcu, the rule array used
to stash a rcu_head and space for a pointer at the end of the rule array.

When the current nft_rule_dp blob format got added in
2c865a8a28 ("netfilter: nf_tables: add rule blob layout"), this results
in a double-trailer:

  size (unsigned long)
  struct nft_rule_dp
    struct nft_expr
         ...
    struct nft_rule_dp
     struct nft_expr
         ...
    struct nft_rule_dp (is_last=1) // Trailer

The trailer, struct nft_rule_dp (is_last=1), is not accounted for in size,
so it can be located via start_addr + size.

Because the rcu_head is stored after 'start+size' as well this means the
is_last trailer is *aliased* to the rcu_head (struct nft_rules_old).

This is harmless, because at this time the nft_do_chain function never
evaluates/accesses the trailer, it only checks the address boundary:

        for (; rule < last_rule; rule = nft_rule_next(rule)) {
...

But this way the last_rule address has to be stashed in the jump
structure to restore it after returning from a chain.

nft_do_chain stack usage has become way too big, so put it on a diet.

Without this patch is impossible to use
        for (; !rule->is_last; rule = nft_rule_next(rule)) {

... because on free, the needed update of the rcu_head will clobber the
nft_rule_dp is_last bit.

Furthermore, also stash the chain pointer in the trailer, this allows
to recover the original chain structure from nf_tables_trace infra
without a need to place them in the jump struct.

After this patch it is trivial to diet the jump stack structure,
done in the next two patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22 01:39:40 +02:00
Florian Westphal 2b99ef22e0 bpf: add test_run support for netfilter program type
add glue code so a bpf program can be run using userspace-provided
netfilter state and packet/skb.

Default is to use ipv4:output hook point, but this can be overridden by
userspace.  Userspace provided netfilter state is restricted, only hook and
protocol families can be overridden and only to ipv4/ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-7-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:50 -07:00
Florian Westphal 0bdc6da88f netfilter: disallow bpf hook attachment at same priority
This is just to avoid ordering issues between multiple bpf programs,
this could be removed later in case it turns out to be too cautious.

bpf prog could still be shared with non-bpf hook, otherwise we'd have to
make conntrack hook registration fail just because a bpf program has
same priority.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-5-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:14 -07:00
Florian Westphal 506a74db7e netfilter: nfnetlink hook: dump bpf prog id
This allows userspace ("nft list hooks") to show which bpf program
is attached to which hook.

Without this, user only knows bpf prog is attached at prio
x, y, z at INPUT and FORWARD, but can't tell which program is where.

v4: kdoc fixups (Simon Horman)

Link: https://lore.kernel.org/bpf/ZEELzpNCnYJuZyod@corigine.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-4-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:14 -07:00
Florian Westphal fd9c663b9a bpf: minimal support for programs hooked into netfilter framework
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs
that will be invoked via the NF_HOOK() points in the ip stack.

Invocation incurs an indirect call.  This is not a necessity: Its
possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the
program invocation with the same method already done for xdp progs.

This isn't done here to keep the size of this chunk down.

Verifier restricts verdicts to either DROP or ACCEPT.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:14 -07:00
Florian Westphal 84601d6ee6 bpf: add bpf_link support for BPF_NETFILTER programs
Add bpf_link support skeleton.  To keep this reviewable, no bpf program
can be invoked yet, if a program is attached only a c-stub is called and
not the actual bpf program.

Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig.

Uapi example usage:
	union bpf_attr attr = { };

	attr.link_create.prog_fd = progfd;
	attr.link_create.attach_type = 0; /* unused */
	attr.link_create.netfilter.pf = PF_INET;
	attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN;
	attr.link_create.netfilter.priority = -128;

	err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr));

... this would attach progfd to ipv4:input hook.

Such hook gets removed automatically if the calling program exits.

BPF_NETFILTER program invocation is added in followup change.

NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it
allows to tell userspace which program is attached at the given hook
when user runs 'nft hook list' command rather than just the priority
and not-very-helpful 'this hook runs a bpf prog but I can't tell which
one'.

Will also be used to disallow registration of two bpf programs with
same priority in a followup patch.

v4: arm32 cmpxchg only supports 32bit operand
    s/prio/priority/
v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if
    more use cases pop up (arptables, ebtables, netdev ingress/egress etc).

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:14 -07:00
Jakub Kicinski f9bcdcec3b netfilter pull request
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmRCa3oACgkQ1V2XiooU
 IOTw2BAAsMQ3tjhZqvQ4zxZbb770n9qawW5vR1Pjq/C63x07qQiXZf+VL41Ff77s
 6acsyD5vYod3SDTR7Flx1xxFEgrQ4i7/81Jjt4VFqFmD3owKH6lm9R3H9Ae3v//5
 uDDNrbVDwjP4ZPZx96A1Z2SLwlo8+K/IJb98rLlS2v/8IjvPZMy17Oh9/FThsxNw
 LXuNuK6HTd+s2MkT8pUv3QWb/20Sb/ZVOEY6cUx2mD8DedmZBtUbzhisnnIVA+AX
 tP/VK3cPW/PKoNCiDwXMnqpgKUk31L2fpaTGHW0CWsDq0qIEZkmEMxN7WWFMVK/G
 7cq//ZGuQU+O09sM1rG51ImjG+2QISveUAKu99kb/mnlRtxNIhfQkWQDuO9LFqxI
 kd31C2m4oJwsmWfRvitSmKmQs6i7Js+PL25FqxAXxq6VnDTG4Cg9ryBvicRkGhvO
 +Ks2syeWflSQ9NVWWad30NsDyogQ0xy+7Lk/QIzb0hEWR0vGDWfyHNgu2z3QtQjd
 ftcAw6u9LtLYV/01XFOEYMptpi8Ecdot8+rX4hz7NSPNQm1WnpbEinT4sBoDGNxe
 9PoByIJ9lBeQgpWlbe7PTXBSIYF6p8gXg44N/LOpmaUXGs/h9IrNLXoakGlljcH8
 uYLiHOh3lzgwY5Ex+UAMJlAjWoGVIVKyeRf08Bz4PBG5jBL30Qk=
 =ck6R
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Set on IPS_CONFIRMED before change_status() otherwise EBUSY is
   bogusly hit. This bug was introduced in the 6.3 release cycle.

2) Fix nfnetlink_queue conntrack support: Set/dump timeout
   accordingly for unconfirmed conntrack entries. Make sure this
   is done after IPS_CONFIRMED is set on. This is an old bug, it
   happens since the introduction of this feature.

* tag 'nf-23-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: conntrack: fix wrong ct->timeout value
  netfilter: conntrack: restore IPS_CONFIRMED out of nf_conntrack_hash_check_insert()
====================

Link: https://lore.kernel.org/r/20230421105700.325438-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 07:55:41 -07:00
Jakub Kicinski ca28896580 wireless-next patches for v6.4
Most likely the last -next pull request for v6.4. We have changes all
 over. rtw88 now supports SDIO bus and iwlwifi continues to work on
 Wi-Fi 7 support. Not much stack changes this time.
 
 Major changes:
 
 cfg80211/mac80211
 
 * fix some Fine Time Measurement (FTM) frames not being bufferable
 
 * flush frames before key removal to avoid potential unencrypted
   transmission depending on the hardware design
 
 iwlwifi
 
 * preparation for Wi-Fi 7 EHT and multi-link support
 
 rtw88
 
 * SDIO bus support
 
 * RTL8822BS, RTL8822CS and RTL8821CS SDIO chipset support
 
 rtw89
 
 * framework firmware backwards compatibility
 
 brcmfmac
 
 * Cypress 43439 SDIO support
 
 mt76
 
 * mt7921 P2P support
 
 * mt7996 mesh A-MSDU support
 
 * mt7996 EHT support
 
 * mt7996 coredump support
 
 wcn36xx
 
 * support for pronto v3 hardware
 
 ath11k
 
 * PCIe DeviceTree bindings
 
 * WCN6750: enable SAR support
 
 ath10k
 
 * convert DeviceTree bindings to YAML
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmRCaTURHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZvcRwf+NcLS4HbmqGZhBxl2LZVZ6AFCBM4ijDlO
 pxdMiC4UxT+UApY1/9YXo0VS97M7paDJH+R/g1HcTvvKURHCmsdhYHm+R1MH+/uD
 r8RfvJg4VtNnlUpsJh9jxt+e697KP15M7DF0sFlQzdIoTUl13Hp7YhI76zunAbAN
 u1FBcVVJiCcJWbLolMzqAeBMUWUEG+GtHF6Zn5kChVU/p1nmwJMPUG3Qvb61a7Yc
 BM1pQX8jQ8PBj+VrGPGvqX0BOdbxq0evauYScq2oTOhQ1fzTNWOsI1yI7AwApptR
 itwQ2t1UK/C/EWpvWIBSd0nit1uwSx0Zsu/nSZlbKbrvIFwd5XnfwQ==
 =Irrd
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.4

Most likely the last -next pull request for v6.4. We have changes all
over. rtw88 now supports SDIO bus and iwlwifi continues to work on
Wi-Fi 7 support. Not much stack changes this time.

Major changes:

cfg80211/mac80211
 - fix some Fine Time Measurement (FTM) frames not being bufferable
 - flush frames before key removal to avoid potential unencrypted
   transmission depending on the hardware design

iwlwifi
 - preparation for Wi-Fi 7 EHT and multi-link support

rtw88
 - SDIO bus support
 - RTL8822BS, RTL8822CS and RTL8821CS SDIO chipset support

rtw89
 - framework firmware backwards compatibility

brcmfmac
 - Cypress 43439 SDIO support

mt76
 - mt7921 P2P support
 - mt7996 mesh A-MSDU support
 - mt7996 EHT support
 - mt7996 coredump support

wcn36xx
 - support for pronto v3 hardware

ath11k
 - PCIe DeviceTree bindings
 - WCN6750: enable SAR support

ath10k
 - convert DeviceTree bindings to YAML

* tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (261 commits)
  wifi: rtw88: Update spelling in main.h
  wifi: airo: remove ISA_DMA_API dependency
  wifi: rtl8xxxu: Simplify setting the initial gain
  wifi: rtl8xxxu: Add rtl8xxxu_write{8,16,32}_{set,clear}
  wifi: rtl8xxxu: Don't print the vendor/product/serial
  wifi: rtw88: Fix memory leak in rtw88_usb
  wifi: rtw88: call rtw8821c_switch_rf_set() according to chip variant
  wifi: rtw88: set pkg_type correctly for specific rtw8821c variants
  wifi: rtw88: rtw8821c: Fix rfe_option field width
  wifi: rtw88: usb: fix priority queue to endpoint mapping
  wifi: rtw88: 8822c: add iface combination
  wifi: rtw88: handle station mode concurrent scan with AP mode
  wifi: rtw88: prevent scan abort with other VIFs
  wifi: rtw88: refine reserved page flow for AP mode
  wifi: rtw88: disallow PS during AP mode
  wifi: rtw88: 8822c: extend reserved page number
  wifi: rtw88: add port switch for AP mode
  wifi: rtw88: add bitmap for dynamic port settings
  wifi: rtw89: mac: use regular int as return type of DLE buffer request
  wifi: mac80211: remove return value check of debugfs_create_dir()
  ...
====================

Link: https://lore.kernel.org/r/20230421104726.800BCC433D2@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 07:35:51 -07:00
Jianfeng Tan dfc39d4026 net/packet: support mergeable feature of virtio
Packet sockets, like tap, can be used as the backend for kernel vhost.
In packet sockets, virtio net header size is currently hardcoded to be
the size of struct virtio_net_hdr, which is 10 bytes; however, it is not
always the case: some virtio features, such as mrg_rxbuf, need virtio
net header to be 12-byte long.

Mergeable buffers, as a virtio feature, is worthy of supporting: packets
that are larger than one-mbuf size will be dropped in vhost worker's
handle_rx if mrg_rxbuf feature is not used, but large packets
cannot be avoided and increasing mbuf's size is not economical.

With this virtio feature enabled by virtio-user, packet sockets with
hardcoded 10-byte virtio net header will parse mac head incorrectly in
packet_snd by taking the last two bytes of virtio net header as part of
mac header.
This incorrect mac header parsing will cause packet to be dropped due to
invalid ether head checking in later under-layer device packet receiving.

By adding extra field vnet_hdr_sz with utilizing holes in struct
packet_sock to record currently used virtio net header size and supporting
extra sockopt PACKET_VNET_HDR_SZ to set specified vnet_hdr_sz, packet
sockets can know the exact length of virtio net header that virtio user
gives.
In packet_snd, tpacket_snd and packet_recvmsg, instead of using
hardcoded virtio net header size, it can get the exact vnet_hdr_sz from
corresponding packet_sock, and parse mac header correctly based on this
information to avoid the packets being mistakenly dropped.

Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 12:01:58 +01:00
Ido Schimmel 160656d720 bridge: Allow setting per-{Port, VLAN} neighbor suppression state
Add a new bridge port attribute that allows user space to enable
per-{Port, VLAN} neighbor suppression. Example:

 # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
 false
 # bridge link set dev swp1 neigh_vlan_suppress on
 # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
 true
 # bridge link set dev swp1 neigh_vlan_suppress off
 # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
 false

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:50 +01:00
Ido Schimmel 83f6d60079 bridge: vlan: Allow setting VLAN neighbor suppression state
Add a new VLAN attribute that allows user space to set the neighbor
suppression state of the port VLAN. Example:

 # bridge -d -j -p vlan show dev swp1 vid 10 | jq '.[]["vlans"][]["neigh_suppress"]'
 false
 # bridge vlan set vid 10 dev swp1 neigh_suppress on
 # bridge -d -j -p vlan show dev swp1 vid 10 | jq '.[]["vlans"][]["neigh_suppress"]'
 true
 # bridge vlan set vid 10 dev swp1 neigh_suppress off
 # bridge -d -j -p vlan show dev swp1 vid 10 | jq '.[]["vlans"][]["neigh_suppress"]'
 false

 # bridge vlan set vid 10 dev br0 neigh_suppress on
 Error: bridge: Can't set neigh_suppress for non-port vlans.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:50 +01:00
Ido Schimmel 412614b145 bridge: Add per-{Port, VLAN} neighbor suppression data path support
When the bridge is not VLAN-aware (i.e., VLAN ID is 0), determine if
neighbor suppression is enabled on a given bridge port solely based on
the existing 'BR_NEIGH_SUPPRESS' flag.

Otherwise, if the bridge is VLAN-aware, first check if per-{Port, VLAN}
neighbor suppression is enabled on the given bridge port using the
'BR_NEIGH_VLAN_SUPPRESS' flag. If so, look up the VLAN and check whether
it has neighbor suppression enabled based on the per-VLAN
'BR_VLFLAG_NEIGH_SUPPRESS_ENABLED' flag.

If the bridge is VLAN-aware, but the bridge port does not have
per-{Port, VLAN} neighbor suppression enabled, then fallback to
determine neighbor suppression based on the 'BR_NEIGH_SUPPRESS' flag.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:50 +01:00
Ido Schimmel 3aca683e06 bridge: Encapsulate data path neighbor suppression logic
Currently, there are various places in the bridge data path that check
whether neighbor suppression is enabled on a given bridge port.

As a preparation for per-{Port, VLAN} neighbor suppression, encapsulate
this logic in a function and pass the VLAN ID of the packet as an
argument.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:50 +01:00
Ido Schimmel 6be42ed0a5 bridge: Take per-{Port, VLAN} neighbor suppression into account
The bridge driver gates the neighbor suppression code behind an internal
per-bridge flag called 'BROPT_NEIGH_SUPPRESS_ENABLED'. The flag is set
when at least one bridge port has neighbor suppression enabled.

As a preparation for per-{Port, VLAN} neighbor suppression, make sure
the global flag is also set if per-{Port, VLAN} neighbor suppression is
enabled. That is, when the 'BR_NEIGH_VLAN_SUPPRESS' flag is set on at
least one bridge port.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:49 +01:00
Ido Schimmel a714e3ec23 bridge: Add internal flags for per-{Port, VLAN} neighbor suppression
Add two internal flags that will be used to enable / disable per-{Port,
VLAN} neighbor suppression:

1. 'BR_NEIGH_VLAN_SUPPRESS': A per-port flag used to indicate that
per-{Port, VLAN} neighbor suppression is enabled on the bridge port.
When set, 'BR_NEIGH_SUPPRESS' has no effect.

2. 'BR_VLFLAG_NEIGH_SUPPRESS_ENABLED': A per-VLAN flag used to indicate
that neighbor suppression is enabled on the given VLAN.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:49 +01:00
Ido Schimmel e408336a69 bridge: Pass VLAN ID to br_flood()
Subsequent patches are going to add per-{Port, VLAN} neighbor
suppression, which will require br_flood() to potentially suppress ARP /
NS packets on a per-{Port, VLAN} basis.

As a preparation, pass the VLAN ID of the packet as another argument to
br_flood().

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:49 +01:00
Ido Schimmel 013a7ce81d bridge: Reorder neighbor suppression check when flooding
The bridge does not flood ARP / NS packets for which a reply was sent to
bridge ports that have neighbor suppression enabled.

Subsequent patches are going to add per-{Port, VLAN} neighbor
suppression, which is going to make it more expensive to check whether
neighbor suppression is enabled since a VLAN lookup will be required.

Therefore, instead of unnecessarily performing this lookup for every
packet, only perform it for ARP / NS packets for which a reply was sent.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:25:49 +01:00
Emeel Hakim abff3e5e29 vlan: Add MACsec offload operations for VLAN interface
Add support for MACsec offload operations for VLAN driver
to allow offloading MACsec when VLAN's real device supports
Macsec offload by forwarding the offload request to it.

Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:22:14 +01:00
Xin Long 2ab399a931 sctp: delete the nested flexible array hmac
This patch deletes the flexible-array hmac[] from the structure
sctp_authhdr to avoid some sparse warnings:

  # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/
  net/sctp/auth.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h):
  ./include/linux/sctp.h:735:29: warning: nested flexible array

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:19:30 +01:00
Xin Long f97278ff34 sctp: delete the nested flexible array peer_init
This patch deletes the flexible-array peer_init[] from the structure
sctp_cookie to avoid some sparse warnings:

  # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/
  net/sctp/sm_make_chunk.c: note: in included file (through include/net/sctp/sctp.h):
  ./include/net/sctp/structs.h:1588:28: warning: nested flexible array
  ./include/net/sctp/structs.h:343:28: warning: nested flexible array

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:19:30 +01:00
Xin Long 9789c1c661 sctp: delete the nested flexible array variable
This patch deletes the flexible-array variable[] from the structure
sctp_sackhdr and sctp_errhdr to avoid some sparse warnings:

  # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/
  net/sctp/sm_statefuns.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h):
  ./include/linux/sctp.h:451:28: warning: nested flexible array
  ./include/linux/sctp.h:393:29: warning: nested flexible array

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:19:29 +01:00
Xin Long 73175a0429 sctp: delete the nested flexible array skip
This patch deletes the flexible-array skip[] from the structure
sctp_ifwdtsn/fwdtsn_hdr to avoid some sparse warnings:

  # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/
  net/sctp/stream_interleave.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h):
  ./include/linux/sctp.h:611:32: warning: nested flexible array
  ./include/linux/sctp.h:628:33: warning: nested flexible array

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:19:29 +01:00
Xin Long add7370a39 sctp: delete the nested flexible array params
This patch deletes the flexible-array params[] from the structure
sctp_inithdr, sctp_addiphdr and sctp_reconf_chunk to avoid some
sparse warnings:

  # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/
  net/sctp/input.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h):
  ./include/linux/sctp.h:278:29: warning: nested flexible array
  ./include/linux/sctp.h:675:30: warning: nested flexible array

This warning is reported if a structure having a flexible array
member is included by other structures.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21 08:19:29 +01:00
Leon Romanovsky ec8f32ad9a xfrm: Fix leak of dev tracker
At the stage of direction checks, the netdev reference tracker is
already initialized, but released with wrong *_put() call.

Fixes: 919e43fad5 ("xfrm: add an interface to offload policy")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-04-21 08:54:04 +02:00
Leon Romanovsky 94b95dfaa8 xfrm: release all offloaded policy memory
Failure to add offloaded policy will cause to the following
error once user will try to reload driver.

Unregister_netdevice: waiting for eth3 to become free. Usage count = 2

This was caused by xfrm_dev_policy_add() which increments reference
to net_device. That reference was supposed to be decremented
in xfrm_dev_policy_free(). However the latter wasn't called.

 unregister_netdevice: waiting for eth3 to become free. Usage count = 2
 leaked reference.
  xfrm_dev_policy_add+0xff/0x3d0
  xfrm_policy_construct+0x352/0x420
  xfrm_add_policy+0x179/0x320
  xfrm_user_rcv_msg+0x1d2/0x3d0
  netlink_rcv_skb+0xe0/0x210
  xfrm_netlink_rcv+0x45/0x50
  netlink_unicast+0x346/0x490
  netlink_sendmsg+0x3b0/0x6c0
  sock_sendmsg+0x73/0xc0
  sock_write_iter+0x13b/0x1f0
  vfs_write+0x528/0x5d0
  ksys_write+0x120/0x150
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: 919e43fad5 ("xfrm: add an interface to offload policy")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-04-21 08:51:48 +02:00
Johannes Berg baa951a1c1 mac80211: use the new drop reasons infrastructure
It can be really hard to analyse or debug why packets are
going missing in mac80211, so add the needed infrastructure
to use use the new per-subsystem drop reasons.

We actually use two drop reason subsystems here because of
the different handling of frames that are dropped but still
go to monitor for old versions of hostapd, and those that
are just completely unusable (e.g. crypto failed.)

Annotate a few reasons here just to illustrate this, we'll
need to go through and annotate more of them later.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 20:20:49 -07:00
Johannes Berg 071c0fc6fb net: extend drop reasons for multiple subsystems
Extend drop reasons to make them usable by subsystems
other than core by reserving the high 16 bits for a
new subsystem ID, of which 0 of course is used for the
existing reasons immediately.

To still be able to have string reasons, restructure
that code a bit to make the loopup under RCU, the only
user of this (right now) is drop_monitor.

Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 20:20:49 -07:00
Mahesh Bandewar 7ab75456be ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6
ICMPv6 error packets are not sent to the anycast destinations and this
prevents things like traceroute from working. So create a setting similar
to ECHO when dealing with Anycast sources (icmpv6_echo_ignore_anycast).

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20230419013238.2691167-1-maheshb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 20:07:50 -07:00
Vladimir Oltean 35b288d6e3 net: ethtool: mm: sanitize some UAPI configurations
The verify-enabled boolean (ETHTOOL_A_MM_VERIFY_ENABLED) was intended to
be a sub-setting of tx-enabled (ETHTOOL_A_MM_TX_ENABLED). IOW, MAC Merge
TX can be enabled with or without verification, but verification with TX
disabled makes no sense.

The pmac-enabled boolean (ETHTOOL_A_MM_PMAC_ENABLED) was intended to be
a global toggle from an API perspective, whereas tx-enabled just handles
the TX direction. IOW, the pMAC can be enabled with or without TX, but
it doesn't make sense to enable TX if the pMAC is not enabled.

Add two checks which sanitize and reject these invalid cases.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 20:03:21 -07:00