Commit Graph

74468 Commits

Author SHA1 Message Date
David S. Miller 5fc43ce03b ipsec-2023-08-15

Merge tag 'ipsec-2023-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
1) Fix a slab-out-of-bounds read in xfrm_address_filter.
   From Lin Ma.

2) Fix the pfkey sadb_x_filter validation.
   From Lin Ma.

3) Use the correct nla_policy structure for XFRMA_SEC_CTX.
   From Lin Ma.

4) Fix warnings triggerable by bad packets in the encap functions.
   From Herbert Xu.

5) Fix some slab-use-after-free in decode_session6.
   From Zhengchao Shao.

6) Fix a possible NULL pointer dereference in xfrm_update_ae_params.
   From Lin Ma.

7) Add a forgotten nla_policy for XFRMA_MTIMER_THRESH.
   From Lin Ma.

8) Don't leak offloaded policies.
   From Leon Romanovsky.

9) Delete also the offloading part of an acquire state.
   From Leon Romanovsky.

Please pull or let me know if there are problems.
2023-08-16 08:57:41 +01:00
Jakub Kicinski 956db0a13b net: warn about attempts to register negative ifindex
Since the xarray changes we mix returning valid ifindex and negative
errno in a single int returned from dev_index_reserve(). This depends
on the fact that ifindexes can't be negative. Otherwise we may insert
into the xarray and return a very large negative value. This in turn
may break ERR_PTR().

OvS is susceptible to this problem and lacking validation (fix posted
separately for net).

Reject negative ifindex explicitly. Add a warning because the input
validation is better handled by the caller.
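
A minimal user-space sketch of the pattern described above, in which one int carries either a valid index or a negative errno. The names index_reserve and next_free_index are hypothetical, and the allocator is a plain counter rather than the kernel's xarray:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical allocator state; not the kernel's xarray. */
static int next_free_index = 1;

/*
 * Return a usable index (> 0) or a negative errno through the same int.
 * requested == 0 means "allocate one for me".
 */
int index_reserve(int requested)
{
	int idx;

	/* A negative request could not be told apart from an error code. */
	if (requested < 0)
		return -EINVAL;

	if (requested > 0)
		return requested;	/* sketch: no collision tracking */

	if (next_free_index == INT_MAX)
		return -ENOSPC;

	idx = next_free_index++;
	assert(idx > 0);		/* success values stay strictly positive */
	return idx;
}
```

Because success values are always positive here, a caller can treat any negative return as an error without ambiguity.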

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 19:18:34 -07:00
Jakub Kicinski a552bfa16b net: openvswitch: reject negative ifindex
Recent changes in net-next (commit 759ab1edb5 ("net: store netdevs
in an xarray")) refactored the handling of pre-assigned ifindexes
and let syzbot surface a latent problem in ovs. ovs does not validate
ifindex, making it possible to create netdev ports with negative
ifindex values. It's easy to repro with YNL:

$ ./cli.py --spec netlink/specs/ovs_datapath.yaml \
         --do new \
	 --json '{"upcall-pid": 1, "name":"my-dp"}'
$ ./cli.py --spec netlink/specs/ovs_vport.yaml \
	 --do new \
	 --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}'

$ ip link show
-65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff
...

Validate the inputs. Now the second command correctly returns:

$ ./cli.py --spec netlink/specs/ovs_vport.yaml \
	 --do new \
	 --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}'

lib.ynl.NlError: Netlink error: Numerical result out of range
nl_len = 108 (92) nl_flags = 0x300 nl_type = 2
	error: -34	extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'}

Accept 0 since it used to be silently ignored.
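
The arithmetic behind the repro can be sketched in user space: 4294901760 is 0xFFFF0000, which reads as -65536 when the same 32 bits are interpreted as a signed int. The helper names below are hypothetical and the range check only illustrates the idea of rejecting values above INT32_MAX; it is not the netlink policy code itself:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32 bits of an unsigned attribute value as a signed int. */
int32_t reinterpret_as_int(uint32_t v)
{
	int32_t s;

	memcpy(&s, &v, sizeof(s));
	return s;
}

/*
 * Accept only values that fit in a non-negative int; 0 is allowed.
 * Returns 0 on success or -ERANGE, matching the "-34" seen in the log.
 */
int ifindex_from_attr(uint32_t v, int *out)
{
	if (v > (uint32_t)INT32_MAX)
		return -ERANGE;

	*out = (int)v;
	assert(*out >= 0);
	return 0;
}
```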

Fixes: 54c4ef34c4 ("openvswitch: allow specifying ifindex of new interfaces")
Reported-by: syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 19:07:52 -07:00
Ido Schimmel db1428f66a nexthop: Do not increment dump sentinel at the end of the dump
The nexthop and nexthop bucket dump callbacks previously returned a
positive return code even when the dump was complete, prompting the core
netlink code to invoke the callback again, until returning zero.

Zero was only returned by these callbacks when no information was filled
in the provided skb, which was achieved by incrementing the dump
sentinel at the end of the dump beyond the ID of the last nexthop.

This is no longer necessary as when the dump is complete these callbacks
return zero.

Remove the unnecessary increment.
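
As an illustration only, a resumable dump with a sentinel might look like the sketch below. dump_ids, struct dump_ctx and the return convention (1 for "call again", 0 for "complete") are assumptions made for this example, not the nexthop code; the IDs are assumed to be sorted in ascending order:

```c
#include <assert.h>
#include <stddef.h>

struct dump_ctx {
	unsigned int idx;	/* sentinel: lowest ID not yet dumped */
};

/*
 * Copy IDs >= ctx->idx into out[], at most cap entries per call.
 * Returns 1 if the buffer filled up before the end, 0 when complete.
 */
int dump_ids(const unsigned int *ids, size_t n, struct dump_ctx *ctx,
	     unsigned int *out, size_t cap, size_t *written)
{
	size_t i;

	assert(ctx != NULL && written != NULL);
	*written = 0;

	for (i = 0; i < n; i++) {
		if (ids[i] < ctx->idx)
			continue;		/* dumped by an earlier call */
		if (*written == cap) {
			ctx->idx = ids[i];	/* resume here next time */
			return 1;		/* buffer full, more to dump */
		}
		out[(*written)++] = ids[i];
	}

	/* Complete: returning 0 ends the dump, so the sentinel is left alone. */
	return 0;
}
```

Since completion is signalled by the return value, there is no need to push the sentinel past the last ID to force an empty final pass.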

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 18:54:53 -07:00
Ido Schimmel 23ab9324fd nexthop: Simplify nexthop bucket dump
Before commit f10d3d9df4 ("nexthop: Make nexthop bucket dump more
efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return
code for each resilient nexthop group whose buckets it dumped,
regardless of whether it encountered an error.

This meant that the sentinel ('dd->ctx->nh.idx') used by the function
that walked the different nexthops could not be used as a sentinel for
the bucket dump, as otherwise buckets from the same group would be
dumped over and over again.

This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx')
that was incremented by rtm_dump_nexthop_bucket_nh() after successfully
dumping all the buckets from a given group.

After the previously mentioned commit this sentinel is no longer
necessary since the function no longer returns a non-zero return code
when successfully dumping all the buckets from a given group.

Remove this sentinel and simplify the code.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 18:54:52 -07:00
Andrea Mayer 7458575a07 seg6: add NEXT-C-SID support for SRv6 End.X behavior
The NEXT-C-SID mechanism described in [1] offers the possibility of
encoding several SRv6 segments within a single 128 bit SID address. Such
a SID address is called a Compressed SID (C-SID) container. In this way,
the length of the SID List can be drastically reduced.

A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
logically structured in three main blocks: i) Locator-Block; ii)
Locator-Node Function; iii) Argument.

                        C-SID container
+------------------------------------------------------------------+
|     Locator-Block      |Loc-Node|            Argument            |
|                        |Function|                                |
+------------------------------------------------------------------+
<--------- B -----------> <- NF -> <------------- A --------------->

   (i) The Locator-Block can be any IPv6 prefix available to the provider;

  (ii) The Locator-Node Function represents the node and the function to
       be triggered when a packet is received on the node;

 (iii) The Argument carries the remaining C-SIDs in the current C-SID
       container.

This patch leverages the NEXT-C-SID mechanism previously introduced in the
Linux SRv6 subsystem [2] to support SID compression capabilities in the
SRv6 End.X behavior [3].
An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior
but it is capable of processing the compressed SID List encoded in C-SID
containers.

An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support
user-provided Locator-Block and Locator-Node Function lengths. In this
implementation, such lengths must be evenly divisible by 8 (i.e. must be
byte-aligned), otherwise the kernel informs the user about invalid
values with a meaningful error code and message through netlink_ext_ack.

If Locator-Block and/or Locator-Node Function lengths are not provided
by the user during configuration of an SRv6 End.X behavior instance with
NEXT-C-SID flavor, the kernel will choose their default values i.e.,
32-bit Locator-Block and 16-bit Locator-Node Function.
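
To make the container layout concrete, here is a user-space sketch of the byte-aligned length check and of advancing the Argument by one C-SID. The function names are hypothetical and the checks are this sketch's own; the kernel's validation may differ in detail:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SID_OCTETS 16

/* Both lengths are in bits and must be non-zero multiples of 8. */
int csid_check_lengths(unsigned int blk_bits, unsigned int fnc_bits)
{
	if (blk_bits == 0 || fnc_bits == 0)
		return -EINVAL;
	if (blk_bits % 8 || fnc_bits % 8)
		return -EINVAL;
	if (blk_bits + fnc_bits > 8 * SID_OCTETS)
		return -EINVAL;
	return 0;
}

/*
 * Shift the Argument left over the Locator-Node Function and zero the
 * freed tail, so the next C-SID becomes the active one.
 */
void csid_advance(uint8_t sid[SID_OCTETS], unsigned int blk_bits,
		  unsigned int fnc_bits)
{
	size_t blk = blk_bits / 8;
	size_t fnc = fnc_bits / 8;
	size_t arg = SID_OCTETS - blk - fnc;

	assert(csid_check_lengths(blk_bits, fnc_bits) == 0);
	memmove(sid + blk, sid + blk + fnc, arg);
	memset(sid + SID_OCTETS - fnc, 0, fnc);
}
```

With the default 32-bit Locator-Block and 16-bit Locator-Node Function, a container fcbb:bbbb:0100:0200:0300:: becomes fcbb:bbbb:0200:0300:: after one advance.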

[1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
[2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/
[3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 18:51:47 -07:00
Joel Granados c899710fe7 networking: Update to register_net_sysctl_sz
Move from register_net_sysctl to register_net_sysctl_sz for all the
networking related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.

We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.

An additional size function was added to the following files in order to
calculate the size of an array that is defined in another file:
    include/net/ipv6.h
    net/ipv6/icmp.c
    net/ipv6/route.c
    net/ipv6/sysctl_net_ipv6.c

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:18 -07:00
Joel Granados 385a5dc9e5 netfilter: Update to register_net_sysctl_sz
Move from register_net_sysctl to register_net_sysctl_sz for all the
netfilter related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.

We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.

Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Joel Granados 7737e46d9d ax.25: Update to register_net_sysctl_sz
Move from register_net_sysctl to register_net_sysctl_sz and pass the
ARRAY_SIZE of the ctl_table array that was used to create the table
variable. We need to move to the new function in preparation for when we
change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro.
Failing to do so would erroneously allow ARRAY_SIZE() to be called on a
pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have
migrated all the relevant net sysctl registering functions to
register_net_sysctl_sz in subsequent commits.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Joel Granados 95d4977876 sysctl: Add size to register_net_sysctl function
This commit adds size to the register_net_sysctl indirection function to
facilitate the removal of the sentinel elements (last empty markers)
from the ctl_table arrays. Though we don't actually remove any sentinels
in this commit, register_net_sysctl* now has the capability of
forwarding table_size for when that happens.

We create a new function register_net_sysctl_sz with an extra size
argument. A macro replaces the existing register_net_sysctl. The size in
the macro is SIZE_MAX instead of ARRAY_SIZE to avoid compilation errors
while we systematically migrate to register_net_sysctl_sz. Will change
to ARRAY_SIZE in subsequent commits.

Care is taken to add table_size to the stopping criteria in such a way
that when we remove the empty sentinel element, it will continue
stopping in the last element of the ctl_table array.
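
A small sketch of such a stopping criterion, using a stand-in struct rather than struct ctl_table; count_entries is a hypothetical helper. The loop ends at table_size or at the sentinel, whichever comes first, so it behaves the same before and after the sentinel is dropped:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entry {
	const char *name;	/* NULL marks the sentinel element */
};

/* Count usable entries, bounded by both table_size and the sentinel. */
size_t count_entries(const struct entry *table, size_t table_size)
{
	size_t n = 0;

	assert(table != NULL);
	while (n < table_size && table[n].name != NULL)
		n++;
	return n;
}
```

Passing SIZE_MAX as table_size makes the sentinel the only effective bound, which is why it is a safe placeholder while callers are still being migrated.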

Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Joel Granados 9edbfe92a0 sysctl: Add size to register_sysctl
This commit adds table_size to register_sysctl in preparation for the
removal of the sentinel elements in the ctl_table arrays (last empty
markers). And though we do *not* remove any sentinels in this commit, we
set things up by either passing the table_size explicitly or using
ARRAY_SIZE on the ctl_table arrays.

We replace the register_sysctl function with a macro that will add the
ARRAY_SIZE to the new register_sysctl_sz function. In this way the
callers that are already using an array of ctl_table structs do not
change. For the callers that pass a ctl_table array pointer, we pass the
table_size to register_sysctl_sz instead of the macro.
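
The macro-plus-_sz split can be sketched as follows. register_table, register_table_sz and last_table_size are hypothetical names for this example. Note that this plain ARRAY_SIZE silently computes a wrong value for a pointer, whereas the kernel's version adds a must-be-array check that fails the build:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct table_entry {
	const char *name;
};

/* Records the size that was forwarded, so the effect can be observed. */
static size_t last_table_size;

int register_table_sz(const char *path, const struct table_entry *table,
		      size_t table_size)
{
	assert(path != NULL && table != NULL);
	last_table_size = table_size;
	return 0;
}

/* Callers holding a real array keep the short form; the size is implied. */
#define register_table(path, table) \
	register_table_sz(path, table, ARRAY_SIZE(table))
```

Callers that only hold a pointer to the table have to call register_table_sz directly and pass the size themselves.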

Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Joel Granados bff97cf11b sysctl: Add a size arg to __register_sysctl_table
We make these changes in order to prepare __register_sysctl_table and
its callers for when we remove the sentinel element (empty element at
the end of ctl_table arrays). We don't actually remove any sentinels in
this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
is available when the removal occurs.

We add a table_size argument to __register_sysctl_table and adjust
callers, all of which pass ctl_table pointers and need an explicit call
to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
order to forward the size of the array pointer received from the network
register calls.

The new table_size argument does not yet have any effect in the
init_header call which is still dependent on the sentinel's presence.
table_size *does* however drive the `kzalloc` allocation in
__register_sysctl_table with no adverse effects as the allocated memory
is either one element greater than the calculated ctl_table array (for
the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
of the calculated ctl_table array (for the call from sysctl_net.c and
register_sysctl). This approach will allow us to "just" remove the
sentinel without further changes to __register_sysctl_table as
table_size will represent the exact size for all the callers at that
point.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15 15:26:17 -07:00
Pablo Neira Ayuso 23185c6aed netfilter: nft_dynset: disallow object maps
Do not allow inserting elements from the datapath into object maps.

Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Pablo Neira Ayuso 02c6c24402 netfilter: nf_tables: GC transaction race with netns dismantle
Use maybe_get_net() since GC workqueue might race with netns exit path.

Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Pablo Neira Ayuso 6a33d8b73d netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path
The netlink event path is missing a synchronization point with GC
transactions. Add a GC sequence number update to the netns release path and
the netlink event path; any GC transaction that loses the race is discarded.

Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Sishuai Gong 5310760af1 ipvs: fix racy memcpy in proc_do_sync_threshold
When two threads run proc_do_sync_threshold() in parallel,
data races could happen between the two memcpy():

Thread-1			Thread-2
memcpy(val, valp, sizeof(val));
				memcpy(valp, val, sizeof(val));

This race might mess up the (struct ctl_table *) table->data,
so we add a mutex lock to serialize them.
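
A user-space analogue of the serialization, using a pthread mutex in place of a kernel mutex. shared_threshold and update_threshold are hypothetical names, and the validity rule (both values non-negative, first below second) is an assumption of this sketch:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Plays the role of the data behind table->data. */
static int shared_threshold[2] = {3, 50};
static pthread_mutex_t threshold_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy-validate-copy under one lock so two writers cannot interleave. */
int update_threshold(const int newval[2])
{
	int val[2];
	int rc = 0;

	assert(newval != NULL);

	pthread_mutex_lock(&threshold_lock);
	memcpy(val, shared_threshold, sizeof(val));	/* snapshot */
	val[0] = newval[0];
	val[1] = newval[1];
	if (val[0] < 0 || val[1] < 0 || val[0] >= val[1])
		rc = -EINVAL;
	else
		memcpy(shared_threshold, val, sizeof(val));
	pthread_mutex_unlock(&threshold_lock);

	return rc;
}
```

Holding the lock across both memcpy() calls is what prevents one thread from overwriting the shared pair with a stale snapshot taken by another.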

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/
Signed-off-by: Sishuai Gong <sishuai.system@gmail.com>
Acked-by: Simon Horman <horms@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Xin Long 9bfab6d23a netfilter: set default timeout to 3 secs for sctp shutdown send and recv state
SCTP uses the same timer (the T2 timer) for SHUTDOWN and SHUTDOWN_ACK
retransmission. However, in sctp conntrack the default timeout value for
the SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs, while it is 300
msecs for the SCTP_CONNTRACK_SHUTDOWN_SEND/RECV states.

As Paolo Valerio noticed, this might cause unwanted expiration of the ct
entry. In my test, with a 1s tc netem delay set on the NAT path, after the
SHUTDOWN is sent, the sctp ct entry enters the SCTP_CONNTRACK_SHUTDOWN_SEND
state. However, because the 300ms timeout is too short, by the time the
SHUTDOWN_ACK is sent back from the peer, the sctp ct entry has expired and
been deleted, and the SHUTDOWN_ACK has to be dropped.

Also, it is confusing that these two sysctl options always show 0, because
all timeout values are reported in seconds:

  net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0
  net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0

This patch fixes it by also using 3 secs for the sctp shutdown send and recv
states in sctp conntrack, which is also the RTO.Initial value in the SCTP
protocol.
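
The two symptoms reduce to simple arithmetic, sketched here with hypothetical helpers working in milliseconds: a 300 ms timeout truncates to 0 when reported in whole seconds, and it has already elapsed after a 1 s path delay.

```c
#include <assert.h>
#include <stdbool.h>

/* Integer division drops the fraction, so sub-second values report as 0. */
unsigned int ms_to_whole_secs(unsigned int ms)
{
	return ms / 1000;
}

/* An entry is gone once the elapsed time reaches its timeout. */
bool entry_expired(unsigned int timeout_ms, unsigned int elapsed_ms)
{
	assert(timeout_ms > 0);	/* a zero timeout is not meaningful here */
	return elapsed_ms >= timeout_ms;
}
```

With a 3000 ms timeout both problems disappear: the value reports as 3 and the entry survives a 1000 ms delay.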

Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV
was probably used for a rare scenario where SHUTDOWN is sent on 1st path
but SHUTDOWN_ACK is replied on 2nd path, then a new connection started
immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV
to CLOSE when receiving INIT in the ORIGINAL direction.

Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Reported-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Florian Westphal 7845914f45 netfilter: nf_tables: don't fail inserts if duplicate has expired
nftables selftests fail:
run-tests.sh testcases/sets/0044interval_overlap_0
Expected: 0-2 . 0-3, got:
W: [FAILED]     ./testcases/sets/0044interval_overlap_0: got 1

Insertion must ignore duplicate but expired entries.

Moreover, there is a strange asymmetry in nft_pipapo_activate:

It refetches the current element, whereas the other ->activate callbacks
(bitmap, hash, rhash, rbtree) use elem->priv.
The same holds for .remove: other set implementations take elem->priv,
whereas nft_pipapo_remove fetches elem->priv and then does a relookup.
Remove this.

I suspect this was the reason for the change that prompted the
removal of the expired check in pipapo_get() in the first place,
but skipping expired elements there makes no sense to me: this helper
is used for normal get requests, insertions (duplicate check)
and the deactivate callback.

In the first two cases expired elements must be skipped.

For ->deactivate(), this gets called for DELSETELEM, so it
seems to me that expired elements should be skipped as well, i.e.
delete request should fail with -ENOENT error.

Fixes: 24138933b9 ("netfilter: nf_tables: don't skip expired elements during walk")
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Florian Westphal 90e5b3462e netfilter: nf_tables: deactivate catchall elements in next generation
When flushing, individual set elements are disabled in the next
generation via the ->flush callback.

Catchall elements are not disabled.  This is incorrect and may lead to
double-deactivations of catchall elements which then results in memory
leaks:

WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730
CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60
RIP: 0010:nft_map_deactivate+0x549/0x730
 [..]
 ? nft_map_deactivate+0x549/0x730
 nf_tables_delset+0xb66/0xeb0

(the warn is due to nft_use_dec() detecting underflow).

Fixes: aaa31047a6 ("netfilter: nftables: add catch-all set element support")
Reported-by: lonial con <kongln9170@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:15 +02:00
Florian Westphal 08713cb006 netfilter: nf_tables: fix kdoc warnings after gc rework
Jakub Kicinski says:
  We've got some new kdoc warnings here:
  net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc'
  net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc'
  include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set'

Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Fixes: f6c383b8c3 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:14 +02:00
Florian Westphal b9f052dc68 netfilter: nf_tables: fix false-positive lockdep splat
->abort invocation may cause splat on debug kernels:

WARNING: suspicious RCU usage
net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage!
[..]
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid
[..]
 lockdep_rcu_suspicious+0x1ad/0x260
 nft_pipapo_abort+0x145/0x180
 __nf_tables_abort+0x5359/0x63d0
 nf_tables_abort+0x24/0x40
 nfnetlink_rcv+0x1a0a/0x22c0
 netlink_unicast+0x73c/0x900
 netlink_sendmsg+0x7f0/0xc20
 ____sys_sendmsg+0x48d/0x760

Transaction mutex is held, so parallel updates are not possible.
Switch to _protected and check mutex is held for lockdep enabled builds.

Fixes: 212ed75dc5 ("netfilter: nf_tables: integrate pipapo into commit protocol")
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16 00:05:14 +02:00
Jakub Kicinski f946270d05 ethtool: netlink: always pass genl_info to .prepare_data
We had a number of bugs in the past because developers forgot
to fully test dumps, which pass NULL as info to .prepare_data.
.prepare_data implementations would try to access info->extack
leading to a null-deref.

Now that dumps and notifications can access struct genl_info
we can pass it in, and remove the info null checks.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski ec0e5b09b8 ethtool: netlink: simplify arguments to ethnl_default_parse()
Pass struct genl_info directly instead of its members.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski 0e19d3108a netdev-genl: use struct genl_info for reply construction
Use the just added APIs to make the code simpler.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski 5c670a010d genetlink: add a family pointer to struct genl_info
Having family in struct genl_info is quite useful. It cuts
down the number of arguments which need to be passed to
helpers which already take struct genl_info.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski 7288dd2fd4 genetlink: use attrs from struct genl_info
Since dumps carry struct genl_info now, use the attrs pointer
from genl_info and remove the one in struct genl_dumpit_info.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:00:45 -07:00
Jakub Kicinski 9272af109f genetlink: add struct genl_info to struct genl_dumpit_info
Netlink GET implementations must currently juggle struct genl_info
and struct netlink_callback, depending on whether they were called
from doit or dumpit.

Add genl_info to the dump state and populate the fields.
This way implementations can simply pass struct genl_info around.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jakub Kicinski bffcc6882a genetlink: remove userhdr from struct genl_info
Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to the user header in struct genl_info
is overkill. Compute the header pointer at runtime.
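
Computing the pointer at runtime amounts to skipping the two fixed headers that precede the family's own header. The sketch below uses stand-in structs with the usual 16-byte netlink and 4-byte generic netlink header sizes; struct nl_hdr, struct genl_hdr and user_hdr are hypothetical names, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the netlink message header (16 bytes). */
struct nl_hdr {
	uint32_t len;
	uint16_t type;
	uint16_t flags;
	uint32_t seq;
	uint32_t pid;
};

/* Stand-in for the generic netlink header (4 bytes). */
struct genl_hdr {
	uint8_t cmd;
	uint8_t version;
	uint16_t reserved;
};

/* The user header, if a family has one, starts right after both headers. */
const void *user_hdr(const struct nl_hdr *nlh)
{
	assert(nlh != NULL);
	return (const uint8_t *)nlh + sizeof(struct nl_hdr) +
	       sizeof(struct genl_hdr);
}
```

Since the offset is a compile-time constant, deriving the pointer on demand costs little compared with storing it in every info struct.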

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jakub Kicinski fde9bd4a4d genetlink: make genl_info->nlhdr const
struct netlink_callback has a const nlh pointer, make the
pointer in struct genl_info const as well, to make copying
between the two easier.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jakub Kicinski 84817d8c60 genetlink: push conditional locking into dumpit/done
Add helpers which take/release the genl mutex based
on family->parallel_ops. Remove the separation between
handling of ops in locked and parallel families.
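
One way to picture such helpers in user space, with a pthread mutex standing in for the genl mutex. struct family, op_lock, op_unlock and lock_held are hypothetical names for this sketch:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct family {
	bool parallel_ops;	/* true: ops run without the global lock */
};

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_held;		/* observable state for the example */

/* Take the global lock only for families that are not parallel. */
void op_lock(const struct family *f)
{
	assert(f != NULL);
	if (!f->parallel_ops) {
		pthread_mutex_lock(&global_lock);
		lock_held = 1;
	}
}

/* Mirror of op_lock(): release only what was taken. */
void op_unlock(const struct family *f)
{
	assert(f != NULL);
	if (!f->parallel_ops) {
		lock_held = 0;
		pthread_mutex_unlock(&global_lock);
	}
}
```

With the condition inside the helpers, a single code path can serve both kinds of family instead of duplicating the dump handling.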

Future patches would make the duplicated code grow even more.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jason Xing e4dd0d3a2f net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
In a real workload, I encountered an issue that could cause the RTO
timer to retransmit the skb every 1ms with the linear option enabled. The
number of lost-retransmitted skbs can go up to 1000+ almost instantly.

The root cause is that if the icsk_rto happens to be zero in the 6th round
(which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
due to the changed calculation method in tcp_retransmit_timer() as follows:

icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);

With icsk_rto equal to zero, the line above evaluates to
icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0

Therefore, the timer keeps expiring almost immediately.

I read through RFC 6298 and found that the RTO value can be rounded
up to a certain minimum. In Linux this is TCP_RTO_MIN by default, which is
used as the lower bound in this patch, as suggested by Eric.
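
The stuck-at-zero behaviour and the effect of a lower bound can be sketched in user space. The millisecond constants and the helper names are assumptions of this example, not the kernel's definitions or its exact code:

```c
#include <assert.h>

#define RTO_MIN_MS 200u		/* assumed lower bound for this sketch */
#define RTO_MAX_MS 120000u	/* assumed upper bound for this sketch */

/* Exponential backoff: double the RTO, capped at the maximum. */
unsigned int rto_backoff(unsigned int rto_ms)
{
	unsigned int doubled = rto_ms << 1;

	return doubled < RTO_MAX_MS ? doubled : RTO_MAX_MS;
}

/* Clamp a computed RTO into [RTO_MIN_MS, RTO_MAX_MS]. */
unsigned int rto_clamp(unsigned int rto_ms)
{
	unsigned int out = rto_ms;

	if (out < RTO_MIN_MS)
		out = RTO_MIN_MS;
	if (out > RTO_MAX_MS)
		out = RTO_MAX_MS;
	assert(out >= RTO_MIN_MS && out <= RTO_MAX_MS);
	return out;
}
```

Doubling zero yields zero forever, while a value that was first clamped to the minimum backs off normally on later rounds.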

Fixes: 36e31b0af5 ("net: TCP thin linear timeouts")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-15 20:24:04 +01:00
Jeff Layton c96e2a695e sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsg
svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages,
but commit 5df5dd03a8 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather
then sendpage") dropped that part of the handling. Fix it by setting
the bv_offset of the first bvec.

Fixes: 5df5dd03a8 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-14 15:02:25 -04:00
Jiri Pirko 0149bca172 netlink: specs: devlink: extend health reporter dump attributes by port index
Allow the user to pass a port index for a health reporter dump request.

Re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko b03f13cb67 devlink: extend health reporter dump selector by port index
Introduce the possibility for a devlink object to expose the attributes it
supports for selection of dumped objects.

Use this in the health reporter to indicate that it supports port index
based selection of dump objects. Implement this selection mechanism in
devlink_nl_cmd_health_reporter_get_dump_one().

Example:
$ devlink health
pci/0000:08:00.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32769:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32770:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98304:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98305:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98306:
  reporter vnic
    state healthy error 0 recover 0

$ devlink health show pci/0000:08:00.0
pci/0000:08:00.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32769:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32770:
  reporter vnic
    state healthy error 0 recover 0

$ devlink health show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0

The last command is possible because of this patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko 34493336e7 netlink: specs: devlink: extend per-instance dump commands to accept instance attributes
Extend per-instance dump command definitions to accept instance
attributes. Allow parsing of devlink handle attributes so they could
be used for instance selection.

Re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko 4a1b5aa8b5 devlink: allow user to narrow per-instance dumps by passing handle attrs
For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know the port
handle for a specific SF, they need to dump all devlink ports on the host,
which does not scale well.

Allow user to pass devlink handle attributes alongside the dump command
and dump only objects which are under selected devlink instance.

Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false

$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko 833e479d33 devlink: remove converted commands from small ops
As the commands are already defined in split ops, remove them
from small ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko ddff283280 devlink: remove duplicate temporary netlink callback prototypes
Remove the duplicate temporary netlink callback prototype as the
generated ones are already in place.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko 7199c86247 netlink: specs: devlink: add commands that do per-instance dump
Add the definitions for the commands that do per-instance dump
and re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko 7d3c6fec61 devlink: pass flags as an arg of dump_one() callback
In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the
flags as an arg of dump_one() callback. Currently, it is always
NLM_F_MULTI.
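
A minimal sketch of how the flags argument might be composed for a full versus a narrowed dump. The two numeric values mirror the uapi netlink header and are repeated locally so the sketch builds on its own; dump_flags() and its "filtered" parameter are hypothetical names, not the devlink code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as in the uapi <linux/netlink.h>, repeated here so the sketch
 * compiles without kernel headers. */
#define NLM_F_MULTI          0x02
#define NLM_F_DUMP_FILTERED  0x20

/* Hypothetical helper: flags to hand to a dump_one() callback. */
static uint16_t dump_flags(bool filtered)
{
	uint16_t flags = NLM_F_MULTI;		/* every dump message is multipart */

	if (filtered)
		flags |= NLM_F_DUMP_FILTERED;	/* the user narrowed the dump */
	return flags;
}
```

Passing the flags down as an argument keeps the decision in one place, so each dump_one() implementation only forwards what it is given.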

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko 24c8e56d4f devlink: introduce dumpit callbacks for split ops
Introduce dumpit callbacks for generated split ops. Have them
as a thin wrapper around iteration function and allow to pass dump_one()
function pointer directly without need to store in devlink_cmd structs.

Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko 8fa995ad1f devlink: rename doit callbacks for per-instance dump commands
Rename netlink doit callback functions for the commands that
implement per-instance dump to match the generated names that are going
to be introduced in the follow-up patch.

Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko ee6d78ac28 devlink: introduce devlink_nl_pre_doit_port*() helper functions
Define port handling helpers that don't rely on internal_flags.
Have __devlink_nl_pre_doit() accept the flags as a function arg and
make devlink_nl_pre_doit() a wrapper helper function calling it.
Introduce new helpers devlink_nl_pre_doit_port() and
devlink_nl_pre_doit_port_optional() to be used by split ops in a follow-up
patch.

Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko 41a1d4d139 devlink: parse rate attrs in doit() callbacks
No need to give the rate any special treatment in netlink attribute
parsing, as unlike for ports, there are only a couple of commands
benefiting from that.

Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler
by moving the rate attributes parsing to rate_*_doit() ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko 63618463cb devlink: parse linecard attr in doit() callbacks
No need to give the linecards any special treatment in netlink attribute
parsing, as unlike for ports, there are only a couple of commands
benefiting from that.

Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler
by moving the linecard attribute parsing to linecard_[gs]et_doit() ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Sven Eckelmann 6f96d46f9a batman-adv: Drop per algo GW section class code
This code was only used in the past for the sysfs interface. But since
this was replaced with netlink, it was never executed. The function pointer
was only checked to figure out whether the limit 255 (B.A.T.M.A.N. IV) or
2**32-1 (B.A.T.M.A.N. V) should be used as limit.

So instead of keeping the function pointer, just store the limits directly
in struct batadv_algo_gw_ops.
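
The idea can be sketched in user space as follows. This is an illustration only: the struct, its field name and the validation helper are hypothetical stand-ins, and only the two limits (255 and 2**32-1) come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-algorithm gateway ops: the limit is
 * plain data instead of a function pointer that is only tested for NULL. */
struct algo_gw_ops {
	uint32_t sel_class_max;
};

static const struct algo_gw_ops batman_iv_gw = { .sel_class_max = 255 };
static const struct algo_gw_ops batman_v_gw  = { .sel_class_max = UINT32_MAX };

/* Accept a requested selection class only if it fits the algorithm's limit. */
static bool sel_class_valid(const struct algo_gw_ops *ops, uint32_t sel_class)
{
	return sel_class <= ops->sel_class_max;
}
```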

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-14 18:01:21 +02:00
Sven Eckelmann 02e61f06a9 batman-adv: Keep batadv_netlink_notify_* static
The batadv_netlink_notify_*() functions are not used by any other source
file. Just keep them local to netlink.c to get informed by the compiler
when they are not used anymore.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-14 18:01:21 +02:00
Sven Eckelmann 950c92bbaa batman-adv: Drop unused function batadv_gw_bandwidth_set
This function is no longer used since the sysfs support was removed from
batman-adv.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-14 18:01:21 +02:00
Vlad Buslov ace0ab3a4b Revert "vlan: Fix VLAN 0 memory leak"
This reverts commit 718cb09aaa.

The commit triggers multiple syzbot issues, probably due to possibility of
manually creating VLAN 0 on netdevice which will cause the code to delete
it since it can't distinguish such VLAN from implicit VLAN 0 automatically
created for devices with NETIF_F_HW_VLAN_CTAG_FILTER feature.

Reported-by: syzbot+662f783a5cdf3add2719@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000090196d0602a6167d@google.com/
Reported-by: syzbot+4b4f06495414e92701d5@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000096ae870602a61602@google.com/
Reported-by: syzbot+d810d3cd45ed1848c3f7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000009f0f9c0602a616ce@google.com/
Fixes: 718cb09aaa ("vlan: Fix VLAN 0 memory leak")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:14:00 +01:00
Adrian Moreno 43d95b30cf net: openvswitch: add misc error drop reasons
Use drop reasons from include/net/dropreason-core.h when a reasonable
candidate exists.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno f329d1bc1a net: openvswitch: add meter drop reason
Using an independent drop reason makes it easy to distinguish
between QoS-triggered and flow-triggered drops.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Eric Garver e7bc7db9ba net: openvswitch: add explicit drop action
From: Eric Garver <eric@garver.life>

This adds an explicit drop action. This is used by OVS to drop packets
for which it cannot determine what to do. An explicit action in the
kernel allows passing the reason _why_ the packet is being dropped or
zero to indicate no particular error happened (i.e: OVS intentionally
dropped the packet).

Since the error codes coming from userspace mean nothing for the kernel,
we squash all of them into only two drop reasons:
- OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed
- OVS_DROP_EXPLICIT to indicate a zero value was passed (no error)

e.g. trace all OVS dropped skbs

 # perf trace -e skb:kfree_skb --filter="reason >= 0x30000"
 [..]
 106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \
  location:0xffffffffc0d9b462, protocol: 2048, reason: 196611)

reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT)
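
The squashing rule and the decoding of the traced value can be sketched as below. This is a user-space illustration under stated assumptions: the 16-bit subsystem shift is inferred from the 0x30000 filter and the 0x30003 value in the trace, the value given to OVS_DROP_EXPLICIT_WITH_ERROR is assumed, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DROP_SUBSYS_SHIFT 16	/* assumed from the 0x30000 filter above */

enum {
	OVS_DROP_EXPLICIT            = 0x30003,	/* value taken from the trace above */
	OVS_DROP_EXPLICIT_WITH_ERROR = 0x30004,	/* assumed value, illustration only */
};

/* Squash an arbitrary userspace error code into one of two reasons. */
static uint32_t explicit_drop_reason(uint32_t user_error)
{
	return user_error ? OVS_DROP_EXPLICIT_WITH_ERROR : OVS_DROP_EXPLICIT;
}

/* Split a raw reason value into subsystem id and per-subsystem index. */
static uint32_t reason_subsys(uint32_t reason)
{
	return reason >> DROP_SUBSYS_SHIFT;
}

static uint32_t reason_index(uint32_t reason)
{
	return reason & 0xffff;
}
```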

Also, this patch allows ovs-dpctl.py to add explicit drop actions as:
  "drop"     -> implicit empty-action drop
  "drop(0)"  -> explicit non-error action drop
  "drop(42)" -> explicit error action drop

Signed-off-by: Eric Garver <eric@garver.life>
Co-developed-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno ec7bfb5e5a net: openvswitch: add action error drop reason
Add a drop reason for packets that are dropped because an action
returns a non-zero error code.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno 9d802da40b net: openvswitch: add last-action drop reason
Create a new drop reason subsystem for openvswitch and add the first
drop reason to represent last-action drops.

Last-action drops happen when a flow has an empty action list or there
is no action that consumes the packet (output, userspace, recirc, etc).
It is the most common way in which OVS drops packets.

Implementation-wise, most of these skb-consuming actions already call
"consume_skb" internally and return directly from within the
do_execute_actions() loop so with minimal changes we can assume that
any skb that exits the loop normally is a packet drop.
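
A simplified model of that control flow, for illustration only. The action and verdict names are hypothetical, and the sketch treats every consuming action as final, whereas the real loop may clone the packet when more actions follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum action { ACT_SET_FIELD, ACT_OUTPUT, ACT_USERSPACE, ACT_RECIRC };
enum verdict { PKT_CONSUMED, PKT_DROPPED_LAST_ACTION };

/* Actions that take ownership of the packet in this simplified model. */
static bool consumes_packet(enum action a)
{
	return a == ACT_OUTPUT || a == ACT_USERSPACE || a == ACT_RECIRC;
}

static enum verdict execute_actions(const enum action *acts, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (consumes_packet(acts[i]))
			return PKT_CONSUMED;	/* returns from inside the loop */
	}
	/* The loop ended normally: nothing took the packet, so it is a drop. */
	return PKT_DROPPED_LAST_ACTION;
}
```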

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Kuniyuki Iwashima e263691773 mptcp: Remove unnecessary test for __mptcp_init_sock()
__mptcp_init_sock() always returns 0 because mptcp_init_sock() used
to return the value directly.

But after commit 18b683bff8 ("mptcp: queue data for mptcp level
retransmission"), __mptcp_init_sock() need not return a value anymore.

Let's remove the unnecessary test for __mptcp_init_sock() and make
it return void.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni 39880bd808 mptcp: get rid of msk->subflow
Such field is now used just as a flag to control the first subflow
deletion at close() time. Introduce a new bit flag for that and finally
drop the mentioned field.

As an intended side effect, now the first subflow sock is not freed
before close() even for passive sockets. The msk has no open/active
subflows if the first one is closed and the subflow list is singular,
update accordingly the state check in mptcp_stream_accept().

Among other benefits, the subflow removal reduces the amount of memory
used on the client side for each mptcp connection, allows passive sockets
to go through successful accept()/disconnect()/connect() and makes the
returned error code consistent for both passive and active sockets on failure.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni 3f326a821b mptcp: change the mpc check helper to return a sk
After the previous patch the __mptcp_nmpc_socket helper is used
only to ensure that the MPTCP socket is in a suitable status - that
is, the mptcp capable handshake has not started yet.

Change the return value to the relevant subflow sock, to finally
remove the last references to first subflow socket in the MPTCP stack.

As a bonus, we can get rid of a few local variables in different
functions.

No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni 3aa3624941 mptcp: avoid ssock usage in mptcp_pm_nl_create_listen_socket()
This is one of the few remaining spots actually manipulating the
first subflow socket. We can leverage the recently introduced
inet helpers to get rid of ssock there.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni f0bc514bd5 mptcp: avoid additional indirection in sockopt
The mptcp sockopt infrastructure needlessly uses the first subflow
socket struct in a few spots. We are going to remove such field
soon, so use the first subflow sock directly instead.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni 1f6610b92a mptcp: avoid unneeded indirection in mptcp_stream_accept()
We are going to remove the first subflow socket soon, so avoid
the additional indirection at accept() time. Instead access
directly the first subflow sock, and update mptcp_accept() to
operate on it. This allows dropping a duplicated check in
mptcp_accept().

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni 5426a4ef64 mptcp: avoid additional indirection in mptcp_poll()
We are going to remove the first subflow socket soon, so avoid
the additional indirection at poll() time. Instead access
directly the first subflow sock.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:14 +01:00
Paolo Abeni 40f56d0c70 mptcp: avoid additional indirection in mptcp_listen()
We are going to remove the first subflow socket soon, so avoid
the additional indirection at listen() time. Instead call
the recently introduced helper directly on the first subflow sock.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
Paolo Abeni 71a9a874cd net: factor out __inet_listen_sk() helper
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of them
is inet_listen().

Factor out a helper operating directly on the (locked) struct sock,
to allow getting rid of the above dependency in the next patch without
duplicating the existing code.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
Paolo Abeni 8cf2ebdc00 mptcp: mptcp: avoid additional indirection in mptcp_bind()
We are going to remove the first subflow socket soon, so avoid
the additional indirection at bind() time. Instead call
the recently introduced helpers directly on the first subflow sock.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
Paolo Abeni e6d360ff87 net: factor out inet{,6}_bind_sk helpers
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of
them is bind().

Factor out the helpers operating directly on the struct sock, to
allow getting rid of the above dependency in the next patch without
duplicating the existing code.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
Paolo Abeni cfb63e50d3 mptcp: avoid subflow socket usage in mptcp_get_port()
We are going to remove the first subflow socket soon, so avoid
accessing it in mptcp_get_port(). Instead, access directly the
first subflow sock.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
Paolo Abeni ccae357c1c mptcp: avoid additional __inet_stream_connect() call
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of them is
__inet_stream_connect().

We are going to remove the first subflow socket soon, so avoid
the additional indirection at connect time, calling directly
into the sock-level connect() ops.

The sk-level connect never returns -EINPROGRESS; clean up the error
path accordingly. Additionally, the ssk status on error is always
TCP_CLOSE. Avoid unneeded access to the subflow sk state.

No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
Paolo Abeni 131a627751 mptcp: avoid unneeded mptcp_token_destroy() calls
The MPTCP protocol currently clears the msk token both at connect() and
listen() time. That is needed to deal with failing connect() calls that
can create a new token while leaving the sk in TCP_CLOSE,SS_UNCONNECTED
status and thus allowing later connect() and/or listen() calls.

Let's deal with such failures explicitly, cleaning the token in a timely
manner and avoid the confusing early mptcp_token_destroy().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 07:06:13 +01:00
David S. Miller 3d3829363b bluetooth-next pull request for net-next:
- Add new VID/PID for Mediatek MT7922
  - Add support multiple BIS/BIG
  - Add support for Intel Gale Peak
  - Add support for Qualcomm WCN3988
  - Add support for BT_PKT_STATUS for ISO sockets
  - Various fixes for experimental ISO support
  - Load FW v2 for RTL8852C
  - Add support for NXP AW693 chipset
  - Add support for Mediatek MT2925

Merge tag 'for-net-next-2023-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
2023-08-13 14:53:53 +01:00
Yue Haibing 2b8893b639 net/rds: Remove unused function declarations
Commit 39de828179 ("RDS: Main header file") declared but never implemented
rds_trans_init() and rds_trans_exit(); remove them.
Commit d37c935905 ("RDS: Move loop-only function to loop.c") removed the
implementation of rds_message_inc_free() but not the declaration.

Since commit 55b7ed0b58 ("RDS: Common RDMA transport code"),
rds_rdma_conn_connect() has never been implemented or used.
rds_tcp_map_seq() has never been implemented or used since
commit 70041088e3 ("RDS: Add TCP transport to RDS").

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13 12:25:42 +01:00
Eric Dumazet 8fe08d70a2 netlink: convert nlk->flags to atomic flags
sk_diag_put_flags(), netlink_setsockopt(), netlink_getsockopt()
and others use nlk->flags without correct locking.

Use set_bit(), clear_bit(), test_bit(), assign_bit() to remove
data-races.
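
A user-space analogue of the same idea, using C11 atomics in place of the kernel bit helpers. This is a sketch, not the netlink code: the flag bit numbers and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers; the real netlink socket flags differ. */
enum { FLAG_RECV_PKTINFO = 0, FLAG_BROADCAST_ERROR = 1 };

struct sock_flags {
	atomic_ulong bits;
};

/* Atomic read-modify-write: concurrent setters cannot lose each other's bits. */
static void flag_set(struct sock_flags *f, unsigned int nr)
{
	atomic_fetch_or(&f->bits, 1UL << nr);
}

static void flag_clear(struct sock_flags *f, unsigned int nr)
{
	atomic_fetch_and(&f->bits, ~(1UL << nr));
}

static bool flag_test(struct sock_flags *f, unsigned int nr)
{
	return (atomic_load(&f->bits) >> nr) & 1UL;
}

/* Set or clear depending on value, like an assign-style helper. */
static void flag_assign(struct sock_flags *f, unsigned int nr, bool value)
{
	if (value)
		flag_set(f, nr);
	else
		flag_clear(f, nr);
}
```

With a plain integer and "flags |= bit", two threads can read the same old value and one update is lost; the atomic read-modify-write avoids that without taking a lock.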

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13 12:23:19 +01:00
Menglong Dong 031c44b752 net: tcp: refactor the dbg message in tcp_retransmit_timer()
The debug message in tcp_retransmit_timer() is slightly wrong, because
it could be printed even if we did not receive a new ACK packet from
the remote peer.

Change it to probing zero-window, as it is an expected case now. The
description may not be correct.

Add the duration since the last ACK we received, and the duration of
the retransmission, which are useful for debugging.

The message now looks like this:

Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13 12:21:38 +01:00
Menglong Dong e89688e3e9 net: tcp: fix unexcepted socket die when snd_wnd is 0
In tcp_retransmit_timer(), a connection whose window has shrunk is regarded
as timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
always right.

The retransmits will become zero-window probes in tcp_retransmit_timer()
if 'snd_wnd==0'. Therefore, icsk->icsk_rto will reach
TCP_RTO_MAX sooner or later.

However, the timer can be delayed and be triggered after 122877ms, not
TCP_RTO_MAX, as I tested.

Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO reaches TCP_RTO_MAX, and the socket will die.

Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exactly the timestamp of the timeout.

However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp
could already be a long time (minutes or hours) in the past even on the
first RTO. So we double check the timeout with the duration of the
retransmission.

Meanwhile, use "2 * TCP_RTO_MAX" as the timeout to avoid the socket
dying too soon.
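
The two-stage check described above can be modelled in user space roughly as follows. This is a sketch under assumptions: all arguments are millisecond timestamps rather than jiffies, TCP_RTO_MAX is taken as 120 s, and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RTO_MAX_MS 120000u	/* TCP_RTO_MAX expressed in ms, assumed 120 s */

/* Hypothetical model; every argument is a millisecond timestamp. */
static bool probe0_timed_out(uint32_t timer_expiry, uint32_t last_ack,
			     uint32_t rtx_start, uint32_t now)
{
	const uint32_t timeout = 2 * RTO_MAX_MS;

	/* Measure from the programmed expiry, not from "now", so a timer
	 * that fires late does not count against the connection. */
	if (timer_expiry - last_ack <= timeout)
		return false;

	/* last_ack may be stale after an idle period, so confirm with the
	 * actual duration of this retransmission episode. */
	return now - rtx_start > timeout;
}
```

The unsigned subtractions keep the comparison correct across timestamp wraparound, which is also why the kernel compares differences rather than absolute values.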

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13 12:21:37 +01:00
Menglong Dong 800a666141 net: tcp: allow zero-window ACK update the window
For now, an ACK can update the window in the following cases, according to
tcp_may_update_window():

1. the ACK acknowledges new data
2. the ACK has new data
3. the ACK expands the window and its seq is valid

Now, we allow the ACK to update the window if the window is 0 and its
seq/ack is valid. This is for the case where the receiver replies with
a zero-window ACK when it is under memory stress and can't queue the new
data.
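
The three existing conditions plus the new zero-window case can be sketched as a user-space predicate. The struct and function names here are hypothetical stand-ins for the socket state; the logic follows the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct win_state {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_wl1;	/* sequence number of the last window update */
	uint32_t snd_wnd;	/* current send window */
};

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

static bool may_update_window(const struct win_state *s, uint32_t ack,
			      uint32_t ack_seq, uint32_t nwin)
{
	return seq_after(ack, s->snd_una) ||		/* 1. acknowledges new data */
	       seq_after(ack_seq, s->snd_wl1) ||	/* 2. carries new data */
	       (ack_seq == s->snd_wl1 &&		/* 3. valid seq, and either */
		(nwin > s->snd_wnd || nwin == 0));	/* grows the window or zeroes it */
}
```

Without the "nwin == 0" term, a pure zero-window ACK that neither acknowledges nor carries new data would be ignored, and the sender would keep its stale non-zero window.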

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13 12:21:37 +01:00
Menglong Dong e2142825c1 net: tcp: send zero-window ACK when no memory
For now, the skb will be dropped when there is no memory, which makes the
client keep retransmitting until timeout, and that is not friendly to users.

In this patch, we reply with a zero-window ACK in this case to update
the snd_wnd of the sender to 0. Therefore, the sender won't time out the
connection and will probe the zero-window with the retransmits.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13 12:21:37 +01:00
Jiri Slaby (SUSE) c70fd7c0e9 tty: rfcomm: convert counts to size_t
Unify the type of tty_operations::write() counters with the 'count'
parameter. I.e. use size_t for them.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Link: https://lore.kernel.org/r/20230810091510.13006-37-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:47 +02:00
Jiri Slaby (SUSE) 49b8220cee tty: ldops: unify to u8
Some hooks in struct tty_ldisc_ops still reference buffers by 'unsigned
char'. Unify to 'u8' as the rest of the tty layer does.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230810091510.13006-32-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:47 +02:00
Jiri Slaby (SUSE) 95713967ba tty: make tty_operations::write()'s count size_t
Unify with the rest of the code. Use size_t for counts and ssize_t for
retval.
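
A small user-space sketch of a write hook using these types. The port structure and function are hypothetical; only the choice of u8 data, size_t count and ssize_t return value reflects the change described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

typedef uint8_t u8;

/* Hypothetical driver state: a small transmit buffer. */
struct demo_port {
	u8 txbuf[8];
	size_t used;
};

/* Unsigned count in, signed result out so an error code could be returned. */
static ssize_t demo_write(struct demo_port *port, const u8 *buf, size_t count)
{
	size_t room = sizeof(port->txbuf) - port->used;

	if (count > room)
		count = room;	/* accept only what fits; caller retries the rest */
	memcpy(port->txbuf + port->used, buf, count);
	port->used += count;
	return (ssize_t)count;
}
```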

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-30-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:46 +02:00
Jiri Slaby (SUSE) 69851e4ab8 tty: propagate u8 data to tty_operations::write()
Data are now typed as u8. Propagate this change to
tty_operations::write().

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Vaibhav Gupta <vaibhavgupta40@gmail.com>
Cc: Jens Taprogge <jens.taprogge@taprogge.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Scott Branden <scott.branden@broadcom.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: David Lin <dtwlin@gmail.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Pengutronix Kernel Team <kernel@pengutronix.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Mathias Nyman <mathias.nyman@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Link: https://lore.kernel.org/r/20230810091510.13006-28-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:46 +02:00
Jiri Slaby (SUSE) 892bc209f2 tty: use u8 for flags
This makes all those 'char's an explicit 'u8'. This is part of the
continuing unification of chars and flags to be consistently u8.

This approaches tty_port_default_receive_buf().

Note that we do not change signedness as we compile with
-funsigned-char.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Chris Brannon <chris@the-brannons.com>
Cc: Kirk Reiser <kirk@reisers.ca>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-18-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:45 +02:00
Jiri Slaby (SUSE) e8161447bb tty: make tty_ldisc_ops::*buf*() hooks operate on size_t
Count passed to tty_ldisc_ops::receive_buf*(), ::lookahead_buf(), and
returned from ::receive_buf2() is expected to be size_t. So set it to
size_t to unify with the rest of the code.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Chris Brannon <chris@the-brannons.com>
Cc: Kirk Reiser <kirk@reisers.ca>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-16-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:45 +02:00
Jiri Slaby (SUSE) 6e5710e71d tty: remove dummy tty_ldisc_ops::poll() implementations
tty_ldisc_ops::poll() is optional and need not be provided. Omitting it is
equivalent to returning 0, so remove all the dummy implementations from the code.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230810091510.13006-4-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 21:12:44 +02:00
Pauli Virtanen b5793de3cf Bluetooth: hci_conn: avoid checking uninitialized CIG/CIS ids
The CIS/CIG ids of ISO connections are defined only when the connection
is unicast.

Fix the lookup functions to check for unicast first. Ensure CIG/CIS
IDs also have valid values in state BT_OPEN.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:57:54 -07:00
Pauli Virtanen 66dee21524 Bluetooth: hci_event: drop only unbound CIS if Set CIG Parameters fails
When a user tries to connect a new CIS while its CIG is not configurable,
that connection shall fail, but pre-existing connections shall not be
affected.  However, currently hci_cc_le_set_cig_params deletes all CIS
of the CIG on error, so this does not work, even though the controller
shall not change the CIG/CIS configuration if the command fails.

Fix by failing on command error only the connections that are not yet
bound, so that we keep the previous CIS configuration like the
controller does.

Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:57:33 -07:00
Ziyang Xuan 3cd43dd15f Bluetooth: Remove unnecessary NULL check before vfree()
Remove unnecessary NULL check which causes coccinelle warning:

net/bluetooth/coredump.c:104:2-7: WARNING: NULL check before some
freeing functions is not needed.

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:56:54 -07:00
Manish Mandlik a2bcd2b632 Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_add_adv_monitor()
KASAN reports a use-after-free in hci_add_adv_monitor().

While adding an adv monitor,
    hci_add_adv_monitor() calls ->
    msft_add_monitor_pattern() calls ->
    msft_add_monitor_sync() calls ->
    msft_le_monitor_advertisement_cb() calls in an error case ->
    hci_free_adv_monitor() which frees the *monitor.

This is referenced by bt_dev_dbg() in hci_add_adv_monitor().

Fix the bt_dev_dbg() by using handle instead of monitor->handle.

Fixes: b747a83690 ("Bluetooth: hci_sync: Refactor add Adv Monitor")
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:56:35 -07:00
Min Li 3673952cf0 Bluetooth: Fix potential use-after-free when clear keys
Similar to commit c5d2b6fa26 ("Bluetooth: Fix use-after-free in
hci_remove_ltk/hci_remove_irk"), we cannot access k after the kfree_rcu()
call.

Fixes: d7d41682ef ("Bluetooth: Fix Suspicious RCU usage warnings")
Signed-off-by: Min Li <lm0963hack@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:56:16 -07:00
Luiz Augusto von Dentz a1f6c3aef1 Bluetooth: hci_sync: Introduce PTR_UINT/UINT_PTR macros
This introduces PTR_UINT/UINT_PTR macros and replaces the use of
PTR_ERR/ERR_PTR.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:55:48 -07:00
Luiz Augusto von Dentz a091289218 Bluetooth: hci_conn: Fix hci_le_set_cig_params
When running with concurrent tasks, only one CIS was being assigned, so
this reworks the way the PDU is constructed so that it is handled
later at the callback instead of in place.

Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:55:28 -07:00
Luiz Augusto von Dentz f88670161e Bluetooth: hci_core: Make hci_is_le_conn_scanning public
This moves hci_is_le_conn_scanning to hci_core.h so it can be used by
different files without having to duplicate its code.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:54:59 -07:00
Luiz Augusto von Dentz f2f84a70f9 Bluetooth: hci_conn: Fix not allowing valid CIS ID
Only the number of CIS shall be limited to 0x1f; the CIS ID, on the
other hand, can be up to 0xef.

Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:54:39 -07:00
Luiz Augusto von Dentz 16e3b64291 Bluetooth: hci_conn: Fix modifying handle while aborting
This introduces hci_conn_set_handle, which takes care of verifying the
conditions under which the hci_conn handle can be modified, including
when hci_conn_abort has been called, and also checks that the handle is
valid.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:54:10 -07:00
Luiz Augusto von Dentz b7f923b1ef Bluetooth: ISO: Fix not checking for valid CIG/CIS IDs
The valid range of CIG/CIS IDs is 0x00 to 0xEF, so this makes sure they
are properly checked before attempting to use HCI_OP_LE_SET_CIG_PARAMS.

Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:53:50 -07:00
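As a rough illustration of the check described in the entry above, the following Python sketch validates an ID against the 0x00 to 0xEF range quoted in the commit message. The function and constant names are hypothetical, not kernel identifiers, and the sketch ignores any special "unset" value that the real C code may accept.

```python
# Illustrative only: names are hypothetical, not taken from the kernel.
ISO_ID_MIN = 0x00
ISO_ID_MAX = 0xEF  # upper bound quoted in the commit message


def is_valid_iso_id(value: int) -> bool:
    """Return True if value lies in the valid CIG/CIS ID range."""
    return ISO_ID_MIN <= value <= ISO_ID_MAX


def check_cig_cis(cig: int, cis: int) -> None:
    """Reject out-of-range IDs before any command would be built."""
    if not is_valid_iso_id(cig):
        raise ValueError(f"CIG ID 0x{cig:02x} out of range")
    if not is_valid_iso_id(cis):
        raise ValueError(f"CIS ID 0x{cis:02x} out of range")
```

Validating both IDs up front means a bad value is rejected before any controller command is attempted, which matches the intent stated in the commit message.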
Luiz Augusto von Dentz 5af1f84ed1 Bluetooth: hci_sync: Fix UAF on hci_abort_conn_sync
Connections may be cleaned up while waiting for the commands to complete,
so this checks whether the connection handle remains valid in case of
errors that would lead to calling hci_conn_failed:

BUG: KASAN: slab-use-after-free in hci_conn_failed+0x1f/0x160
Read of size 8 at addr ffff888001376958 by task kworker/u3:0/52

CPU: 0 PID: 52 Comm: kworker/u3:0 Not tainted
6.5.0-rc1-00527-g2dfe76d58d3a #5615
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.2-1.fc38 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x1d/0x70
 print_report+0xce/0x620
 ? __virt_addr_valid+0xd4/0x150
 ? hci_conn_failed+0x1f/0x160
 kasan_report+0xd1/0x100
 ? hci_conn_failed+0x1f/0x160
 hci_conn_failed+0x1f/0x160
 hci_abort_conn_sync+0x237/0x360

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:53:30 -07:00
Luiz Augusto von Dentz 094e363962 Bluetooth: hci_sync: Fix handling of HCI_OP_CREATE_CONN_CANCEL
When sending HCI_OP_CREATE_CONN_CANCEL it shall wait for
HCI_EV_CONN_COMPLETE, not HCI_EV_CMD_STATUS, when the reason is
anything but HCI_ERROR_REMOTE_POWER_OFF. This reason is used when
suspending or powering off, where we don't want to wait for the peer's
response.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:53:10 -07:00
Pauli Virtanen 2889bdd0a9 Bluetooth: hci_sync: delete CIS in BT_OPEN/CONNECT/BOUND when aborting
Dropped CIS that are in state BT_OPEN/BT_BOUND, and in state BT_CONNECT
with HCI_CONN_CREATE_CIS unset, should be cleaned up immediately.
Closing CIS ISO sockets should result in the hci_conn being deleted, so
that a potentially pending CIG removal can run.

hci_abort_conn cannot refer to them by handle, since their handle is
still unset if Set CIG Parameters has not yet completed.

This fixes CIS not being terminated if the socket is shut down
immediately after connection, so that the hci_abort_conn runs before Set
CIG Parameters completes. See new BlueZ test "ISO Connect Close - Success"

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:52:51 -07:00
Pauli Virtanen 69997d50ec Bluetooth: ISO: handle bound CIS cleanup via hci_conn
Calling hci_conn_del in __iso_sock_close is invalid. It needs
hdev->lock, but it cannot be acquired there due to lock ordering.

Fix this by doing cleanup via hci_conn_drop.

Return hci_conn with refcount 1 from hci_bind_cis and hci_connect_cis,
so that the iso_conn always holds one reference.  This also fixes
refcounting when error handling.

Since hci_conn_abort shall handle termination of connections in any
state properly, we can handle BT_CONNECT socket state in the same way as
BT_CONNECTED.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:52:32 -07:00
Yue Haibing 90005880a6 Bluetooth: Remove unused declaration amp_read_loc_info()
This was introduced in commit 903e454110 but was never implemented.

Fixes: 903e454110 ("Bluetooth: AMP: Use HCI cmd to Read Loc AMP Assoc")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:52:13 -07:00
Luiz Augusto von Dentz 0731c5ab4d Bluetooth: ISO: Add support for BT_PKT_STATUS
This adds support for the BT_PKT_STATUS socket option by setting
BT_SK_PKT_STATUS. Upon receiving an ISO packet, the code then
attempts to store the Packet_Status_Flag in hci_skb_pkt_status, which
is forwarded to userspace in the form of BT_SCM_PKT_STATUS whenever
BT_PKT_STATUS has been enabled/set.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:49:45 -07:00
Luiz Augusto von Dentz 3f19ffb2f9 Bluetooth: af_bluetooth: Make BT_PKT_STATUS generic
This makes the handling of BT_PKT_STATUS more generic so that, like
BT_DEFER_SETUP, it can be reused by sockets other than SCO.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:49:16 -07:00
Ying Hsu 573ebae162 Bluetooth: Fix hci_suspend_sync crash
If hci_unregister_dev() frees the hci_dev object while
hci_suspend_notifier is still accessing it, it can cause a crash.
Here's the call trace:
  <4>[102152.653246] Call Trace:
  <4>[102152.653254]  hci_suspend_sync+0x109/0x301 [bluetooth]
  <4>[102152.653259]  hci_suspend_dev+0x78/0xcd [bluetooth]
  <4>[102152.653263]  hci_suspend_notifier+0x42/0x7a [bluetooth]
  <4>[102152.653268]  notifier_call_chain+0x43/0x6b
  <4>[102152.653271]  __blocking_notifier_call_chain+0x48/0x69
  <4>[102152.653273]  __pm_notifier_call_chain+0x22/0x39
  <4>[102152.653276]  pm_suspend+0x287/0x57c
  <4>[102152.653278]  state_store+0xae/0xe5
  <4>[102152.653281]  kernfs_fop_write+0x109/0x173
  <4>[102152.653284]  __vfs_write+0x16f/0x1a2
  <4>[102152.653287]  ? selinux_file_permission+0xca/0x16f
  <4>[102152.653289]  ? security_file_permission+0x36/0x109
  <4>[102152.653291]  vfs_write+0x114/0x21d
  <4>[102152.653293]  __x64_sys_write+0x7b/0xdb
  <4>[102152.653296]  do_syscall_64+0x59/0x194
  <4>[102152.653299]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1

This patch holds the reference count of the hci_dev object while
processing it in hci_suspend_notifier to avoid potential crash
caused by the race condition.

Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:48:20 -07:00
Christophe JAILLET 82eae9dc43 Bluetooth: hci_debugfs: Use kstrtobool() instead of strtobool()
strtobool() is the same as kstrtobool().
However, the latter is more widely used within the kernel.

In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.

While at it, include the corresponding header file (<linux/kstrtox.h>)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:47:44 -07:00
Luiz Augusto von Dentz 112b5090c2 Bluetooth: MGMT: Fix always using HCI_MAX_AD_LENGTH
HCI_MAX_AD_LENGTH shall only be used if the controller doesn't support
extended advertising, otherwise HCI_MAX_EXT_AD_LENGTH shall be used
instead.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:45:26 -07:00
Douglas Anderson 6f55eea116 Bluetooth: hci_sync: Don't double print name in add/remove adv_monitor
The hci_add_adv_monitor() and hci_remove_adv_monitor() functions call
bt_dev_dbg() to print some debug statements. The bt_dev_dbg() macro
automatically adds in the device's name. That means that we shouldn't
include the name in the bt_dev_dbg() calls.

Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:45:07 -07:00
Dan Carpenter 528b2acf43 Bluetooth: msft: Fix error code in msft_cancel_address_filter_sync()
Return negative -EIO instead of positive EIO.

Fixes: 926df8962f3f ("Bluetooth: msft: Extended monitor tracking by address filter")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:44:12 -07:00
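The sign matters because callers conventionally test for failure with `err < 0`. The Python sketch below mimics that convention with hypothetical function names; it is not the kernel code, only a small model of why a positive EIO slips past such a check.

```python
import errno


def cancel_filter_buggy(ok: bool) -> int:
    # Returns a positive errno on failure, which `< 0` checks miss.
    return 0 if ok else errno.EIO


def cancel_filter_fixed(ok: bool) -> int:
    # Returns a negative errno on failure, as the fix does.
    return 0 if ok else -errno.EIO


def failed(err: int) -> bool:
    # Typical caller-side test under the negative-errno convention.
    return err < 0
```

With the buggy variant, `failed()` reports success even though the operation did not succeed; with the fixed variant the failure is detected.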
Iulia Tanasescu f777d88278 Bluetooth: ISO: Notify user space about failed bis connections
Some use cases require the user to be informed if BIG synchronization
fails. This commit makes it so that even if the BIG sync established
event arrives with error status, a new hconn is added for each BIS,
and the iso layer is notified about the failed connections.

Unsuccessful bis connections will be marked using the
HCI_CONN_BIG_SYNC_FAILED flag. From the iso layer, the POLLERR event
is triggered on the newly allocated bis sockets, before adding them
to the accept list of the parent socket.

From user space, a new fd for each failed bis connection will be
obtained by calling accept. The user should check for the POLLERR
event on the new socket, to determine if the connection was successful
or not.

The HCI_CONN_BIG_SYNC flag has been added to mark whether the BIG sync
has been successfully established. This flag is checked at bis cleanup,
so the HCI LE BIG Terminate Sync command is only issued if needed.

The BT_SK_BIG_SYNC flag indicates if BIG create sync has been called
for a listening socket, to avoid issuing the command every time a BIGInfo
advertising report is received.

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:43:44 -07:00
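From user space, the POLLERR check described in the entry above might look roughly like the following Python sketch. The helper name is hypothetical, and creating and accepting on a Bluetooth ISO socket is deliberately left out because it needs real hardware; the helper works on any socket object.

```python
import select
import socket


def has_pollerr(sock: socket.socket, timeout_ms: int = 0) -> bool:
    """Return True if POLLERR is currently reported for sock."""
    poller = select.poll()
    # POLLERR is reported regardless of the requested mask,
    # so no input or output events need to be requested here.
    poller.register(sock.fileno(), 0)
    for _fd, revents in poller.poll(timeout_ms):
        if revents & select.POLLERR:
            return True
    return False
```

In an application, `sock` would be the socket returned by `accept()` on the listening socket; a `True` result would then indicate that the BIS connection it represents failed.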
Luiz Augusto von Dentz 9f78191cc9 Bluetooth: hci_conn: Always allocate unique handles
This attempts to always allocate a unique handle for connections so they
can be properly aborted by the likes of hci_abort_conn. It uses the
invalid range as a pool of unset handles, so that if userspace is trying
to create multiple connections at once, each is given a unique
handle that is still considered unset.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:43:02 -07:00
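The idea of using the invalid range as a pool of placeholder handles can be sketched as follows. This is a simplified Python model, not the kernel implementation: the class and method names are hypothetical, and 0x0EFF is assumed to be the highest valid connection handle.

```python
HANDLE_MAX = 0x0EFF  # assumed highest valid HCI connection handle


def is_unset(handle: int) -> bool:
    """A handle above the valid range counts as 'unset'."""
    return handle > HANDLE_MAX


class HandlePool:
    """Hands out unique placeholder handles from the invalid range."""

    def __init__(self) -> None:
        self.in_use: set[int] = set()

    def alloc_unset(self) -> int:
        handle = HANDLE_MAX + 1
        while handle in self.in_use:
            handle += 1
        if handle > 0xFFFF:
            raise RuntimeError("no free placeholder handle")
        self.in_use.add(handle)
        return handle

    def release(self, handle: int) -> None:
        self.in_use.discard(handle)
```

Because every pending connection gets a distinct placeholder, an abort request can name exactly one connection even before the controller has assigned a real handle.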
Luiz Augusto von Dentz 04a51d6169 Bluetooth: hci_sync: Fix not handling ISO_LINK in hci_abort_conn_sync
ISO_LINK connections were not being handled properly in
hci_abort_conn_sync, which sometimes resulted in sending the wrong
commands, or in the reject command being sent by the socket code
(iso.c), which is a layer violation of sorts.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:42:43 -07:00
Luiz Augusto von Dentz a13f316e90 Bluetooth: hci_conn: Consolidate code for aborting connections
This consolidates code for aborting connections using
hci_cmd_sync_queue so it is synchronized with other threads. Because
some commands may block the cmd_sync_queue while waiting for specific
events, this attempts to cancel those requests by using
hci_cmd_sync_cancel.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:42:15 -07:00
Claudia Draghicescu c33362a528 Bluetooth: hci_sync: Enable events for BIS capable devices
In the case of a Synchronized Receiver capable device, enable at start-up the
events for PA reports, PA Sync Established and Big Info Adv reports.

Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:40:25 -07:00
Hilda Wu 9e14606d8f Bluetooth: msft: Extended monitor tracking by address filter
Because only a limited number of devices can be tracked per condition,
this feature adds support for tracking multiple devices concurrently.
When a pattern monitor detects a device, this feature issues an address
monitor to track that device, so that the pattern monitor can keep
watching for new devices.
The feature adds an address filter when it receives an LE monitor device
event whose monitor handle belongs to a pattern monitor and the
controller has started monitoring the device. It also cancels the
monitor advertisement from the address filter when it receives an LE
monitor device event indicating that the controller stopped monitoring
the device specified by an address and monitor handle.

The trace below shows the feature adding the address filter.

//Add MSFT pattern monitor
< HCI Command: Vendor (0x3f|0x00f0) plen 14          #142 [hci0] 55.552420
        03 b8 a4 03 ff 01 01 06 09 05 5f 52 45 46        .........._REF
> HCI Event: Command Complete (0x0e) plen 6          #143 [hci0] 55.653960
      Vendor (0x3f|0x00f0) ncmd 2
        Status: Success (0x00)
        03 00

//Got event from the pattern monitor
> HCI Event: Vendor (0xff) plen 18                   #148 [hci0] 58.384953
        23 79 54 33 77 88 97 68 02 00 fb c1 29 eb 27 b8  #yT3w..h....).'.
        00 01                                            ..

//Add MSFT address monitor (Sample address: B8:27:EB:29:C1:FB)
< HCI Command: Vendor (0x3f|0x00f0) plen 13          #149 [hci0] 58.385067
        03 b8 a4 03 ff 04 00 fb c1 29 eb 27 b8           .........).'.

//Report to userspace about found device (ADV Monitor Device Found)
@ MGMT Event: Unknown (0x002f) plen 38           {0x0003} [hci0] 58.680042
        01 00 fb c1 29 eb 27 b8 01 ce 00 00 00 00 16 00  ....).'.........
        0a 09 4b 45 59 42 44 5f 52 45 46 02 01 06 03 19  ..KEYBD_REF.....
        c1 03 03 03 12 18                                ......

//Got event from address monitor
> HCI Event: Vendor (0xff) plen 18                   #152 [hci0] 58.672956
        23 79 54 33 77 88 97 68 02 00 fb c1 29 eb 27 b8  #yT3w..h....).'.
        01 01

Signed-off-by: Alex Lu <alex_lu@realsil.com.cn>
Signed-off-by: Hilda Wu <hildawu@realtek.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:39:58 -07:00
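In the trace above, the sample address B8:27:EB:29:C1:FB appears on the wire as the bytes `fb c1 29 eb 27 b8`, that is, in reversed (little-endian) order. The short Python sketch below performs that conversion; the function name is hypothetical and nothing else about the event layout is assumed.

```python
def format_bdaddr(le_bytes: bytes) -> str:
    """Format a 6-byte little-endian address as colon-separated hex."""
    if len(le_bytes) != 6:
        raise ValueError("expected exactly 6 address bytes")
    return ":".join(f"{b:02X}" for b in reversed(le_bytes))


# Address bytes copied from the "Add MSFT address monitor" command above.
wire = bytes.fromhex("fb c1 29 eb 27 b8")
print(format_bdaddr(wire))  # prints B8:27:EB:29:C1:FB
```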
Iulia Tanasescu 6a42e9bfd1 Bluetooth: ISO: Support multiple BIGs
This adds support for creating multiple BIGs. According to the
spec, each BIG shall have a unique handle, and each BIG should
be associated with a different advertising handle. Otherwise,
the LE Create BIG command will fail, with error code
Command Disallowed (for reusing a BIG handle), or
Unknown Advertising Identifier (for reusing an advertising
handle).

The btmon snippet below shows an exercise for creating two BIGs
for the same controller, by opening two isotest instances with
the following command:
    tools/isotest -i hci0 -s 00:00:00:00:00:00

< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068) plen 31
        Handle: 0x00
        Advertising Handle: 0x01
        Number of BIS: 1
        SDU Interval: 10000 us (0x002710)
        Maximum SDU size: 40
        Maximum Latency: 10 ms (0x000a)
        RTN: 0x02
        PHY: LE 2M (0x02)
        Packing: Sequential (0x00)
        Framing: Unframed (0x00)
        Encryption: 0x00
        Broadcast Code: 00000000000000000000000000000000

> HCI Event: Command Status (0x0f) plen 4
      LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
        Status: Success (0x00)

> HCI Event: LE Meta Event (0x3e) plen 21
      LE Broadcast Isochronous Group Complete (0x1b)
        Status: Success (0x00)
        Handle: 0x00
        BIG Synchronization Delay: 912 us (0x000390)
        Transport Latency: 912 us (0x000390)
        PHY: LE 2M (0x02)
        NSE: 3
        BN: 1
        PTO: 1
        IRC: 3
        Maximum PDU: 40
        ISO Interval: 10.00 msec (0x0008)
        Connection Handle #0: 10

< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068)
        Handle: 0x01
        Advertising Handle: 0x02
        Number of BIS: 1
        SDU Interval: 10000 us (0x002710)
        Maximum SDU size: 40
        Maximum Latency: 10 ms (0x000a)
        RTN: 0x02
        PHY: LE 2M (0x02)
        Packing: Sequential (0x00)
        Framing: Unframed (0x00)
        Encryption: 0x00
        Broadcast Code: 00000000000000000000000000000000

> HCI Event: Command Status (0x0f) plen 4
      LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
        Status: Success (0x00)

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:38:12 -07:00
Luiz Augusto von Dentz 69ae506506 Bluetooth: hci_sock: Forward credentials to monitor
This stores scm_creds into hci_skb_cb so they can be properly forwarded
to the likes of btmon, which is then able to print information about the
process that originates the traffic:

bluetoothd[35]: @ MGMT Command: Rea.. (0x0001) plen 0  {0x0001}
@ MGMT Event: Command Complete (0x0001) plen 6         {0x0001}
      Read Management Version Information (0x0001) plen 3

bluetoothd[35]: < ACL Data T.. flags 0x00 dlen 41
      ATT: Write Command (0x52) len 36
        Handle: 0x0043 Type: ASE Control Point (0x2bc6)
          Data: 020203000110270000022800020a00409c0001000110270000022800020a00409c00
            Opcode: QoS Configuration (0x02)
            Number of ASE(s): 2
            ASE: #0
            ASE ID: 0x03
            CIG ID: 0x00
            CIS ID: 0x01
            SDU Interval: 10000 usec
            Framing: Unframed (0x00)
            PHY: 0x02
            LE 2M PHY (0x02)
            Max SDU: 40
            RTN: 2
            Max Transport Latency: 10
            Presentation Delay: 40000 us
            ASE: #1
            ASE ID: 0x01
            CIG ID: 0x00
            CIS ID: 0x01
            SDU Interval: 10000 usec
            Framing: Unframed (0x00)
            PHY: 0x02
            LE 2M PHY (0x02)
            Max SDU: 40
            RTN: 2
            Max Transport Latency: 10
            Presentation Delay: 40000 us

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:37:42 -07:00
Luiz Augusto von Dentz 464c702fb9 Bluetooth: Init sk_peer_* on bt_sock_alloc
This makes sure peer information is always available via sock when using
bt_sock_alloc.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:37:22 -07:00
Luiz Augusto von Dentz 6bfa273e53 Bluetooth: Consolidate code around sk_alloc into a helper function
This consolidates code around sk_alloc into bt_sock_alloc which does
take care of common initialization.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:36:50 -07:00
Pauli Virtanen 7f74563e61 Bluetooth: ISO: do not emit new LE Create CIS if previous is pending
The LE Create CIS command shall not be sent before all CIS Established
events from its previous invocation have been processed. Currently it is
sent via hci_sync, which only waits for the first event even though
there can be multiple.

Make it wait for all events, and simplify the CIS creation as follows:

Add new flag HCI_CONN_CREATE_CIS, which is set if Create CIS has been
sent for the connection but it is not yet completed.

Make the BT_CONNECT state mean that the connection wants Create CIS.

On events after which new Create CIS may need to be sent, send it if
possible and some connections need it. These events are:
hci_connect_cis, iso_connect_cfm, hci_cs_le_create_cis,
hci_le_cis_estabilished_evt.

The Create CIS status/completion events shall queue new Create CIS only
if at least one of the connections transitions away from BT_CONNECT, so
that we don't loop if controller is sending bogus events.

This fixes sending multiple CIS Create for the same CIS in the
"ISO AC 6(i) - Success" BlueZ test case:

< HCI Command: LE Create Co.. (0x08|0x0064) plen 9  #129 [hci0]
        Number of CIS: 2
        CIS Handle: 257
        ACL Handle: 42
        CIS Handle: 258
        ACL Handle: 42
> HCI Event: Command Status (0x0f) plen 4           #130 [hci0]
      LE Create Connected Isochronous Stream (0x08|0x0064) ncmd 1
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 29           #131 [hci0]
      LE Connected Isochronous Stream Established (0x19)
        Status: Success (0x00)
        Connection Handle: 257
        ...
< HCI Command: LE Setup Is.. (0x08|0x006e) plen 13  #132 [hci0]
        ...
> HCI Event: Command Complete (0x0e) plen 6         #133 [hci0]
      LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
        ...
< HCI Command: LE Create Co.. (0x08|0x0064) plen 5  #134 [hci0]
        Number of CIS: 1
        CIS Handle: 258
        ACL Handle: 42
> HCI Event: Command Status (0x0f) plen 4           #135 [hci0]
      LE Create Connected Isochronous Stream (0x08|0x0064) ncmd 1
        Status: ACL Connection Already Exists (0x0b)
> HCI Event: LE Meta Event (0x3e) plen 29           #136 [hci0]
      LE Connected Isochronous Stream Established (0x19)
        Status: Success (0x00)
        Connection Handle: 258
        ...

Fixes: c09b80be6f ("Bluetooth: hci_conn: Fix not waiting for HCI_EVT_LE_CIS_ESTABLISHED")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:36:01 -07:00
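The gating rule described in the entry above (do not send a new Create CIS while an earlier one is still pending) can be modeled with a small Python sketch. This is a toy model under stated assumptions, not the kernel logic: the class, field, and function names are hypothetical, `wants_create` stands in for the BT_CONNECT state, and `create_pending` stands in for the HCI_CONN_CREATE_CIS flag.

```python
from dataclasses import dataclass


@dataclass
class Cis:
    handle: int
    wants_create: bool = True     # models the BT_CONNECT state
    create_pending: bool = False  # models HCI_CONN_CREATE_CIS


def next_create_cis_batch(conns: list[Cis]) -> list[int]:
    """Return handles for the next Create CIS, or [] if one is pending."""
    if any(c.create_pending for c in conns):
        return []
    batch = [c for c in conns if c.wants_create]
    for c in batch:
        c.create_pending = True
    return [c.handle for c in batch]


def on_cis_established(conns: list[Cis], handle: int) -> None:
    """Mark one CIS as done; others in the batch stay pending."""
    for c in conns:
        if c.handle == handle:
            c.wants_create = False
            c.create_pending = False
```

Using the handles from the trace, a first batch would contain 257 and 258; after only 257 is established, the model returns an empty batch instead of re-sending Create CIS for 258, which is the duplicate command the trace shows.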
Iulia Tanasescu a0bfde167b Bluetooth: ISO: Add support for connecting multiple BISes
It is required for some configurations to have multiple BISes as part
of the same BIG.

Similar to the flow implemented for unicast, DEFER_SETUP will also be
used to bind multiple BISes for the same BIG, before starting Periodic
Advertising and creating the BIG.

The user will have to open a new socket for each BIS. By setting the
BT_DEFER_SETUP socket option and calling connect, a new connection
will be added for the BIG and advertising handle set by the socket
QoS parameters. Since all BISes will be bound for the same BIG and
advertising handle, the socket QoS options and base parameters should
match for all connections.

By calling connect on a socket that does not have the BT_DEFER_SETUP
option set, periodic advertising will be started and the BIG will
be created, with a BIS for each previously bound connection. Since
a BIG cannot be reconfigured with additional BISes after creation,
no more connections can be bound for the BIG after the start periodic
advertising and create BIG commands have been queued.

The bis_cleanup function has also been updated, so that the advertising
set and the BIG will not be terminated unless there are no more
bound or connected BISes.

The HCI_CONN_BIG_CREATED connection flag has been added to indicate
that the BIG has been successfully created. This flag is checked at
bis_cleanup, so that the BIG is only terminated if the
HCI_LE_Create_BIG_Complete has been received.

This implementation has been tested on hardware, using the "isotest"
tool with an additional command line option, to specify the number of
BISes to create as part of the desired BIG:

    tools/isotest -i hci0 -s 00:00:00:00:00:00 -N 2 -G 1 -T 1

The btmon log shows that a BIG containing 2 BISes has been created:

< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068) plen 31
        Handle: 0x01
        Advertising Handle: 0x01
        Number of BIS: 2
        SDU Interval: 10000 us (0x002710)
        Maximum SDU size: 40
        Maximum Latency: 10 ms (0x000a)
        RTN: 0x02
        PHY: LE 2M (0x02)
        Packing: Sequential (0x00)
        Framing: Unframed (0x00)
        Encryption: 0x00
        Broadcast Code: 00000000000000000000000000000000

> HCI Event: Command Status (0x0f) plen 4
      LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
        Status: Success (0x00)

> HCI Event: LE Meta Event (0x3e) plen 23
      LE Broadcast Isochronous Group Complete (0x1b)
        Status: Success (0x00)
        Handle: 0x01
        BIG Synchronization Delay: 1974 us (0x0007b6)
        Transport Latency: 1974 us (0x0007b6)
        PHY: LE 2M (0x02)
        NSE: 3
        BN: 1
        PTO: 1
        IRC: 3
        Maximum PDU: 40
        ISO Interval: 10.00 msec (0x0008)
        Connection Handle #0: 10
        Connection Handle #1: 11

< HCI Command: LE Setup Isochronous Data Path (0x08|0x006e) plen 13
        Handle: 10
        Data Path Direction: Input (Host to Controller) (0x00)
        Data Path: HCI (0x00)
        Coding Format: Transparent (0x03)
        Company Codec ID: Ericsson Technology Licensing (0)
        Vendor Codec ID: 0
        Controller Delay: 0 us (0x000000)
        Codec Configuration Length: 0
        Codec Configuration:

> HCI Event: Command Complete (0x0e) plen 6
      LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
        Status: Success (0x00)
        Handle: 10

< HCI Command: LE Setup Isochronous Data Path (0x08|0x006e) plen 13
        Handle: 11
        Data Path Direction: Input (Host to Controller) (0x00)
        Data Path: HCI (0x00)
        Coding Format: Transparent (0x03)
        Company Codec ID: Ericsson Technology Licensing (0)
        Vendor Codec ID: 0
        Controller Delay: 0 us (0x000000)
        Codec Configuration Length: 0
        Codec Configuration:

> HCI Event: Command Complete (0x0e) plen 6
      LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
        Status: Success (0x00)
        Handle: 11

< ISO Data TX: Handle 10 flags 0x02 dlen 44

< ISO Data TX: Handle 11 flags 0x02 dlen 44

> HCI Event: Number of Completed Packets (0x13) plen 5
        Num handles: 1
        Handle: 10
        Count: 1

> HCI Event: Number of Completed Packets (0x13) plen 5
        Num handles: 1
        Handle: 11
        Count: 1
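
The hexadecimal fields in the btmon log above can be cross-checked with a little arithmetic. The Python sketch below is only an illustration and is not part of the patch. The helper names are hypothetical. It assumes the usual HCI opcode packing (6-bit OGF above a 10-bit OCF) and the 1.25 ms unit of the ISO_Interval field.

```python
def hci_opcode(ogf: int, ocf: int) -> int:
    """Pack an OGF/OCF pair, as printed by btmon ("0x08|0x0068"), into a 16-bit opcode."""
    return ((ogf & 0x3F) << 10) | (ocf & 0x3FF)


def iso_interval_ms(raw: int) -> float:
    """Convert an ISO_Interval field (units of 1.25 ms) to milliseconds."""
    return raw * 1.25


create_big = hci_opcode(0x08, 0x0068)       # LE Create Broadcast Isochronous Group
setup_data_path = hci_opcode(0x08, 0x006E)  # LE Setup Isochronous Data Path
```

With these helpers, an ISO Interval of 0x0008 corresponds to the 10.00 msec shown in the log, and the SDU Interval 0x002710 is the decimal 10000 us.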

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:35:33 -07:00
Claudia Draghicescu ae75336131 Bluetooth: Check for ISO support in controller
This patch checks whether the controller supports ISO_BROADCASTER and
ISO_SYNC_RECEIVER.

Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11 11:31:23 -07:00
Jakub Kicinski 4d016ae42e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/igc/igc_main.c
  06b412589e ("igc: Add lock to safeguard global Qbv variables")
  d3750076d4 ("igc: Add TransmissionOverrun counter")

drivers/net/ethernet/microsoft/mana/mana_en.c
  a7dfeda6fd ("net: mana: Fix MANA VF unload when hardware is unresponsive")
  a9ca9f9cef ("page_pool: split types and declarations from page_pool.h")
  92272ec410 ("eth: add missing xdp.h includes in drivers")

net/mptcp/protocol.h
  511b90e392 ("mptcp: fix disconnect vs accept race")
  b8dc6d6ce9 ("mptcp: fix rcv buffer auto-tuning")

tools/testing/selftests/net/mptcp/mptcp_join.sh
  c8c101ae39 ("selftests: mptcp: join: fix 'implicit EP' test")
  03668c65d1 ("selftests: mptcp: join: rework detailed report")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 14:10:53 -07:00
Linus Torvalds 25aa0bebba Including fixes from netfilter, wireless and bpf.
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, wireless and bpf.

  Still trending up in size but the good news is that the "current"
  regressions are resolved, AFAIK.

  We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
  handling on embedded this week (most not merged yet), not sure why.
  But those are all for older bugs.

  Current release - regressions:

   - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
     over to TCP

  Current release - new code bugs:

   - eth: mlx5: correct IDs on VFs internal to the device (IPU)

  Previous releases - regressions:

   - phy: at803x: fix WoL support / reporting on AR8032

   - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
     slaves, leading to BUG_ON()

   - tun: prevent tun_build_skb() from exceeding the packet size limit

   - wifi: rtw89: fix 8852AE disconnection caused by RX full flags

   - eth/PCI: enetc: fix probing after 6fffbc7ae1 ("PCI: Honor
     firmware's device disabled status"), keep PCI devices around even
     if they are disabled / not going to be probed to be able to apply
     quirks on them

   - eth: prestera: fix handling IPv4 routes with nexthop IDs

  Previous releases - always broken:

   - netfilter: re-work garbage collection to avoid races between
     user-facing API and timeouts

   - tunnels: fix generating ipv4 PMTU error on non-linear skbs

   - nexthop: fix infinite nexthop bucket dump when using maximum
     nexthop ID

   - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()

  Misc:

   - unix: use consistent error code in SO_PEERPIDFD

   - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
     upcoming IETF RFC"

* tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
  net: hns3: fix strscpy causing content truncation issue
  net: tls: set MSG_SPLICE_PAGES consistently
  ibmvnic: Ensure login failure recovery is safe from other resets
  ibmvnic: Do partial reset on login failure
  ibmvnic: Handle DMA unmapping of login buffs in release functions
  ibmvnic: Unmap DMA login rsp buffer on send login fail
  ibmvnic: Enforce stronger sanity checks on login response
  net: mana: Fix MANA VF unload when hardware is unresponsive
  netfilter: nf_tables: remove busy mark and gc batch API
  netfilter: nft_set_hash: mark set element as dead when deleting from packet path
  netfilter: nf_tables: adapt set backend to use GC transaction API
  netfilter: nf_tables: GC transaction API to avoid race with control plane
  selftests/bpf: Add sockmap test for redirecting partial skb data
  selftests/bpf: fix a CI failure caused by vsock sockmap test
  bpf, sockmap: Fix bug that strp_done cannot be called
  bpf, sockmap: Fix map type error in sock_map_del_link
  xsk: fix refcount underflow in error path
  ipv6: adjust ndisc_is_useropt() to also return true for PIO
  selftests: forwarding: bridge_mdb: Make test more robust
  selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
  ...
2023-08-10 12:37:24 -07:00
Jakub Kicinski 6b486676b4 net: tls: set MSG_SPLICE_PAGES consistently
We used to change the flags for the last segment, because
non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
That flag is no longer a thing so remove the setting.

Since flags most likely don't have MSG_SPLICE_PAGES set
this avoids passing parts of the sg as splice and parts
as non-splice. Before commit under Fixes we'd have called
tcp_sendpage() which would add the MSG_SPLICE_PAGES.

Why this leads to trouble remains unclear but Tariq
reports hitting the WARN_ON(!sendpage_ok()) due to
page refcount of 0.

Fixes: e117dcfd64 ("tls: Inline do_tcp_sendpages()")
Reported-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/
Tested-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 11:36:57 -07:00
Jakub Kicinski 3e91b0ebd9 netfilter pull request 23-08-10

Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The existing attempt to resolve races between control plane and GC work
is error prone. As reported by Bien Pham <phamnnb@sea.com>, some places
forgot to call nft_set_elem_mark_busy(), leading to double-deactivation
of elements.

This series contains the following patches:

1) Do not skip expired elements during walk; otherwise the reference
   counter on data might never be decremented, leading to a memleak.

2) Add a GC transaction API to replace the former attempt to deal with
   races between control plane and GC. The GC worker sets
   NFT_SET_ELEM_DEAD_BIT on expired elements and creates a GC
   transaction to remove them. The GC transaction can be aborted in
   case of interference with the control plane and retried later (GC
   async). Set backends such as rbtree and pipapo also perform GC from
   the control plane (GC sync). In that case, element deactivation and
   removal is safe because the mutex is held; collected elements are
   then released via call_rcu().

3) Adapt existing set backends to use the GC transaction API.

4) Update the rhash set backend to set the _DEAD bit on elements deleted
   from the datapath, so that GC reclaims them.

5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT.

* tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: remove busy mark and gc batch API
  netfilter: nft_set_hash: mark set element as dead when deleting from packet path
  netfilter: nf_tables: adapt set backend to use GC transaction API
  netfilter: nf_tables: GC transaction API to avoid race with control plane
  netfilter: nf_tables: don't skip expired elements during walk
====================

Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 10:47:08 -07:00
Jakub Kicinski 62d02fca8b bpf pull-request 2023-08-09

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Martin KaFai Lau says:

====================
pull-request: bpf 2023-08-09

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 6 files changed, 102 insertions(+), 8 deletions(-).

The main changes are:

1) A bpf sockmap memleak fix and a fix in accessing the programs of
   a sockmap under the incorrect map type from Xu Kuohai.

2) A refcount underflow fix in xsk from Magnus Karlsson.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add sockmap test for redirecting partial skb data
  selftests/bpf: fix a CI failure caused by vsock sockmap test
  bpf, sockmap: Fix bug that strp_done cannot be called
  bpf, sockmap: Fix map type error in sock_map_del_link
  xsk: fix refcount underflow in error path
====================

Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 10:41:36 -07:00
Pablo Neira Ayuso a2dd0233cb netfilter: nf_tables: remove busy mark and gc batch API
Ditch it: it has been replaced by the GC transaction API and it has no
clients anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10 08:25:27 +02:00
Pablo Neira Ayuso c92db30304 netfilter: nft_set_hash: mark set element as dead when deleting from packet path
Set the NFT_SET_ELEM_DEAD_BIT flag on this element instead of
performing element removal, which might race with an ongoing transaction.
Enable GC when the dynamic flag is set, since dynset deletion requires
garbage collection after this patch.

Fixes: d0a8d877da ("netfilter: nft_dynset: support for element deletion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10 08:25:27 +02:00
Pablo Neira Ayuso f6c383b8c3 netfilter: nf_tables: adapt set backend to use GC transaction API
Use the GC transaction API to replace the old and buggy gc API and the
busy mark approach.

No set elements are removed from async garbage collection anymore.
Instead, the _DEAD bit is set so that the set element is no longer
visible from the lookup path. Async GC enqueues transaction work that
might be aborted and retried later.

The rbtree and pipapo set backends do not set the _DEAD bit from the
sync GC path, since this runs in the control plane path where the mutex
is held. In this case, set elements are deactivated, removed and then
released via RCU callback; sync GC never fails.

Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10 08:25:27 +02:00
Pablo Neira Ayuso 5f68718b34 netfilter: nf_tables: GC transaction API to avoid race with control plane
The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.

The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.

We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get the following deadlock:

   cpu 1                                cpu2
     GC work                            transaction comes in , lock nft mutex
       `acquire nft mutex // BLOCKS
                                        transaction asks to remove the set
                                        set destruction calls cancel_work_sync()

cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.

This patch adds a new API that deals with garbage collection in two
steps:

1) Lockless GC of expired elements sets the NFT_SET_ELEM_DEAD_BIT so
   they are not visible via lookup. Annotate the current GC sequence in
   the GC transaction. Enqueue GC transaction work as soon as it is
   full. If the ruleset is updated, then the GC transaction is aborted
   and retried later.

2) GC work grabs the mutex. If GC sequence has changed then this GC
   transaction lost race with control plane, abort it as it contains
   stale references to objects and let GC try again later. If the
   ruleset is intact, then this GC transaction deactivates and removes
   the elements and it uses call_rcu() to destroy elements.
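
The two steps above can be sketched as a small userspace model. This is a toy illustration in Python, not the kernel implementation: the class and function names are hypothetical, a threading.Lock stands in for the transaction mutex, and a plain integer stands in for the GC sequence.

```python
import threading


class Ruleset:
    """Toy stand-in for the control plane state."""

    def __init__(self):
        self.mutex = threading.Lock()   # plays the role of the transaction mutex
        self.gc_seq = 0                 # bumped by every control plane update
        self.elements = {}              # key -> {"expired": bool, "dead": bool}

    def commit(self, key, expired=False):
        """Control plane update: takes the mutex and invalidates pending GC."""
        with self.mutex:
            self.gc_seq += 1
            self.elements[key] = {"expired": expired, "dead": False}


def gc_collect(ruleset):
    """Step 1 (lockless): mark expired elements dead and record the GC sequence."""
    seq = ruleset.gc_seq
    batch = []
    for key, elem in ruleset.elements.items():
        if elem["expired"]:
            elem["dead"] = True         # hidden from lookups, not yet removed
            batch.append(key)
    return seq, batch


def gc_work(ruleset, seq, batch):
    """Step 2 (under the mutex): remove the batch unless the ruleset changed."""
    with ruleset.mutex:
        if ruleset.gc_seq != seq:
            return False                # lost the race; a later scan retries
        for key in batch:
            del ruleset.elements[key]
        return True
```

In this model a commit that lands between gc_collect() and gc_work() makes gc_work() return False and leaves the elements in place, marked dead, until the next scan collects them again.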

Note that no elements are removed from the lockless GC path: the _DEAD
bit is set and pointers are collected. GC catchall does not remove
elements anymore either. There is a new set->dead flag that is set to
abort the GC transaction; it deals with the set->ops->destroy() path,
which removes the remaining elements in the set from commit_release,
where no mutex is held.

To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.

Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.

We cannot do a cancel_work_sync() or flush_work() in nft_set_destroy()
because it is called with the transaction mutex held, but the
aforementioned async work queue might be blocked on the very mutex that
the nft_set_destroy() callchain is sitting on.

This gives us the choice of ABBA deadlock or UaF.

To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.
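
The refcount idea can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification, not the kernel code: ToySet and the three helper functions are hypothetical names, and "freed" simply records when the last reference is dropped.

```python
class ToySet:
    """Toy model of a set whose lifetime is governed by a reference count."""

    def __init__(self):
        self.refs = 1         # reference held by the control plane (the owner)
        self.dead = False     # set by the destroy path so pending GC aborts
        self.freed = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # released only when the last user is gone


def gc_transaction_start(s):
    s.get()                    # GC pins the set before queueing work


def gc_transaction_finish(s):
    removed = not s.dead       # a dead set means: abort, touch nothing
    s.put()
    return removed


def set_destroy(s):
    s.dead = True
    s.put()                    # drop the owner's reference without waiting for GC
```

Here the destroy path never waits for the GC work (no deadlock), and the set object outlives the destroy call until the pending GC transaction drops its reference (no use-after-free).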

Set backends are adapted to use the GC transaction API in a follow up
patch entitled:

  ("netfilter: nf_tables: use gc transaction API in set backends")

This is joint work with Florian Westphal.

Fixes: cfed7e1b1f ("netfilter: nf_tables: add set garbage collection helpers")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10 08:25:16 +02:00
Xu Kuohai 809e4dc71a bpf, sockmap: Fix bug that strp_done cannot be called
strp_done is only called when psock->progs.stream_parser is not NULL,
but stream_parser was set to NULL by sk_psock_stop_strp(), called
by sk_psock_drop() earlier. So, strp_done can never be called.

Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
Change the condition for calling strp_done from judging whether
stream_parser is set to judging whether this flag is set. This flag is
only set once when strp_init() succeeds, and will never be cleared later.
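
The change in the teardown condition can be modelled in a few lines. This Python sketch is purely illustrative: the Psock class and the function names are hypothetical stand-ins, and only the ordering described in this message is modelled.

```python
class Psock:
    def __init__(self):
        self.stream_parser = None      # models psock->progs.stream_parser
        self.rx_enabled = False        # models the SK_PSOCK_RX_ENABLED flag
        self.strp_done_called = False


def strp_init(psock, prog):
    psock.stream_parser = prog
    psock.rx_enabled = True            # set once on success, never cleared


def stop_strp(psock):
    psock.stream_parser = None         # the earlier step that clears the pointer


def destroy_old(psock):
    # Old condition: depends on a pointer that was already cleared.
    if psock.stream_parser is not None:
        psock.strp_done_called = True


def destroy_new(psock):
    # New condition: depends on the sticky flag instead.
    if psock.rx_enabled:
        psock.strp_done_called = True
```

After strp_init() followed by stop_strp(), destroy_old() skips the cleanup while destroy_new() performs it.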

Fixes: c0d95d3380 ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-09 20:29:02 -07:00
Xu Kuohai 7e96ec0e66 bpf, sockmap: Fix map type error in sock_map_del_link
sock_map_del_link() operates on both SOCKMAP and SOCKHASH. Although
both types have a member named "progs", the offset of the "progs" member
differs between the two types, so "progs" should be accessed with the
real map type.
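
Why the offset matters can be shown with two made-up layouts. The Python ctypes sketch below is an illustration only: FakeStab and FakeHtab are hypothetical structures whose fields are invented for the example and do not reproduce the kernel definitions.

```python
import ctypes


class FakeMap(ctypes.Structure):
    _fields_ = [("header", ctypes.c_uint8 * 64)]


class Progs(ctypes.Structure):
    _fields_ = [("stream_parser", ctypes.c_uint64)]


class FakeStab(ctypes.Structure):          # stands in for the SOCKMAP type
    _fields_ = [("map", FakeMap),
                ("sks", ctypes.c_void_p),
                ("progs", Progs)]


class FakeHtab(ctypes.Structure):          # stands in for the SOCKHASH type
    _fields_ = [("map", FakeMap),
                ("buckets", ctypes.c_void_p),
                ("buckets_num", ctypes.c_uint32),
                ("elem_size", ctypes.c_uint32),
                ("progs", Progs)]


def read_progs_as(view_type, obj):
    """Reinterpret obj's bytes as view_type and read progs.stream_parser."""
    return view_type.from_buffer(obj).progs.stream_parser
```

Reading a FakeHtab object through the FakeStab layout picks up whatever bytes sit at the other type's "progs" offset, which is the kind of mix-up the fix avoids by using the real map type.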

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-09 20:29:02 -07:00
Magnus Karlsson 85c2c79a07 xsk: fix refcount underflow in error path
Fix a refcount underflow problem reported by syzbot that can happen
when a system is running out of memory. If xp_alloc_tx_descs() fails,
and it can only fail due to not having enough memory, then the error
path is triggered. In this error path, the refcount of the pool is
decremented, as it was incremented before. However, the reference to
the pool in the socket was not nulled. This means that when the socket
is closed later, the socket teardown logic will think that there is a
pool attached to the socket and try to decrease the refcount again,
leading to a refcount underflow.
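
The sequence described above can be reproduced in a toy model. The Python sketch below is hypothetical and heavily simplified: Pool, Socket, bind_shared() and close() are invented names, and the alloc_ok parameter stands in for the result of xp_alloc_tx_descs().

```python
class Pool:
    def __init__(self):
        self.users = 0

    def get(self):
        self.users += 1

    def put(self):
        if self.users == 0:
            raise RuntimeError("refcount underflow")
        self.users -= 1


class Socket:
    def __init__(self):
        self.pool = None


def bind_shared(sock, pool, alloc_ok, null_on_error):
    pool.get()
    sock.pool = pool
    if not alloc_ok:               # models the allocation failing
        pool.put()                 # error path drops the reference it took
        if null_on_error:
            sock.pool = None       # the single added line described above
        return False
    return True


def close(sock):
    if sock.pool is not None:      # teardown believes a pool is attached
        sock.pool.put()
        sock.pool = None
```

Without the nulling, close() drops a reference that the error path already released; with it, close() finds no pool and does nothing.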

I chose this fix as it involved adding just a single line. Another
option would have been to move xp_get_pool() and the assignment of
xs->pool to after the if-statement and using xs_umem->pool instead of
xs->pool in the whole if-statement resulting in somewhat simpler code,
but this would have led to much more churn in the code base perhaps
making it harder to backport.

Fixes: ba3beec2ec ("xsk: Fix possible crash when multiple sockets are created")
Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-09 20:08:04 -07:00
Vladimir Oltean 665338b2a7 net/sched: taprio: dump class stats for the actual q->qdiscs[]
This makes a difference for the software scheduling mode, where
dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
but when we're talking about what Qdisc and stats get reported for a
traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
is.

To understand the difference, I've attempted to send 100 packets in
software mode through class 8001:5, and recorded the stats before and
after the change.

Here is before:

$ tc -s class show dev eth0
class taprio 8001:1 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:2 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:3 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:4 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:5 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:6 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:7 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:8 root leaf 8001:
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0

and here is after:

class taprio 8001:1 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:2 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:3 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:4 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:5 root
 Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:6 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:7 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0
class taprio 8001:8 root leaf 800d:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 window_drops 0

The most glaring (and expected) difference is that before, all class
stats reported the global stats, whereas now, they really report just
the counters for that traffic class.
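
The before/after outputs can be mimicked with a toy model. This Python sketch is illustrative only: the Qdisc class and the two stats functions are hypothetical, and it assumes, as the outputs above suggest, that in software mode both the root and the per-class child counters see each packet.

```python
class Qdisc:
    def __init__(self):
        self.packets = 0
        self.bytes = 0


def send(root, children, class_id, size):
    """Account one packet on the root and on the child of the given class (1-based)."""
    for q in (root, children[class_id - 1]):
        q.packets += 1
        q.bytes += size


def class_stats_before(root, children, class_id):
    # Old behavior: every class reports the root's (global) counters.
    return root.packets, root.bytes


def class_stats_after(root, children, class_id):
    # New behavior: each class reports the counters of its own child.
    child = children[class_id - 1]
    return child.packets, child.bytes
```

Sending 100 packets of 94 bytes through class 5 gives (100, 9400) for every class with the old function, and (100, 9400) only for class 5 with the new one.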

Finally, Pedro Tammela points out that there is a tc selftest which
checks specifically which handle the child Qdiscs corresponding to
each class have. That's changing here - taprio no longer reports
tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0
(the handle of the default pfifo child Qdiscs). Since iproute2 does not
print a child Qdisc handle of 0, adjust the test's expected output.

Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:59:21 -07:00
Vladimir Oltean 6e0ec800c1 net/sched: taprio: delete misleading comment about preallocating child qdiscs
As mentioned in commit af7b29b1de ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike
mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport
between Qdisc_ops :: init() and Qdisc_ops :: attach().

Delete the comment, which is just stolen from mqprio, but there, the
usage patterns are a lot different, and this is nothing but confusing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:59:20 -07:00
Vladimir Oltean 98766add2d net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()
This is another stab at commit 1461d212ab ("net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later
reverted in commit af7b29b1de ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").

I believe that the problems that caused the revert were fixed, and thus,
this change is identical to the original patch.

Its purpose is to properly reject attaching a software taprio child
qdisc to a software taprio parent. Because unoffloaded taprio currently
reports itself (the root Qdisc) as the return value from qdisc_leaf(),
then the process of attaching another taprio as child to a Qdisc class
of the root will just result in a Qdisc_ops :: change() call for the
root. That's not what we want. We want Qdisc_ops :: init() to be
called for the taprio child, in order to give the taprio child a chance
to check whether its sch->parent is TC_H_ROOT or not (and reject this
configuration).
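
The effect of what qdisc_leaf() returns can be sketched with a toy dispatcher. The Python model below is a deliberate simplification and not the real tc code path: Pfifo, Taprio and attach_taprio() are hypothetical, and the rule "an existing leaf of the same kind gets change(), anything else gets a fresh init()" is an assumption made for the illustration.

```python
TC_H_ROOT = 0xFFFFFFFF  # value of TC_H_ROOT in the uapi headers


class Pfifo:
    kind = "pfifo"


class Taprio:
    kind = "taprio"

    def __init__(self, parent):
        # Models the parent check that init() gets a chance to perform.
        if parent != TC_H_ROOT:
            raise ValueError("taprio can only be attached as the root qdisc")
        self.changed = False

    def change(self):
        self.changed = True


def attach_taprio(leaf, parent):
    """Toy dispatch: change an existing leaf of the same kind, else init a new qdisc."""
    if leaf is not None and leaf.kind == Taprio.kind:
        leaf.change()
        return "change"
    Taprio(parent)          # raises for a non-root parent
    return "init"
```

When the class leaf is reported as the root taprio, the request silently becomes a change() of the root; when the leaf is the pfifo child, init() runs and the non-root parent is rejected.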

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:59:20 -07:00
Vladimir Oltean 25b0d4e4e4 net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
Normally, Qdiscs have one reference on them held by their owner and one
held for each TXQ to which they are attached, however this is not the
case with the children of an offloaded taprio. Instead, the taprio qdisc
currently lives in the following fragile equilibrium.

In the software scheduling case, taprio attaches itself (the root Qdisc)
to all TXQs, thus having a refcount of 1 + the number of TX queues. In
this mode, the q->qdiscs[] children are not visible directly to the
Qdisc API. The lifetime of the Qdiscs from this private array lasts
until qdisc_destroy() -> taprio_destroy().

In the fully offloaded case, the root taprio has a refcount of 1, and
all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
are attached to the netdev TXQs directly and thus are visible to the
Qdisc API, however taprio loses a reference to them very early - during
qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
the q->qdiscs[] array to not leak memory, but interestingly, it does not
release a reference on these qdiscs because it doesn't effectively own
them - they are created by taprio but owned by the Qdisc core, and will
be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
the Qdisc is deleted or when the child Qdisc is replaced with something
else.

My interest is to change this equilibrium such that taprio also owns a
reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
Qdisc, including in full offload mode. I want this because I would like
taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
insight into q->qdiscs[] for the software scheduling mode - currently
they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
as the root taprio.

The following set of changes is necessary:
- don't free q->qdiscs[] early in taprio_attach(), free it late in
  taprio_destroy() for consistency with software mode. But:
- currently that's not possible, because taprio doesn't own a reference
  on q->qdiscs[]. So hold that reference - once during the initial
  attach() and once during subsequent graft() calls when the child is
  changed.
- always keep track of the current child in q->qdiscs[], even for full
  offload mode, so that we free in taprio_destroy() what we should, and
  not something stale.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:59:20 -07:00
Vladimir Oltean 09e0c3bbde net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()
This is a simple code transformation with no intended behavior change,
just to make it absolutely clear that q->qdiscs[] is only attached to
the child taprio classes in full offload mode.

Right now we use the q->qdiscs[] variable in taprio_attach() for
software mode too, but that is quite confusing and avoidable. We use
it only to reach the netdev TX queue, but we could as well just use
netdev_get_tx_queue() for that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:59:20 -07:00
Maciej Żenczykowski 048c796beb ipv6: adjust ndisc_is_useropt() to also return true for PIO
The upcoming (and nearly finalized):
  https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/
will update the IPv6 RA to include a new flag in the PIO field,
which will serve as a hint to perform DHCPv6-PD.

As we don't want DHCPv6 related logic inside the kernel, this piece of
information needs to be exposed to userspace.  The simplest option is to
simply expose the entire PIO through the already existing mechanism.

Even without this new flag, the already existing PIO R (router address)
flag (from RFC6275) cannot AFAICT be handled entirely in kernel,
and provides useful information that should be exposed to userspace
(the router's global address, for use by Mobile IPv6).
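
As a rough picture of the change, the userspace-visible option check can be thought of as a membership test on the ND option type. The Python sketch below is an illustration, not the kernel function: the option numbers are the IANA-assigned values from the cited RFCs, while the exact contents of the "before" set are an assumption, and is_useropt() is a hypothetical helper.

```python
ND_OPT_PREFIX_INFO = 3       # Prefix Information Option (PIO), RFC 4861
ND_OPT_RDNSS = 25            # RFC 8106
ND_OPT_DNSSL = 31            # RFC 8106
ND_OPT_CAPTIVE_PORTAL = 37   # RFC 8910
ND_OPT_PREF64 = 38           # RFC 8781

# Assumed set of option types exposed to userspace before this change.
USEROPT_BEFORE = frozenset({
    ND_OPT_RDNSS, ND_OPT_DNSSL, ND_OPT_CAPTIVE_PORTAL, ND_OPT_PREF64,
})

# After this change the PIO is exposed as well.
USEROPT_AFTER = USEROPT_BEFORE | {ND_OPT_PREFIX_INFO}


def is_useropt(opt_type, exposed):
    """Return True if an RA option of this type should be passed to userspace."""
    return opt_type in exposed
```

A userspace daemon listening for these notifications would then see the whole PIO, including any flags defined later, without further kernel changes.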

Also cc'ing stable@ for inclusion in LTS, as while technically this is
not quite a bugfix, and instead more of a feature, it is absolutely
trivial and the alternative is manually cherrypicking into all Android
Common Kernel trees - and I know Greg will ask for it to be sent in via
LTS instead...

Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Cc: stable@vger.kernel.org
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20230807102533.1147559-1-maze@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:36:12 -07:00
Nick Desaulniers fa1891aeb7 net/llc/llc_conn.c: fix 4 instances of -Wmissing-variable-declarations
I'm looking to enable -Wmissing-variable-declarations behind W=1. 0day
bot spotted the following instances:

  net/llc/llc_conn.c:44:5: warning: no previous extern declaration for
  non-static variable 'sysctl_llc2_ack_timeout'
  [-Wmissing-variable-declarations]
  44 | int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ;
     |     ^
  net/llc/llc_conn.c:44:1: note: declare 'static' if the variable is not
  intended to be used outside of this translation unit
  44 | int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ;
     | ^
  net/llc/llc_conn.c:45:5: warning: no previous extern declaration for
  non-static variable 'sysctl_llc2_p_timeout'
  [-Wmissing-variable-declarations]
  45 | int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ;
     |     ^
  net/llc/llc_conn.c:45:1: note: declare 'static' if the variable is not
  intended to be used outside of this translation unit
  45 | int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ;
     | ^
  net/llc/llc_conn.c:46:5: warning: no previous extern declaration for
  non-static variable 'sysctl_llc2_rej_timeout'
  [-Wmissing-variable-declarations]
  46 | int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ;
     |     ^
  net/llc/llc_conn.c:46:1: note: declare 'static' if the variable is not
  intended to be used outside of this translation unit
  46 | int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ;
     | ^
  net/llc/llc_conn.c:47:5: warning: no previous extern declaration for
  non-static variable 'sysctl_llc2_busy_timeout'
  [-Wmissing-variable-declarations]
  47 | int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ;
     |     ^
  net/llc/llc_conn.c:47:1: note: declare 'static' if the variable is not
  intended to be used outside of this translation unit
  47 | int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ;
     | ^

These symbols are referenced by more than one translation unit, so
include the correct header for their declarations. Finally, sort the
list of includes to help keep them tidy.
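
The declaration/definition split that silences the warning can be sketched
in a single file. This is only an illustration under assumptions: the name
demo_ack_timeout and the header "demo_llc.h" are hypothetical stand-ins,
and in a real tree the extern declaration would live in a shared header
rather than next to the definition.

```c
#include <assert.h>

/* Would normally sit in a shared header (hypothetical "demo_llc.h"):
 * the declaration every user of the variable is expected to see. */
extern int demo_ack_timeout;

/* Would normally sit in exactly one .c file that includes that header.
 * With the extern declaration visible, -Wmissing-variable-declarations
 * has nothing to report for this definition. */
int demo_ack_timeout = 5 * 100;

/* Stand-in for a second translation unit that uses the variable. */
void demo_set_ack_timeout(int ticks)
{
	assert(ticks > 0);	/* a non-positive timeout makes no sense here */
	demo_ack_timeout = ticks;
}
```

Declaring the variable static would also silence the warning, but only
works when a single translation unit uses it, which is not the case here.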

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/llvm/202308081000.tTL1ElTr-lkp@intel.com/
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808-llc_static-v1-1-c140c4c297e4@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:34:28 -07:00
Eric Dumazet 1ded5e5a59 net: annotate data-races around sock->ops
IPV6_ADDRFORM socket option is evil, because it can change sock->ops
while other threads might read it. Same issue for sk->sk_family
being set to AF_INET.

Adding READ_ONCE() over sock->ops reads is needed for sockets
that might be impacted by IPV6_ADDRFORM.

Note that mptcp_is_tcpsk() can also overwrite sock->ops.

Adding annotations for all sk->sk_family reads will require
more patches :/
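
A userspace sketch of the reader/writer pattern being annotated, under
stated assumptions: DEMO_READ_ONCE() and DEMO_WRITE_ONCE() are simplified
stand-ins for the kernel macros, and the demo_* structs and functions are
hypothetical. The point is that the reader loads the ops pointer once and
then only uses its local copy.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins (assumptions, not the kernel macros): force a single
 * load or store through a volatile-qualified pointer. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct demo_ops {
	int (*sendmsg)(int len);
};

struct demo_sock {
	const struct demo_ops *ops;
};

static int demo_v6_sendmsg(int len) { return len + 6; }
static int demo_v4_sendmsg(int len) { return len + 4; }

static const struct demo_ops demo_v6_ops = { .sendmsg = demo_v6_sendmsg };
static const struct demo_ops demo_v4_ops = { .sendmsg = demo_v4_sendmsg };

/* Writer side, modelled on what IPV6_ADDRFORM does to sock->ops. */
void demo_addrform(struct demo_sock *sock)
{
	DEMO_WRITE_ONCE(sock->ops, &demo_v4_ops);
}

/* Reader side: load the pointer once, then use only the local copy. */
int demo_sock_sendmsg(struct demo_sock *sock, int len)
{
	const struct demo_ops *ops = DEMO_READ_ONCE(sock->ops);

	assert(ops != NULL && ops->sendmsg != NULL);
	return ops->sendmsg(len);
}
```

Without the single load, a compiler is free to read sock->ops twice and a
concurrent writer could make the two reads disagree.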

BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt

write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0:
do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663
__sys_setsockopt+0x1c3/0x230 net/socket.c:2273
__do_sys_setsockopt net/socket.c:2284 [inline]
__se_sys_setsockopt net/socket.c:2281 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2281
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1:
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x349/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmmsg+0x263/0x500 net/socket.c:2643
__do_sys_sendmmsg net/socket.c:2672 [inline]
__se_sys_sendmmsg net/socket.c:2669 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0xffffffff850e32b8 -> 0xffffffff850da890

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:32:43 -07:00
Jakub Kicinski 15c8795dbf Just a few small updates:
 * fix an integer overflow in nl80211
 * fix rtw89 8852AE disconnections
 * fix a buffer overflow in ath12k
 * fix AP_VLAN configuration lookups
 * fix allocation failure handling in brcm80211
 * update MAINTAINERS for some drivers

Merge tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Just a few small updates:
 * fix an integer overflow in nl80211
 * fix rtw89 8852AE disconnections
 * fix a buffer overflow in ath12k
 * fix AP_VLAN configuration lookups
 * fix allocation failure handling in brcm80211
 * update MAINTAINERS for some drivers

* tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: ath12k: Fix buffer overflow when scanning with extraie
  wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
  wifi: cfg80211: fix sband iftype data lookup for AP_VLAN
  wifi: rtw89: fix 8852AE disconnection caused by RX full flags
  MAINTAINERS: Remove tree entry for rtl8180
  MAINTAINERS: Update entry for rtl8187
  wifi: brcm80211: handle params_v1 allocation failure
====================

Link: https://lore.kernel.org/r/20230809124818.167432-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 15:04:44 -07:00
Jakub Kicinski 052059b663 nf-next pull request 2023-08-08
Merge tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter updates for net-next

The first 4 patches, from Yue Haibing, remove unused prototypes in
various netfilter headers.

The last patch makes nfnetlink_log always include a packet timestamp;
until now it was only included if the skb already had one assigned.
From Maciej Żenczykowski.

* tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nfnetlink_log: always add a timestamp
  netfilter: h323: Remove unused function declarations
  netfilter: conntrack: Remove unused function declarations
  netfilter: helper: Remove unused function declarations
  netfilter: gre: Remove unused function declaration nf_ct_gre_keymap_flush()
====================

Link: https://lore.kernel.org/r/20230808124159.19046-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:52:36 -07:00
Ido Schimmel 8743aeff5b nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.

The nexthop bucket dump callback always returns a positive number if
nexthop buckets were filled in the provided skb, even if the dump is
complete. This means that a dump will span at least two recvmsg() calls
as long as nexthop buckets are present. In the last recvmsg() call the
dump callback will not fill in any nexthop buckets because the previous
call indicated that the dump should restart from the last dumped nexthop
ID plus one.

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id 10 group 1 type resilient buckets 2
 # strace -e sendto,recvmsg -s 5 ip nexthop bucket
 sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
 id 10 index 0 idle_time 6.66 nhid 1
 id 10 index 1 idle_time 6.66 nhid 1
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
 +++ exited with 0 +++

This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
 # ip nexthop bucket
 id 4294967295 index 0 idle_time 5.55 nhid 1
 id 4294967295 index 1 idle_time 5.55 nhid 1
 id 4294967295 index 0 idle_time 5.55 nhid 1
 id 4294967295 index 1 idle_time 5.55 nhid 1
 [...]

Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
 # strace -e sendto,recvmsg -s 5 ip nexthop bucket
 sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
 id 4294967295 index 0 idle_time 6.61 nhid 1
 id 4294967295 index 1 idle_time 6.61 nhid 1
 +++ exited with 0 +++

Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
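
The return-code contract and the ID wraparound described above can be
modelled in plain C. This is a simplified sketch, not the nexthop code:
every demo_* name is hypothetical, "cap" stands in for the skb size limit,
and "calls" counts dump callback invocations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_dump_ctx {
	uint32_t start_id;	/* ID to resume from on the next call */
	size_t emitted;		/* entries written across all calls */
};

struct demo_dump_result {
	size_t calls;
	size_t emitted;
};

/* One callback invocation over IDs sorted in ascending order.
 * Returns > 0 when more data may follow, 0 when the dump is complete. */
static int demo_dump_pass(const uint32_t *ids, size_t n, size_t cap,
			  struct demo_dump_ctx *ctx, bool fixed)
{
	size_t filled = 0;

	for (size_t i = 0; i < n; i++) {
		if (ids[i] < ctx->start_id)
			continue;		/* already dumped earlier */
		if (filled == cap)
			return (int)filled;	/* buffer full: more to do */
		ctx->emitted++;
		filled++;
		/* uint32_t arithmetic: 0xffffffff + 1 wraps to 0 */
		ctx->start_id = (uint32_t)(ids[i] + 1u);
	}
	/* Walk finished. The old behaviour still reports "more" whenever
	 * something was filled; the fixed behaviour reports completion. */
	return fixed ? 0 : (int)filled;
}

/* Invoke the callback until it reports completion or max_calls is hit. */
struct demo_dump_result demo_dump_all(const uint32_t *ids, size_t n,
				      size_t cap, bool fixed,
				      size_t max_calls)
{
	struct demo_dump_ctx ctx = { 0, 0 };
	struct demo_dump_result res = { 0, 0 };

	assert(cap > 0);
	while (res.calls < max_calls) {
		res.calls++;
		if (demo_dump_pass(ids, n, cap, &ctx, fixed) == 0)
			break;
	}
	res.emitted = ctx.emitted;
	return res;
}
```

With IDs {1, 0xffffffff} the unfixed variant never returns zero and only
stops at max_calls, re-emitting the same entries on every pass; the fixed
variant finishes after a single invocation.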

Add a test that fails before the fix:

 # ./fib_nexthops.sh -t basic_res
 [...]
 TEST: Maximum nexthop ID dump                                       [FAIL]
 [...]

And passes after it:

 # ./fib_nexthops.sh -t basic_res
 [...]
 TEST: Maximum nexthop ID dump                                       [ OK ]
 [...]

Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:45:12 -07:00
Ido Schimmel f10d3d9df4 nexthop: Make nexthop bucket dump more efficient
rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging
to a specific resilient nexthop group. The function returns a positive
return code (the skb length) upon both success and failure.

The above behavior is problematic. When a complete nexthop bucket dump
is requested, the function that walks the different nexthops treats the
non-zero return code as an error. This causes buckets belonging to
different resilient nexthop groups to be dumped using different buffers
even if they can all fit in the same buffer:

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id 10 group 1 type resilient buckets 1
 # ip nexthop add id 20 group 1 type resilient buckets 1
 # strace -e recvmsg -s 0 ip nexthop bucket
 [...]
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
 id 10 index 0 idle_time 10.27 nhid 1
 [...]
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
 id 20 index 0 idle_time 6.44 nhid 1
 [...]

Fix by only returning a non-zero return code when an error occurred and
restarting the dump from the bucket index we failed to fill in. This
allows buckets belonging to different resilient nexthop groups to be
dumped using the same buffer:

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id 10 group 1 type resilient buckets 1
 # ip nexthop add id 20 group 1 type resilient buckets 1
 # strace -e recvmsg -s 0 ip nexthop bucket
 [...]
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
 id 10 index 0 idle_time 30.21 nhid 1
 id 20 index 0 idle_time 26.7 nhid 1
 [...]

While this change is more of a performance improvement than an actual
bug fix, it is a prerequisite for a subsequent patch that does fix a
bug.

Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:45:04 -07:00
Ido Schimmel 913f60cacd nexthop: Fix infinite nexthop dump when using maximum nexthop ID
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.

The nexthop dump callback always returns a positive number if nexthops
were filled in the provided skb, even if the dump is complete. This
means that a dump will span at least two recvmsg() calls as long as
nexthops are present. In the last recvmsg() call the dump callback will
not fill in any nexthops because the previous call indicated that the
dump should restart from the last dumped nexthop ID plus one.

 # ip nexthop add id 1 blackhole
 # strace -e sendto,recvmsg -s 5 ip nexthop
 sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
 id 1 blackhole
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
 +++ exited with 0 +++

This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:

 # ip nexthop add id $((2**32-1)) blackhole
 # ip nexthop
 id 4294967295 blackhole
 id 4294967295 blackhole
 [...]

Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:

 # ip nexthop add id $((2**32-1)) blackhole
 # strace -e sendto,recvmsg -s 5 ip nexthop
 sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56
 id 4294967295 blackhole
 +++ exited with 0 +++

Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.

Add a test that fails before the fix:

 # ./fib_nexthops.sh -t basic
 [...]
 TEST: Maximum nexthop ID dump                                       [FAIL]
 [...]

And passes after it:

 # ./fib_nexthops.sh -t basic
 [...]
 TEST: Maximum nexthop ID dump                                       [ OK ]
 [...]

Fixes: ab84be7e54 ("net: Initial nexthop code")
Reported-by: Petr Machata <petrm@nvidia.com>
Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:44:36 -07:00
Vlad Buslov 718cb09aaa vlan: Fix VLAN 0 memory leak
The referenced commit intended to fix a memory leak of VLAN 0, which is
implicitly created on devices with the NETIF_F_HW_VLAN_CTAG_FILTER feature.
However, it doesn't take into account that the feature can be re-set during
the netdevice lifetime, which causes a memory leak if the feature is
disabled at the time the device is deleted, as illustrated by [0]. Fix the
leak by unconditionally deleting VLAN 0 on the NETDEV_DOWN event.

[0]:
> modprobe 8021q
> ip l set dev eth2 up
> ethtool -K eth2 rx-vlan-filter off
> modprobe -r mlx5_ib
> modprobe -r mlx5_core
> cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888103dcd900 (size 256):
  comm "ip", pid 1490, jiffies 4294907305 (age 325.364s)
  hex dump (first 32 bytes):
    00 80 5d 03 81 88 ff ff 00 00 00 00 00 00 00 00  ..].............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000899f3bb9>] kmalloc_trace+0x25/0x80
    [<000000002889a7a2>] vlan_vid_add+0xa0/0x210
    [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q]
    [<000000009a0716b1>] notifier_call_chain+0x35/0xb0
    [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0
    [<0000000053d2b05d>] dev_change_flags+0x4d/0x60
    [<00000000982807e9>] do_setlink+0x28d/0x10a0
    [<0000000058c1be00>] __rtnl_newlink+0x545/0x980
    [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70
    [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390
    [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100
    [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0
    [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0
    [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60
    [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200
    [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0
unreferenced object 0xffff88813354fde0 (size 32):
  comm "ip", pid 1490, jiffies 4294907305 (age 325.364s)
  hex dump (first 32 bytes):
    a0 d9 dc 03 81 88 ff ff a0 d9 dc 03 81 88 ff ff  ................
    81 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000899f3bb9>] kmalloc_trace+0x25/0x80
    [<000000002da64724>] vlan_vid_add+0xdf/0x210
    [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q]
    [<000000009a0716b1>] notifier_call_chain+0x35/0xb0
    [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0
    [<0000000053d2b05d>] dev_change_flags+0x4d/0x60
    [<00000000982807e9>] do_setlink+0x28d/0x10a0
    [<0000000058c1be00>] __rtnl_newlink+0x545/0x980
    [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70
    [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390
    [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100
    [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0
    [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0
    [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60
    [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200
    [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0

Fixes: efc73f4bbc ("net: Fix memory leak - vlan_info struct")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20230808093521.1468929-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:44:27 -07:00
Russell King (Oracle) 145622771d net: dsa: mark parsed interface mode for legacy switch drivers
If we successfully parsed an interface mode with a legacy switch
driver, populate that mode into phylink's supported interfaces rather
than defaulting to the internal and gmii interfaces.

This hasn't caused an issue so far, because when the interface doesn't
match a supported one, phylink_validate() doesn't clear the supported
mask, but instead returns -EINVAL. phylink_parse_fixedlink() doesn't
check this return value, and merely relies on the supported ethtool
link modes mask being cleared. Therefore, the fixed link settings end
up being allowed despite validation failing.

Before this causes a problem, arrange for DSA to more accurately
populate phylink's supported interfaces mask so validation can
correctly succeed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1qTKdM-003Cpx-Eh@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:08:09 -07:00
Jiri Pirko 832140804e devlink: clear flag on port register error path
When xarray insertion fails, clear the flag.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230808082020.1363497-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:06:21 -07:00
Yue Haibing ca76b386d4 tipc: Remove unused declaration tipc_link_build_bc_sync_msg()
Commit 5266698661 ("tipc: let broadcast packet reception use new link receive function")
declared but never implemented this.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230807142926.45752-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09 13:03:14 -07:00
Breno Leitao 8e9fad0e70 io_uring: Add io_uring command support for sockets
Enable io_uring commands on network sockets. Create two new
SOCKET_URING_OP commands that will operate on sockets.

In order to call ioctl on sockets, use the file_operations->io_uring_cmd
callbacks, and map it to a uring socket function, which handles the
SOCKET_URING_OP accordingly, and calls socket ioctls.

This patch was tested by creating a new test case in liburing.
Link: https://github.com/leitao/liburing/tree/io_uring_cmd

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 10:46:15 -06:00
Remi Pommarel 421d467dc2 batman-adv: Fix batadv_v_ogm_aggr_send memory leak
When batadv_v_ogm_aggr_send is called for an inactive interface, the skb
is silently dropped by batadv_v_ogm_send_to_if() but never freed causing
the following memory leak:

  unreferenced object 0xffff00000c164800 (size 512):
    comm "kworker/u8:1", pid 2648, jiffies 4295122303 (age 97.656s)
    hex dump (first 32 bytes):
      00 80 af 09 00 00 ff ff e1 09 00 00 75 01 60 83  ............u.`.
      1f 00 00 00 b8 00 00 00 15 00 05 00 da e3 d3 64  ...............d
    backtrace:
      [<0000000007ad20f6>] __kmalloc_track_caller+0x1a8/0x310
      [<00000000d1029e55>] kmalloc_reserve.constprop.0+0x70/0x13c
      [<000000008b9d4183>] __alloc_skb+0xec/0x1fc
      [<00000000c7af5051>] __netdev_alloc_skb+0x48/0x23c
      [<00000000642ee5f5>] batadv_v_ogm_aggr_send+0x50/0x36c
      [<0000000088660bd7>] batadv_v_ogm_aggr_work+0x24/0x40
      [<0000000042fc2606>] process_one_work+0x3b0/0x610
      [<000000002f2a0b1c>] worker_thread+0xa0/0x690
      [<0000000059fae5d4>] kthread+0x1fc/0x210
      [<000000000c587d3a>] ret_from_fork+0x10/0x20

Free the skb in that case to fix this leak.
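
The ownership rule behind the fix, sketched in userspace C under
assumptions: demo_buf stands in for an skb, demo_live_bufs is a
hypothetical leak counter, and transmission is modelled as simply
consuming the buffer. The send helper takes ownership on every path, so
the "interface inactive" branch has to free as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_buf {
	size_t len;
};

static int demo_live_bufs;	/* buffers allocated but not yet freed */

static struct demo_buf *demo_buf_alloc(size_t len)
{
	struct demo_buf *buf = malloc(sizeof(*buf));

	if (buf) {
		buf->len = len;
		demo_live_bufs++;
	}
	return buf;
}

static void demo_buf_free(struct demo_buf *buf)
{
	assert(buf != NULL);
	free(buf);
	demo_live_bufs--;
}

/* Takes ownership of buf on every path. */
static void demo_send_to_if(struct demo_buf *buf, bool if_active)
{
	if (!if_active) {
		demo_buf_free(buf);	/* dropping must also free */
		return;
	}
	/* Transmission is modelled as consuming the buffer. */
	demo_buf_free(buf);
}

/* Returns the number of buffers still outstanding, or -1 if the
 * allocation failed. */
int demo_aggr_send(bool if_active)
{
	struct demo_buf *buf = demo_buf_alloc(64);

	if (!buf)
		return -1;
	demo_send_to_if(buf, if_active);
	return demo_live_bufs;
}
```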

Cc: stable@vger.kernel.org
Fixes: 0da0035942 ("batman-adv: OGMv2 - add basic infrastructure")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-09 17:33:03 +02:00
Keith Yeo 6311071a05 wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the
number of MBSSID elements in the nested netlink attribute attrs, which can
lead to an integer overflow if a user of the nl80211 interface specifies
256 or more elements in the corresponding attribute in userspace. The
integer overflow can lead to a heap buffer overflow as num_elems determines
the size of the trailing array in elems, and this array is thereafter
written to for each element in attrs.

Note that this vulnerability only affects devices with the
wiphy->mbssid_max_interfaces member set for the wireless physical device
struct in the device driver, and can only be triggered by a process with
CAP_NET_ADMIN capabilities.

Fix this by checking for a maximum of 255 elements in attrs.
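
A standalone sketch of the overflow and of the bound check, assuming a
plain loop in place of the nested-attribute iteration; the demo_*
functions are hypothetical and only model the counting step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Unchecked count, as before the fix: a u8 counter silently wraps. */
uint8_t demo_count_unchecked(size_t nattrs)
{
	uint8_t num_elems = 0;

	for (size_t i = 0; i < nattrs; i++)
		num_elems++;		/* 255 + 1 becomes 0 */
	return num_elems;
}

/* Checked count: refuse the request once the counter would overflow. */
int demo_count_checked(size_t nattrs, uint8_t *out)
{
	uint8_t num_elems = 0;

	assert(out != NULL);
	for (size_t i = 0; i < nattrs; i++) {
		if (num_elems >= 255)
			return -EINVAL;
		num_elems++;
	}
	*out = num_elems;
	return 0;
}
```

In the unchecked variant 256 attributes yield a count of 0, so an array
sized from that count is far too small for the writes that follow.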

Cc: stable@vger.kernel.org
Fixes: dc1e3cb8da ("nl80211: MBSSID and EMA support in AP mode")
Signed-off-by: Keith Yeo <keithyjy@gmail.com>
Link: https://lore.kernel.org/r/20230731034719.77206-1-keithyjy@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-08-09 14:43:35 +02:00
Florian Westphal 24138933b9 netfilter: nf_tables: don't skip expired elements during walk
There is an asymmetry between commit/abort and preparation phase if the
following conditions are met:

1. set is a verdict map ("1.2.3.4 : jump foo")
2. timeouts are enabled

In this case, the following sequence is problematic:

1. element E in set S refers to chain C
2. userspace requests removal of set S
3. kernel does a set walk to decrement chain->use count for all elements
   from preparation phase
4. kernel does another set walk to remove elements from the commit phase
   (or another walk to do a chain->use increment for all elements from
    abort phase)

If E has already expired in 1), it will be ignored during list walk, so its use count
won't have been changed.

Then, when set is culled, ->destroy callback will zap the element via
nf_tables_set_elem_destroy(), but this function is only safe for
elements that have been deactivated earlier from the preparation phase:
lack of earlier deactivate removes the element but leaks the chain use
count, which results in a WARN splat when the chain gets removed later,
plus a leak of the nft_chain structure.

Update pipapo_get() not to skip expired elements, otherwise flush
command reports bogus ENOENT errors.
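
The asymmetry can be reduced to a few lines of C. This is a toy model
under assumptions, not the nf_tables set backends: demo_chain and
demo_elem are hypothetical, and skip_expired selects the old walk
behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_chain {
	int use;		/* number of elements referring to the chain */
};

struct demo_elem {
	struct demo_chain *chain;
	bool expired;
};

/* Preparation-phase walk: drop one chain reference per element.
 * skip_expired = true models the behaviour before the fix. */
void demo_deactivate_walk(struct demo_elem *elems, size_t n, bool skip_expired)
{
	for (size_t i = 0; i < n; i++) {
		if (skip_expired && elems[i].expired)
			continue;	/* reference is never dropped */
		assert(elems[i].chain->use > 0);
		elems[i].chain->use--;
	}
}
```

With one live and one expired element pointing at the same chain, the
skipping walk leaves use at 1 after the set is gone, which corresponds to
the leaked chain use count described above.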

Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-09 14:39:28 +02:00
Gerd Bayer 30c3c4a449 net/smc: Use correct buffer sizes when switching between TCP and SMC
Tuning of the effective buffer size through setsockopts was working for
SMC traffic only but not for TCP fall-back connections even before
commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and
make them tunable"). That change made it apparent that TCP fall-back
connections would use net.smc.[rw]mem as buffer size instead of
net.ipv4.tcp_[rw]mem.

Amend the code that copies attributes between the (TCP) clcsock and the
SMC socket and adjust buffer sizes appropriately:
- Copy over sk_userlocks so that both sockets agree on whether tuning
  via setsockopt is active.
- When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
  setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
- Likewise, use either values from setsockopt or from sysctl for SMC
  (duplicated) on successful SMC connect.

In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
is taken care of by the attribute copy.

Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-09 11:20:29 +01:00
Gerd Bayer 833bac7ec3 net/smc: Fix setsockopt and sysctl to specify same buffer size again
Commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.

Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.

Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.
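
The arithmetic above can be sketched as a user-space model. The helper names and SMC_BUF_DEFAULT are assumptions made for illustration, not the identifiers used in net/smc, and the minimum/maximum clamping that setsockopt() applies is left out.

```c
#include <assert.h>
#include <limits.h>

#define SMC_BUF_DEFAULT (64 * 1024)	/* hypothetical name for the 64kB default */

/* setsockopt(SO_SNDBUF/SO_RCVBUF) stores twice the requested size. */
static int sk_buf_from_setsockopt(int requested)
{
	assert(requested >= 0 && requested <= INT_MAX / 2);
	return requested * 2;
}

/* At socket creation the sysctl value is doubled as well, so both tuning
 * methods agree. Values above INT_MAX / 2 are rejected (-1) because
 * doubling them would overflow. */
static int sk_buf_from_sysctl(int sysctl_val)
{
	if (sysctl_val < 0 || sysctl_val > INT_MAX / 2)
		return -1;
	return sysctl_val * 2;
}

/* The buffer allocator starts its search from half of sk_sndbuf/sk_rcvbuf. */
static int effective_buf_size(int sk_buf)
{
	return sk_buf / 2;
}
```

In this model a user asking for 128kB via setsockopt() and a system configured with the 64kB sysctl default both end up with exactly the size they specified.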

While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.

Background:

Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as initial value to search for appropriate buffers. If the search
resorted to using a bigger buffer when all buffers of the specified
size were busy, the duplicate of the used effective buffer size is
stored back to sk_rcvbuf/sk_sndbuf.

When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:

[...]
       SO_SNDBUF
              Sets  or  gets the maximum socket send buffer in bytes.
              The kernel doubles this value (to allow space for book‐
              keeping  overhead)  when it is set using setsockopt(2),
              and this doubled value is  returned  by  getsockopt(2).
              The     default     value     is     set     by     the
              /proc/sys/net/core/wmem_default file  and  the  maximum
              allowed value is set by the /proc/sys/net/core/wmem_max
              file.  The minimum (doubled) value for this  option  is
              2048.
[...]

Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Co-developed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-09 11:20:28 +01:00
David Rheinsberg b6f79e826f net/unix: use consistent error code in SO_PEERPIDFD
Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
rather than ESRCH if a socket type does not support remote peer-PID
queries.

Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
not an AF_UNIX socket. This is quite unexpected, given that one would
assume ESRCH means the peer process already exited and thus cannot be
found. However, in that case the sockopt actually returns EINVAL (via
pidfd_prepare()). This is rather inconsistent with other syscalls, which
usually return ESRCH if a given PID refers to a non-existent process.

This changes SO_PEERPIDFD to return ENODATA instead. This is also what
SO_PEERGROUPS returns, and thus keeps a consistent behavior across
sockopts.

Note that this code is returned in 2 cases: First, if the socket type is
not AF_UNIX, and secondly if the socket was not yet connected. In both
cases ENODATA seems suitable.

Signed-off-by: David Rheinsberg <david@readahead.eu>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Fixes: 7b26952a91 ("net: core: add getsockopt SO_PEERPIDFD")
Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-08 15:56:48 -07:00
Hannes Reinecke ba4a734e1a net/tls: avoid TCP window full during ->read_sock()
When flushing the backlog after decoding a record we don't really
know how much data the caller wants us to evaluate, so use INT_MAX
and 0 as arguments to tls_read_flush_backlog() to ensure we flush
at 128k of data. Otherwise we might be reading too much data and
trigger a TCP window full.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20230807071022.10091-1-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-08 15:53:49 -07:00
Ziyang Xuan 794529c448 ipv6: exthdrs: Replace opencoded swap() implementation
Coccinelle reports the following warning:
net/ipv6/exthdrs.c:800:29-30: WARNING opportunity for swap()

Use swap() to replace opencoded implementation.
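
For readers unfamiliar with the helper, here is a minimal user-space sketch of the pattern. The macro is modelled on the kernel's swap() and relies on the GNU typeof extension; reverse_ints() is a hypothetical caller invented for illustration and is not the code in net/ipv6/exthdrs.c.

```c
#include <assert.h>

/* Modelled on the kernel's swap() helper (GNU typeof extension). */
#define swap(a, b) \
	do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Hypothetical caller: reverse an array in place. The open-coded form
 * would need an explicit temporary and three assignments per step. */
static void reverse_ints(int *v, int n)
{
	assert(n >= 0);
	for (int i = 0, j = n - 1; i < j; i++, j--)
		swap(v[i], v[j]);
}
```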

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230807020947.1991716-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-08 15:36:47 -07:00
xu xin c67180efc5 net/ipv4: return the real errno instead of -EINVAL
For now, no matter what error pointer ip_neigh_for_gw() returns,
ip_finish_output2() always returns -EINVAL, which may mislead upper-layer
users.

For example, an application uses sendto() to send a UDP packet, but when the
neighbor table overflows, sendto() gets -EINVAL, which causes users to
waste a lot of time checking parameters for errors.

Return the real errno instead of -EINVAL.
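
The difference can be sketched in user space. ERR_PTR(), PTR_ERR() and IS_ERR() below are simplified re-implementations of the kernel helpers; lookup_neigh() and the two finish_output_*() functions are hypothetical stand-ins, and ENOBUFS is only an illustrative error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical neighbour lookup: fails with -fail_with, or succeeds if 0. */
static void *lookup_neigh(int fail_with)
{
	static int entry;

	assert(fail_with >= 0 && fail_with <= MAX_ERRNO);
	return fail_with ? ERR_PTR(-fail_with) : (void *)&entry;
}

/* Old behaviour: every failure is flattened to -EINVAL. */
static int finish_output_old(int fail_with)
{
	void *neigh = lookup_neigh(fail_with);

	return IS_ERR(neigh) ? -EINVAL : 0;
}

/* New behaviour: the errno encoded in the pointer is passed through. */
static int finish_output_new(int fail_with)
{
	void *neigh = lookup_neigh(fail_with);

	return IS_ERR(neigh) ? (int)PTR_ERR(neigh) : 0;
}
```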

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Si Hao <si.hao@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20230807015408.248237-1-xu.xin16@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-08 15:35:51 -07:00
Maciej Żenczykowski 1d85594fd3 netfilter: nfnetlink_log: always add a timestamp
Compared to all the other work we're already doing to deliver
an skb to userspace this is very cheap - at worst an extra
call to ktime_get_real() - and very useful.

(and indeed it may even be cheaper if we're running from other hooks)

(background: Android occasionally logs packets which
caused wake from sleep/suspend and we'd like to have
timestamps reliably associated with these events)

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-08 13:03:36 +02:00
Andrew Kanner d14eea09ed net: core: remove unnecessary frame_sz check in bpf_xdp_adjust_tail()
Syzkaller reported the following issue:
=======================================
Too BIG xdp->frame_sz = 131072
WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
  ____bpf_xdp_adjust_tail net/core/filter.c:4121 [inline]
WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
  bpf_xdp_adjust_tail+0x466/0xa10 net/core/filter.c:4103
...
Call Trace:
 <TASK>
 bpf_prog_4add87e5301a4105+0x1a/0x1c
 __bpf_prog_run include/linux/filter.h:600 [inline]
 bpf_prog_run_xdp include/linux/filter.h:775 [inline]
 bpf_prog_run_generic_xdp+0x57e/0x11e0 net/core/dev.c:4721
 netif_receive_generic_xdp net/core/dev.c:4807 [inline]
 do_xdp_generic+0x35c/0x770 net/core/dev.c:4866
 tun_get_user+0x2340/0x3ca0 drivers/net/tun.c:1919
 tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2043
 call_write_iter include/linux/fs.h:1871 [inline]
 new_sync_write fs/read_write.c:491 [inline]
 vfs_write+0x650/0xe40 fs/read_write.c:584
 ksys_write+0x12f/0x250 fs/read_write.c:637
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

xdp->frame_sz > PAGE_SIZE check was introduced in commit c8741e2bfe
("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper
Dangaard Brouer <jbrouer@redhat.com> noted that after introducing the
xdp_init_buff() which all XDP drivers use - it's safe to remove this
check. The original intent was to catch cases where XDP drivers have
not been updated to use xdp.frame_sz, but that is no longer a concern
(since xdp_init_buff).

Running the initial syzkaller repro it was discovered that the
contiguous physical memory allocation is used for both xdp paths in
tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also
stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can
work on higher order pages, as long as this is contiguous physical
memory (e.g. a page).

Reported-and-tested-by: syzbot+f817490f5bd20541b90a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000774b9205f1d8a80d@google.com/T/
Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a
Link: https://lore.kernel.org/all/20230725155403.796-1-andrew.kanner@gmail.com/T/
Fixes: 43b5169d83 ("net, xdp: Introduce xdp_init_buff utility routine")
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20230803190316.2380231-1-andrew.kanner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 19:14:41 -07:00
Alexander Lobakin 4a36d0180c net: skbuff: always try to recycle PP pages directly when in softirq
Commit 8c48eea3ad ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple of other major ones.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through ptr_ring and locks, although most of the time we're actually inside
the NAPI polling this PP is linked with, so that it would be perfectly
safe to recycle pages directly.
Let's address such. If @napi_safe is true, we're fine, don't change
anything for this path. But if it's false, check whether we are in the
softirq context. It will most likely be so and then if ->list_owner
is our current CPU, we're good to use direct recycling, even though
@napi_safe is false -- concurrent access is excluded. in_softirq()
protection is needed mostly because we can hit this place in the
process context (not the hardirq though).
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
alloc_slow counter doesn't change most of the time, which means the
MM layer is not even called to allocate any new pages.
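
The decision described above can be condensed into a small predicate. This is a sketch, not the skbuff.c code: in the kernel the inputs would come from in_softirq(), smp_processor_id() and the NAPI's ->list_owner, whereas here they are plain parameters of a hypothetical allow_direct_recycle(), with list_owner == -1 standing for "no NAPI linked".

```c
#include <assert.h>
#include <stdbool.h>

/* Direct recycling is considered when the caller vouches for NAPI safety
 * or when we are in softirq context; it is allowed only if the linked
 * NAPI is owned by the CPU we are currently running on. */
static bool allow_direct_recycle(bool napi_safe, bool in_softirq,
				 int list_owner, int this_cpu)
{
	assert(this_cpu >= 0);
	if (!napi_safe && !in_softirq)
		return false;	/* e.g. process context: use the ptr_ring */
	return list_owner == this_cpu;
}
```

The interesting case is napi_safe == false with in_softirq == true and a matching owner, which now takes the direct path.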

Suggested-by: Jakub Kicinski <kuba@kernel.org> # in_softirq()
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 13:05:53 -07:00
Jakub Kicinski ff4e538c8c page_pool: add a lockdep check for recycling in hardirq
Page pool use in hardirq is prohibited, add debug checks
to catch misuses. IIRC we previously discussed using
DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
that people will have DEBUG_NET enabled in perf testing.
I don't think anyone enables lockdep in perf testing,
so use lockdep to avoid pushback and arguing :)

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 13:05:53 -07:00
Alexander Lobakin 5b899c33b3 net: skbuff: avoid accessing page_pool if !napi_safe when returning page
Currently, pp->p.napi is always read, but the actual variable it gets
assigned to is only read when @napi_safe is true. For the !napi_safe
cases, of which there are still plenty, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline access. For example, when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the condition block itself. Constify it while we are
here, because why not.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 13:05:53 -07:00
Alexander Lobakin 75eaf63ea7 net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
Currently, touching <net/page_pool/types.h> triggers a rebuild of more
than half of the kernel. That's because it's included in
<linux/skbuff.h>. And each new include to page_pool/types.h adds more
[useless] data for the toolchain to process per each source file from
that pile.

In commit 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab ("page_pool: keep pp info as
long as page pool owns the page") one of the calls was removed, so only
one was left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having the very niche page_pool/types.h included only for that
looks like overkill.

As %PP_SIGNATURE is not local to page_pool.c (was only in the
early submissions), nothing holds this function there. Teleport
page_pool_return_skb_page() to skbuff.c, just next to the main consumer,
skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't
work with skbs at all and the former name tells nothing. The #if guards
here are only to not compile and have it in the vmlinux when not needed
-- both call sites are already guarded.
Now, touching page_pool/types.h only triggers rebuilding of the drivers
using it and a couple of core networking files.

Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 13:05:53 -07:00
Yunsheng Lin a9ca9f9cef page_pool: split types and declarations from page_pool.h
Split types and pure function declarations from page_pool.h
and add them in page_pool/types.h, so that C sources can
include page_pool.h and headers should generally only include
page_pool/types.h as suggested by Jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com
[Jakub: change microsoft/mana, fix kdoc paths in Documentation]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 13:05:19 -07:00
Yue Haibing f6ecb68b38 net/tls: Remove unused function declarations
Commit 3c4d755915 ("tls: kernel TLS support") declared but never implemented
these functions.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-07 08:53:54 +01:00
Vladimir Oltean c35e927cbe net: omit ndo_hwtstamp_get() call when possible in dev_set_hwtstamp_phylib()
Setting dev->priv_flags & IFF_SEE_ALL_HWTSTAMP_REQUESTS is only legal
for drivers which were converted to ndo_hwtstamp_get() and
ndo_hwtstamp_set(), and it is only there that we call ndo_hwtstamp_set()
for a request that otherwise goes to phylib (for stuff like packet traps,
which need to be undone if phylib failed, hence the old_cfg logic).

The problem is that we end up calling ndo_hwtstamp_get() when we don't
need to (even if the SIOCSHWTSTAMP wasn't intended for phylib, or if it
was, but the driver didn't set IFF_SEE_ALL_HWTSTAMP_REQUESTS). For those
unnecessary conditions, we share a code path with virtual drivers (vlan,
macvlan, bonding) where ndo_hwtstamp_get() is implemented as
generic_hwtstamp_get_lower(), and may be resolved through
generic_hwtstamp_ioctl_lower() if the lower device is unconverted.

I.e. this situation:

$ ip link add link eno0 name eno0.100 type vlan id 100
$ hwstamp_ctl -i eno0.100 -t 1

We are unprepared to deal with this, because if ndo_hwtstamp_get() is
resolved through a legacy ndo_eth_ioctl(SIOCGHWTSTAMP) lower_dev
implementation, that needs a non-NULL old_cfg.ifr pointer, and we don't
have it.

But we don't even need to deal with it either. In the general case,
drivers may not even implement SIOCGHWTSTAMP handling, only SIOCSHWTSTAMP,
so it makes sense to completely avoid a SIOCGHWTSTAMP call if we can.

The solution is to split the single "if" condition into 3 smaller ones,
thus separating the decision to call ndo_hwtstamp_get() from the
decision to call ndo_hwtstamp_set(). The third "if" condition is
identical to the first one, and both are subsets of the second one.
Thus, the "cfg" argument of kernel_hwtstamp_config_changed() is always
valid.

Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iLOspJsvjPj+y8jikg7erXDomWe8sqHMdfL_2LQSFrPAg@mail.gmail.com/
Fixes: fd770e856e ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 13:25:10 +01:00
Jakub Kicinski 6b47808f22 net: tls: avoid discarding data on record close
TLS records end with a 16B tag. For TLS device offload we only
need to make space for this tag in the stream, the device will
generate and replace it with the actual calculated tag.

A long time ago the code would just re-reference the head frag
which mostly worked but was suboptimal because it prevented TCP
from combining the record into a single skb frag. I'm not sure
if it was correct as the first frag may be shorter than the tag.

The commit under fixes tried to replace that with using the page
frag and if the allocation failed rolling back the data, if record
was long enough. It achieves better fragment coalescing but is
also buggy.

We don't roll back the iterator, so unless we're at the end of
send we'll skip the data we designated as tag and start the
next record as if the rollback never happened.
There's also the possibility that the record was constructed
with MSG_MORE and the data came from a different syscall and
we already told the user space that we "got it".

Allocate a single dummy page and use it as fallback.

Found by code inspection, and proven by forcing allocation
failures.

Fixes: e7b159a48b ("net/tls: remove the record tail optimization")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:32:18 +01:00
Eric Dumazet 6e97ba552b tcp: set TCP_DEFER_ACCEPT locklessly
rskq_defer_accept field can be read/written without
the need of holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:24:56 +01:00
Eric Dumazet a81722ddd7 tcp: set TCP_LINGER2 locklessly
tp->linger2 can be set locklessly as long as readers
use READ_ONCE().
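
The pattern shared by this series of lockless setters can be sketched in user space as follows. The two macros are simplified versions of the kernel's READ_ONCE()/WRITE_ONCE() (a volatile access through the GNU typeof extension); struct fake_tcp_sock and the accessor names are hypothetical. The annotations keep the compiler from tearing, fusing or re-reading the access; they do not order it against other memory operations.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space versions of the kernel macros. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for the one field of interest. */
struct fake_tcp_sock { int linger2; };

/* Writer side: no socket lock is taken, only a single marked store. */
static void set_linger2(struct fake_tcp_sock *tp, int val)
{
	assert(tp != NULL);
	WRITE_ONCE(tp->linger2, val);
}

/* Reader side: a single marked load that may race with the store. */
static int get_linger2(struct fake_tcp_sock *tp)
{
	assert(tp != NULL);
	return READ_ONCE(tp->linger2);
}
```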

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:24:55 +01:00
Eric Dumazet 84485080cb tcp: set TCP_KEEPCNT locklessly
tp->keepalive_probes can be set locklessly, readers
are already taking care of this field being potentially
set by other threads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:24:55 +01:00
Eric Dumazet 6fd70a6b4e tcp: set TCP_KEEPINTVL locklessly
tp->keepalive_intvl can be set locklessly, readers
are already taking care of this field being potentially
set by other threads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:24:55 +01:00
Eric Dumazet d58f2e15aa tcp: set TCP_USER_TIMEOUT locklessly
icsk->icsk_user_timeout can be set locklessly,
if all read sides use READ_ONCE().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:24:55 +01:00
Eric Dumazet d44fd4a767 tcp: set TCP_SYNCNT locklessly
icsk->icsk_syn_retries can safely be set without locking the socket.

We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer()
and tcp_write_timeout().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:24:55 +01:00
Remi Pommarel d25ddb7e78 batman-adv: Fix TT global entry leak when client roamed back
When a client roamed back to a node before it got time to destroy the
pending local entry (i.e. within the same originator interval) the old
global one is directly removed from hash table and left as such.

But because this entry had an extra reference taken at lookup (i.e using
batadv_tt_global_hash_find) there is no way its memory will be reclaimed
at any time causing the following memory leak:

  unreferenced object 0xffff0000073c8000 (size 18560):
    comm "softirq", pid 0, jiffies 4294907738 (age 228.644s)
    hex dump (first 32 bytes):
      06 31 ac 12 c7 7a 05 00 01 00 00 00 00 00 00 00  .1...z..........
      2c ad be 08 00 80 ff ff 6c b6 be 08 00 80 ff ff  ,.......l.......
    backtrace:
      [<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300
      [<000000000ff2fdbc>] batadv_tt_global_add+0x700/0xe20
      [<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790
      [<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110
      [<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10
      [<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0
      [<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4
      [<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0
      [<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90
      [<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74
      [<000000000f39a009>] __netif_receive_skb+0x48/0xe0
      [<00000000f2cd8888>] process_backlog+0x174/0x344
      [<00000000507d6564>] __napi_poll+0x58/0x1f4
      [<00000000b64ef9eb>] net_rx_action+0x504/0x590
      [<00000000056fa5e4>] _stext+0x1b8/0x418
      [<00000000878879d6>] run_ksoftirqd+0x74/0xa4
  unreferenced object 0xffff00000bae1a80 (size 56):
    comm "softirq", pid 0, jiffies 4294910888 (age 216.092s)
    hex dump (first 32 bytes):
      00 78 b1 0b 00 00 ff ff 0d 50 00 00 00 00 00 00  .x.......P......
      00 00 00 00 00 00 00 00 50 c8 3c 07 00 00 ff ff  ........P.<.....
    backtrace:
      [<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300
      [<00000000d9aaa49e>] batadv_tt_global_add+0x53c/0xe20
      [<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790
      [<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110
      [<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10
      [<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0
      [<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4
      [<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0
      [<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90
      [<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74
      [<000000000f39a009>] __netif_receive_skb+0x48/0xe0
      [<00000000f2cd8888>] process_backlog+0x174/0x344
      [<00000000507d6564>] __napi_poll+0x58/0x1f4
      [<00000000b64ef9eb>] net_rx_action+0x504/0x590
      [<00000000056fa5e4>] _stext+0x1b8/0x418
      [<00000000878879d6>] run_ksoftirqd+0x74/0xa4

Releasing the extra reference from batadv_tt_global_hash_find even at
roam back when batadv_tt_global_free is called fixes this memory leak.
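
The reference imbalance can be modelled in a few lines. This is a sketch with hypothetical names (struct entry, hash_find(), hash_remove(), entry_put(), roam_back()), not the batman-adv translation-table code: a lookup takes an extra reference, so removing the entry from the hash without also dropping the lookup reference leaves the count at one forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted hash entry. */
struct entry { int refcount; bool hashed; };

/* Lookup returns the entry with an extra reference held for the caller. */
static struct entry *hash_find(struct entry *e)
{
	if (!e->hashed)
		return NULL;
	e->refcount++;
	return e;
}

static void entry_put(struct entry *e)
{
	assert(e->refcount > 0);
	e->refcount--;
}

/* Unhash the entry and drop the reference owned by the hash table. */
static void hash_remove(struct entry *e)
{
	if (e->hashed) {
		e->hashed = false;
		entry_put(e);
	}
}

/* Returns the refcount left after a roam-back; nonzero means a leak. */
static int roam_back(bool release_lookup_ref)
{
	struct entry e = { .refcount = 1, .hashed = true };
	struct entry *found = hash_find(&e);

	assert(found != NULL);
	hash_remove(found);
	if (release_lookup_ref)
		entry_put(found);
	return e.refcount;
}
```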

Cc: stable@vger.kernel.org
Fixes: 068ee6e204 ("batman-adv: roaming handling mechanism redesign")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-05 08:02:01 +02:00
Kuniyuki Iwashima b205153689 tcp: Update stale comment for MD5 in tcp_parse_options().
Since commit 9ea88a1530 ("tcp: md5: check md5 signature without socket
lock"), the MD5 option is checked in tcp_v[46]_rcv().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803224552.69398-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:28:36 -07:00
Kuniyuki Iwashima d0f2b7a9ca tcp: Disable header prediction for MD5 flow.
TCP socket saves the minimum required header length in tcp_header_len
of struct tcp_sock, and later the value is used in __tcp_fast_path_on()
to generate a part of TCP header in tcp_sock(sk)->pred_flags.

In tcp_rcv_established(), if the incoming packet matches the pattern
in pred_flags, we enter the fast path and skip full option parsing.
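
To make the role of tcp_header_len concrete, the following sketch computes the predicted fourth word of the TCP header (data offset, flags, window). It follows the shape of __tcp_fast_path_on() but stays in host byte order for readability, whereas the kernel stores the value in network byte order; pred_flags_host() and TCP_FLAG_ACK_HOST are names assumed for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_FLAG_ACK_HOST	0x00100000u	/* ACK bit, host byte order */
#define TCPOLEN_MD5SIG_ALIGNED	20u		/* aligned size of the MD5 option */

/* The header length in bytes shifted by 26 equals the length in 32-bit
 * words placed in the top four bits, i.e. the data offset field. */
static uint32_t pred_flags_host(uint32_t tcp_header_len, uint32_t snd_wnd)
{
	assert(tcp_header_len % 4 == 0);
	assert(tcp_header_len >= 20 && tcp_header_len <= 60);
	assert(snd_wnd <= 0xffffu);
	return (tcp_header_len << 26) | TCP_FLAG_ACK_HOST | snd_wnd;
}
```

A socket whose tcp_header_len omits the 20 bytes of the MD5 option predicts a data offset of 5 words, while every incoming MD5 segment carries 10, so the comparison cannot succeed and such a flow stays on the slow path.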

The MD5 option is parsed in tcp_v[46]_rcv(), so we need not parse it
again later in tcp_rcv_established() unless other options exist.  We
add TCPOLEN_MD5SIG_ALIGNED to tcp_header_len in two paths to avoid the
slow path.

For passive open connections with MD5, we add TCPOLEN_MD5SIG_ALIGNED
to tcp_header_len in tcp_create_openreq_child() after 3WHS.

On the other hand, we do it in tcp_connect_init() for active open
connections.  However, the value is overwritten while processing
SYN+ACK or crossed SYN in tcp_rcv_synsent_state_process().

These two cases will have the wrong value in pred_flags and never go
into the fast path.

We could update tcp_header_len in tcp_rcv_synsent_state_process(), but
a test with slightly modified netperf which uses MD5 for each flow shows
that the slow path is actually a bit faster than the fast path.

  On c5.4xlarge EC2 instance (16 vCPU, 32 GiB mem)

  $ for i in {1..10}; do
  ./super_netperf $(nproc) -H localhost -l 10 -- -m 256 -M 256;
  done

  Avg of 10
  * 36e68eadd3  : 10.376 Gbps
  * all fast path : 10.374 Gbps (patch v2, See Link)
  * all slow path : 10.394 Gbps

The header prediction is not worth adding complexity for MD5, so let's
disable it for MD5.

Link: https://lore.kernel.org/netdev/20230803042214.38309-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803224552.69398-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:28:36 -07:00
Eric Dumazet a47e598fbd dccp: fix data-race around dp->dccps_mss_cache
dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket.
Same thing in do_dccp_getsockopt().

Add READ_ONCE()/WRITE_ONCE() annotations,
and change dccp_sendmsg() to check again dccps_mss_cache
after socket is locked.

Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803163021.2958262-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:27:58 -07:00
Paolo Abeni 511b90e392 mptcp: fix disconnect vs accept race
Despite commit 0ad529d9fd ("mptcp: fix possible divide by zero in
recvmsg()"), the mptcp protocol is still prone to a race between
disconnect() (or shutdown) and accept.

The root cause is that the mentioned commit checks the msk-level
flag, but mptcp_stream_accept() does not acquire the msk-level lock,
as it can rely directly on the first subflow lock.

As reported by Christoph, that can lead to a race where an msk
socket is accepted after mptcp_subflow_queue_clean() releases
the listener socket lock and just before it takes destructive
actions, leading to the following splat:

BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 5a4ca067 P4D 5a4ca067 PUD 37d4c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 10955 Comm: syz-executor.5 Not tainted 6.5.0-rc1-gdc7b257ee5dd #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:mptcp_stream_accept+0x1ee/0x2f0 include/net/inet_sock.h:330
Code: 0a 09 00 48 8b 1b 4c 39 e3 74 07 e8 bc 7c 7f fe eb a1 e8 b5 7c 7f fe 4c 8b 6c 24 08 eb 05 e8 a9 7c 7f fe 49 8b 85 d8 09 00 00 <0f> b6 40 12 88 44 24 07 0f b6 6c 24 07 bf 07 00 00 00 89 ee e8 89
RSP: 0018:ffffc90000d07dc0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888037e8d020 RCX: ffff88803b093300
RDX: 0000000000000000 RSI: ffffffff833822c5 RDI: ffffffff8333896a
RBP: 0000607f82031520 R08: ffff88803b093300 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000003e83 R12: ffff888037e8d020
R13: ffff888037e8c680 R14: ffff888009af7900 R15: ffff888009af6880
FS:  00007fc26d708640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000012 CR3: 0000000066bc5001 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 do_accept+0x1ae/0x260 net/socket.c:1872
 __sys_accept4+0x9b/0x110 net/socket.c:1913
 __do_sys_accept4 net/socket.c:1954 [inline]
 __se_sys_accept4 net/socket.c:1951 [inline]
 __x64_sys_accept4+0x20/0x30 net/socket.c:1951
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Address the issue by temporarily removing the pending request sockets
from the accept queue, so that a racing accept() can't touch them.

After depleting the msk, the ssks still exist as plain TCP sockets;
re-insert them into the accept queue, so that a later
inet_csk_listen_stop() will complete the TCP socket disposal.

Fixes: 2a6a870e44 ("mptcp: stops worker on unaccepted sockets at listener close")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/423
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-4-6671b1ab11cc@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:26:27 -07:00
Paolo Abeni ff18f9ef30 mptcp: avoid bogus reset on fallback close
Since the blamed commit, the MPTCP protocol unconditionally sends
TCP resets on all the subflows on disconnect().

That fits full-blown MPTCP sockets - to implement the fastclose
mechanism - but on fallback sockets it causes unexpected corruption of
the data stream, caught as sporadic self-test failures.

Fixes: d21f834855 ("mptcp: use fastclose on more edge scenarios")
Cc: stable@vger.kernel.org
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/419
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-3-6671b1ab11cc@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:26:27 -07:00
Florian Westphal 6a7ac3d205 tunnels: fix kasan splat when generating ipv4 pmtu error
If we try to emit an icmp error in response to a nonlinear skb, we get

BUG: KASAN: slab-out-of-bounds in ip_compute_csum+0x134/0x220
Read of size 4 at addr ffff88811c50db00 by task iperf3/1691
CPU: 2 PID: 1691 Comm: iperf3 Not tainted 6.5.0-rc3+ #309
[..]
 kasan_report+0x105/0x140
 ip_compute_csum+0x134/0x220
 iptunnel_pmtud_build_icmp+0x554/0x1020
 skb_tunnel_check_pmtu+0x513/0xb80
 vxlan_xmit_one+0x139e/0x2ef0
 vxlan_xmit+0x1867/0x2760
 dev_hard_start_xmit+0x1ee/0x4f0
 br_dev_queue_push_xmit+0x4d1/0x660
 [..]

ip_compute_csum() cannot deal with nonlinear skbs, so avoid it.
After this change, splat is gone and iperf3 is no longer stuck.
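
To illustrate why a linear checksum routine misbehaves on nonlinear data, here is a small user-space sketch. It is not the kernel change itself: struct frag, csum_linear() and csum_frags() are hypothetical names, and the fragment list only stands in for an skb whose payload is spread over several buffers. A linear routine handed just the head pointer and the total length would read past the head buffer, which is the kind of access KASAN reports above; walking the pieces keeps every read in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one piece of a nonlinear packet. */
struct frag {
	const uint8_t *data;
	size_t len;
};

/* Fold a wide running sum into a 16-bit ones' complement checksum. */
static uint16_t csum_fold64(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* RFC 1071 style checksum; only valid if all len bytes are contiguous. */
static uint16_t csum_linear(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint64_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte, padded with zero */
		sum += (uint64_t)buf[i] << 8;
	return csum_fold64(sum);
}

/* Same checksum, but walking every fragment instead of one flat buffer. */
static uint16_t csum_frags(const struct frag *frags, size_t nfrags)
{
	uint64_t sum = 0;
	size_t pos = 0;			/* byte offset within the whole packet */

	assert(frags != NULL || nfrags == 0);
	for (size_t f = 0; f < nfrags; f++) {
		for (size_t i = 0; i < frags[f].len; i++, pos++) {
			/* even offsets hold the high byte of a 16-bit word */
			if (pos % 2 == 0)
				sum += (uint64_t)frags[f].data[i] << 8;
			else
				sum += frags[f].data[i];
		}
	}
	return csum_fold64(sum);
}
```

Both helpers give the same result for the same bytes, even when a fragment boundary falls on an odd offset.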

Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230803152653.29535-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:24:52 -07:00
Eric Dumazet 8a98961777 net/packet: annotate data-races around tp->status
Another syzbot report [1] is about tp->status lockless reads
from __packet_get_status()

[1]
BUG: KCSAN: data-race in __packet_rcv_has_room / __packet_set_status

write to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 0:
__packet_set_status+0x78/0xa0 net/packet/af_packet.c:407
tpacket_rcv+0x18bb/0x1a60 net/packet/af_packet.c:2483
deliver_skb net/core/dev.c:2173 [inline]
__netif_receive_skb_core+0x408/0x1e80 net/core/dev.c:5337
__netif_receive_skb_one_core net/core/dev.c:5491 [inline]
__netif_receive_skb+0x57/0x1b0 net/core/dev.c:5607
process_backlog+0x21f/0x380 net/core/dev.c:5935
__napi_poll+0x60/0x3b0 net/core/dev.c:6498
napi_poll net/core/dev.c:6565 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6698
__do_softirq+0xc1/0x265 kernel/softirq.c:571
invoke_softirq kernel/softirq.c:445 [inline]
__irq_exit_rcu+0x57/0xa0 kernel/softirq.c:650
sysvec_apic_timer_interrupt+0x6d/0x80 arch/x86/kernel/apic/apic.c:1106
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645
smpboot_thread_fn+0x33c/0x4a0 kernel/smpboot.c:112
kthread+0x1d7/0x210 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

read to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 1:
__packet_get_status net/packet/af_packet.c:436 [inline]
packet_lookup_frame net/packet/af_packet.c:524 [inline]
__tpacket_has_room net/packet/af_packet.c:1255 [inline]
__packet_rcv_has_room+0x3f9/0x450 net/packet/af_packet.c:1298
tpacket_rcv+0x275/0x1a60 net/packet/af_packet.c:2285
deliver_skb net/core/dev.c:2173 [inline]
dev_queue_xmit_nit+0x38a/0x5e0 net/core/dev.c:2243
xmit_one net/core/dev.c:3574 [inline]
dev_hard_start_xmit+0xcf/0x3f0 net/core/dev.c:3594
__dev_queue_xmit+0xefb/0x1d10 net/core/dev.c:4244
dev_queue_xmit include/linux/netdevice.h:3088 [inline]
can_send+0x4eb/0x5d0 net/can/af_can.c:276
bcm_can_tx+0x314/0x410 net/can/bcm.c:302
bcm_tx_timeout_handler+0xdb/0x260
__run_hrtimer kernel/time/hrtimer.c:1685 [inline]
__hrtimer_run_queues+0x217/0x700 kernel/time/hrtimer.c:1749
hrtimer_run_softirq+0xd6/0x120 kernel/time/hrtimer.c:1766
__do_softirq+0xc1/0x265 kernel/softirq.c:571
run_ksoftirqd+0x17/0x20 kernel/softirq.c:939
smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164
kthread+0x1d7/0x210 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

value changed: 0x0000000000000000 -> 0x0000000020000081

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 6.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023

Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230803145600.2937518-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 18:03:16 -07:00
Eric Dumazet c4a6b2da4b tcp_metrics: hash table allocation cleanup
After commit 098a697b49 ("tcp_metrics: Use a single hash table
for all network namespaces.") we can avoid calling tcp_net_metrics_init()
for each new netns.

Instead, rename tcp_net_metrics_init() to tcp_metrics_hash_alloc(),
and move it to __init section.

Also move tcpmhash_entries to __initdata section.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230803135417.2716879-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 15:33:39 -07:00
Xiang Yang 17ebf8a4c3 mptcp: fix the incorrect judgment for msk->cb_flags
Coccicheck reports the error below:
net/mptcp/protocol.c:3330:15-28: ERROR: test of a variable/field address

Since the address of msk->cb_flags is used in __test_and_clear_bit, the
address should not be NULL. A check on &msk->cb_flags will therefore
always be true; we should check the real value of msk->cb_flags here.
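
The class of bug Coccinelle flags here can be shown with a tiny user-space sketch. struct msk_sketch and both helpers are hypothetical and heavily cut down; they only contrast a test on a field's address with a test on its value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the socket structure. */
struct msk_sketch {
	unsigned long cb_flags;
};

/* Buggy form: equivalent to "if (&msk->cb_flags)", a test on an address. */
static bool has_pending_buggy(const struct msk_sketch *msk)
{
	const unsigned long *p;

	assert(msk != NULL);
	p = &msk->cb_flags;
	return p != NULL;		/* true for every valid msk */
}

/* Fixed form: tests the value stored in the field. */
static bool has_pending_fixed(const struct msk_sketch *msk)
{
	assert(msk != NULL);
	return msk->cb_flags != 0;
}
```

With all flag bits clear, the buggy helper still reports pending work, so a fast path guarded by it would never be taken.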

Fixes: 65a569b03c ("mptcp: optimize release_cb for the common case")
Signed-off-by: Xiang Yang <xiangyang3@huawei.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230803072438.1847500-1-xiangyang3@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 15:22:13 -07:00
Jiri Pirko 6e067d0cab devlink: use generated split ops and remove duplicated commands from small ops
Do the switch and use generated split ops for get and info_get commands.
Remove those from small ops array.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:02 -07:00
Jiri Pirko b2551b1517 devlink: include the generated netlink header
Put the newly added generated header to the include list. Remove the
duplicated temporary function prototypes.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:02 -07:00
Jiri Pirko 6b7c486cae devlink: add split ops generated according to spec
Improve the existing devlink spec in order to serve as a source for
generation of valid devlink split ops for the existing commands.
Add the generated sources.

Note that the policies are narrowed down only to the attributes that
are actually parsed. The dont-validate-strict parsing policy makes sure
that other possibly passed garbage attributes from userspace are
ignored during validation.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:01 -07:00
Jiri Pirko 8300dce542 devlink: un-static devlink_nl_pre/post_doit()
To be prepared for the follow-up generated split ops addition,
make the functions devlink_nl_pre_doit() and devlink_nl_post_doit()
usable outside of netlink.c. Introduce temporary prototypes which are
going to be removed once the generated header is included.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:01 -07:00
Jiri Pirko 491a24872a devlink: introduce couple of dumpit callbacks for split ops
Introduce couple of dumpit callbacks for generated split ops. Have them
as a thin wrapper around iteration function and allow to pass dump_one()
function pointer directly without need to store in devlink_cmd structs.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:01 -07:00
Jiri Pirko d61aedcf62 devlink: rename couple of doit netlink callbacks to match generated names
The generated names of the doit netlink callback are missing "cmd" in
their names. Change names to be ready to switch to generated split ops
header.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:01 -07:00
Jiri Pirko ba0f66c95f devlink: rename devlink_nl_ops to devlink_nl_small_ops
In order to avoid name collision with the generated split ops array
which is going to be introduced as a follow-up patch, rename
the existing ops array to devlink_nl_small_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04 14:03:01 -07:00
Linus Torvalds 4593f3c2c6 Two patches to improve RBD exclusive lock interaction with
osd_request_timeout option and another fix to reduce the potential for
 erroneous blocklisting -- this time in CephFS.  All going to stable.

Merge tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two patches to improve RBD exclusive lock interaction with
  osd_request_timeout option and another fix to reduce the potential for
  erroneous blocklisting -- this time in CephFS. All going to stable"

* tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client:
  libceph: fix potential hang in ceph_osdc_notify()
  rbd: prevent busy loop when requesting exclusive lock
  ceph: defer stopping mdsc delayed_work
2023-08-04 11:29:38 -07:00
Jakub Kicinski d07b7b32da pull-request: bpf-next 2023-08-03

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-03

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).

The main changes are:

1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
   Daniel Borkmann

2) Support new insns from cpu v4 from Yonghong Song

3) Non-atomically allocate freelist during prefill from YiFei Zhu

4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu

5) Add tracepoint to xdp attaching failure from Leon Hwang

6) struct netdev_rx_queue and xdp.h reshuffling to reduce
   rebuild time from Jakub Kicinski

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  net: invert the netdevice.h vs xdp.h dependency
  net: move struct netdev_rx_queue out of netdevice.h
  eth: add missing xdp.h includes in drivers
  selftests/bpf: Add testcase for xdp attaching failure tracepoint
  bpf, xdp: Add tracepoint to xdp attaching failure
  selftests/bpf: fix static assert compilation issue for test_cls_*.c
  bpf: fix bpf_probe_read_kernel prototype mismatch
  riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
  libbpf: fix typos in Makefile
  tracing: bpf: use struct trace_entry in struct syscall_tp_t
  bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
  bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
  netfilter: bpf: Only define get_proto_defrag_hook() if necessary
  bpf: Fix an array-index-out-of-bounds issue in disasm.c
  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
  docs/bpf: Fix malformed documentation
  bpf: selftests: Add defrag selftests
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Support not connecting client socket
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  ...
====================

Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 15:34:36 -07:00
Jakub Kicinski 35b1b1fd96 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/dsa/port.c
  9945c1fb03 ("net: dsa: fix older DSA drivers using phylink")
  a88dd75384 ("net: dsa: remove legacy_pre_march2020 detection")
https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/

net/xdp/xsk.c
  3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
  b7f72a30e9 ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  37b61cda9c ("bnxt: don't handle XDP in netpoll")
  2b56b3d992 ("eth: bnxt: handle invalid Tx completions more gracefully")
https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c
  62da08331f ("net/mlx5e: Set proper IPsec source port in L4 selector")
  fbd517549c ("net/mlx5e: Add function to get IPsec offload namespace")

drivers/net/ethernet/sfc/selftest.c
  55c1528f9b ("sfc: fix field-spanning memcpy in selftest")
  ae9d445cd4 ("sfc: Miscellaneous comment removals")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 14:34:37 -07:00
Linus Torvalds 999f663186 Including fixes from bpf and wireless.
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and wireless.

  Nothing scary here. Feels like the first wave of regressions from v6.5
  is addressed - one outstanding fix still to come in TLS for the
  sendpage rework.

  Current release - regressions:

   - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES

   - dsa: fix older DSA drivers using phylink

  Previous releases - regressions:

   - gro: fix misuse of CB in udp socket lookup

   - mlx5: unregister devlink params in case interface is down

   - Revert "wifi: ath11k: Enable threaded NAPI"

  Previous releases - always broken:

   - sched: cls_u32: fix match key mis-addressing

   - sched: bind logic fixes for cls_fw, cls_u32 and cls_route

   - add bound checks to a number of places which hand-parse netlink

   - bpf: disable preemption in perf_event_output helpers code

   - qed: fix scheduling in a tasklet while getting stats

   - avoid using APIs which are not hardirq-safe in couple of drivers,
     when we may be in a hard IRQ (netconsole)

   - wifi: cfg80211: fix return value in scan logic, avoid page
     allocator warning

   - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D
     (DBDC)

  Misc:

   - drop handful of inactive maintainers, put some new in place"

* tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits)
  MAINTAINERS: update TUN/TAP maintainers
  test/vsock: remove vsock_perf executable on `make clean`
  tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
  tcp_metrics: annotate data-races around tm->tcpm_net
  tcp_metrics: annotate data-races around tm->tcpm_vals[]
  tcp_metrics: annotate data-races around tm->tcpm_lock
  tcp_metrics: annotate data-races around tm->tcpm_stamp
  tcp_metrics: fix addr_same() helper
  prestera: fix fallback to previous version on same major version
  udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
  net/mlx5e: Set proper IPsec source port in L4 selector
  net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
  net/mlx5: fs_core: Make find_closest_ft more generic
  wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
  vxlan: Fix nexthop hash size
  ip6mr: Fix skb_under_panic in ip6mr_cache_report()
  s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
  net: tap_open(): set sk_uid from current_fsuid()
  net: tun_chr_open(): set sk_uid from current_fsuid()
  net: dcb: choose correct policy to parse DCB_ATTR_BCN
  ...
2023-08-03 14:00:02 -07:00
Sven Eckelmann 112cbcb4af batman-adv: Check hardif MTU against runtime MTU
If the MTU of the soft/mesh interface was already reduced (enough), it is
not necessary to print a warning about a hard interface not having an MTU to
transport ethernet payloads of 1500 bytes.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-03 21:11:42 +02:00
Sven Eckelmann e4b8178045 batman-adv: Avoid magic value for minimum MTU
The header linux/if_ether.h already defines a constant for the minimum MTU.
So simply use it instead of having a magic constant in the code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-03 21:11:42 +02:00
YueHaibing bbfb428a0c batman-adv: Remove unused declarations
Since commit 335fbe0f5d ("batman-adv: tvlv - convert tt query packet to use tvlv unicast packets")
batadv_recv_tt_query() is not used.
And commit 122edaa059 ("batman-adv: tvlv - convert roaming adv packet to use tvlv unicast packets")
left behind batadv_recv_roam_adv().

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-03 21:11:42 +02:00
Simon Wunderlich 2744cefe03 batman-adv: Start new development cycle
This version will contain all the (major or even only minor) changes for
Linux 6.6.

The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-03 21:11:42 +02:00
Jakub Kicinski 3932f22723 pull-request: bpf 2023-08-03

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Martin KaFai Lau says:

====================
pull-request: bpf 2023-08-03

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).

The main changes are:

1) Disable preemption in perf_event_output helpers code,
   from Jiri Olsa

2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
   from Lin Ma

3) Multiple warning splat fixes in cpumap from Hou Tao

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, cpumap: Handle skb as well when clean up ptr_ring
  bpf, cpumap: Make sure kthread is running before map update returns
  bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
  bpf: Disable preemption in bpf_event_output
  bpf: Disable preemption in bpf_perf_event_output
====================

Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 11:22:53 -07:00
Jakub Kicinski 0d48a84b31 wireless fixes for v6.5

Merge tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Kalle Valo says:

====================
wireless fixes for v6.5

We did some house cleaning in MAINTAINERS file so several patches
about that. Few regressions fixed and also fix some recently enabled
memcpy() warnings. Only small commits and nothing special standing
out.

* tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
  wifi: ray_cs: Replace 1-element array with flexible array
  MAINTAINERS: add Jeff as ath10k, ath11k and ath12k maintainer
  MAINTAINERS: wifi: mark mlw8k as orphan
  MAINTAINERS: wifi: mark b43 as orphan
  MAINTAINERS: wifi: mark zd1211rw as orphan
  MAINTAINERS: wifi: mark wl3501 as orphan
  MAINTAINERS: wifi: mark rndis_wlan as orphan
  MAINTAINERS: wifi: mark ar5523 as orphan
  MAINTAINERS: wifi: mark cw1200 as orphan
  MAINTAINERS: wifi: atmel: mark as orphan
  MAINTAINERS: wifi: rtw88: change Ping as the maintainer
  Revert "wifi: ath6k: silence false positive -Wno-dangling-pointer warning on GCC 12"
  wifi: cfg80211: Fix return value in scan logic
  Revert "wifi: ath11k: Enable threaded NAPI"
  MAINTAINERS: Update mwifiex maintainer list
  wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC)
====================

Link: https://lore.kernel.org/r/20230803140058.57476C433C9@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 11:05:46 -07:00
Eric Dumazet ddf251fa2b tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
Whenever tcpm_new() reclaims an old entry, tcpm_suck_dst()
would overwrite data that could be read from tcp_fastopen_cache_get()
or tcp_metrics_fill_info().

We need to acquire fastopen_seqlock to maintain consistency.

For newly allocated objects, tcpm_new() can switch to kzalloc()
to avoid an extra fastopen_seqlock acquisition.
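
The read/retry idea behind a sequence lock can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions: struct cookie_sketch and its fields are hypothetical, a single writer is assumed (the kernel seqlock also serializes writers), and every access uses the default sequentially consistent ordering for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record protected by a sequence counter. */
struct cookie_sketch {
	atomic_uint seq;		/* even: stable, odd: write in progress */
	_Atomic uint32_t mss;
	_Atomic uint32_t cookie;
};

/* Writer: bump the counter to odd, update the fields, bump it to even. */
static void cookie_write(struct cookie_sketch *c, uint32_t mss, uint32_t cookie)
{
	atomic_fetch_add(&c->seq, 1);
	atomic_store(&c->mss, mss);
	atomic_store(&c->cookie, cookie);
	atomic_fetch_add(&c->seq, 1);
}

/* Reader: retry until both fields were read under one even counter value. */
static void cookie_read(struct cookie_sketch *c, uint32_t *mss, uint32_t *cookie)
{
	unsigned int start;

	assert(mss != NULL && cookie != NULL);
	do {
		start = atomic_load(&c->seq);
		*mss = atomic_load(&c->mss);
		*cookie = atomic_load(&c->cookie);
	} while ((start & 1) || atomic_load(&c->seq) != start);
}
```

A reader that overlaps a writer sees either an odd counter or a changed counter and retries, so it never returns a mix of old and new fields. An overwrite of a reclaimed entry that skips the writer side gives readers no such signal.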

Fixes: 1fe4c481ba ("net-tcp: Fast Open client - cookie cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 10:58:24 -07:00
Eric Dumazet d5d986ce42 tcp_metrics: annotate data-races around tm->tcpm_net
tm->tcpm_net can be read or written locklessly.

Instead of changing write_pnet() and read_pnet() and potentially
hurting performance, add the needed READ_ONCE()/WRITE_ONCE()
in tm_net() and tcpm_new().

Fixes: 849e8a0ca8 ("tcp_metrics: Add a field tcpm_net and verify it matches on lookup")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 10:58:24 -07:00
Eric Dumazet 8c4d04f6b4 tcp_metrics: annotate data-races around tm->tcpm_vals[]
tm->tcpm_vals[] values can be read or written locklessly.

Add needed READ_ONCE()/WRITE_ONCE() to document this,
and force use of tcp_metric_get() and tcp_metric_set()
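
A rough user-space sketch of the accessor pattern follows. The two macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely on the GNU C __typeof__ extension), and struct metrics_sketch, the index names and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins: force exactly one access through a volatile lvalue. */
#define READ_ONCE_SKETCH(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical metric indices. */
enum { METRIC_RTT, METRIC_CWND, METRIC_MAX };

struct metrics_sketch {
	uint32_t vals[METRIC_MAX];
};

/* All lockless reads go through this helper ... */
static uint32_t metric_get(const struct metrics_sketch *m, int idx)
{
	assert(idx >= 0 && idx < METRIC_MAX);
	return READ_ONCE_SKETCH(m->vals[idx]);
}

/* ... and all lockless writes through this one. */
static void metric_set(struct metrics_sketch *m, int idx, uint32_t val)
{
	assert(idx >= 0 && idx < METRIC_MAX);
	WRITE_ONCE_SKETCH(m->vals[idx], val);
}
```

Funnelling every access through two helpers keeps the annotations in one place, so a new call site cannot quietly reintroduce a plain load or store.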

Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 10:58:24 -07:00
Eric Dumazet 285ce119a3 tcp_metrics: annotate data-races around tm->tcpm_lock
tm->tcpm_lock can be read or written locklessly.

Add needed READ_ONCE()/WRITE_ONCE() to document this.

Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 10:58:24 -07:00
Eric Dumazet 949ad62a5d tcp_metrics: annotate data-races around tm->tcpm_stamp
tm->tcpm_stamp can be read or written locklessly.

Add needed READ_ONCE()/WRITE_ONCE() to document this.

Also constify tcpm_check_stamp() dst argument.

Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 10:58:24 -07:00
Eric Dumazet e6638094d7 tcp_metrics: fix addr_same() helper
Because v4 and v6 families use separate inetpeer trees (respectively
net->ipv4.peers and net->ipv6.peers), inetpeer_addr_cmp(a, b) assumes
a & b share the same family.

tcp_metrics use a common hash table, where entries can have different
families.

We must therefore make sure to not call inetpeer_addr_cmp()
if the families do not match.
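
The family check can be sketched in user space as below. struct addr_sketch, the FAM_* constants and addr_same_sketch() are hypothetical; the point is only that a comparison which assumes a shared family must be guarded when entries of both families live in the same table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical family tags and address container. */
enum { FAM_V4 = 4, FAM_V6 = 6 };

struct addr_sketch {
	uint16_t family;
	union {
		uint32_t a4;
		uint32_t a6[4];
	} u;
};

/* Compare raw address bytes only after confirming the families match. */
static bool addr_same_sketch(const struct addr_sketch *a,
			     const struct addr_sketch *b)
{
	size_t len;

	assert(a->family == FAM_V4 || a->family == FAM_V6);
	if (a->family != b->family)
		return false;
	len = a->family == FAM_V4 ? sizeof(a->u.a4) : sizeof(a->u.a6);
	return memcmp(&a->u, &b->u, len) == 0;
}
```

Without the early return, an IPv4 entry could compare equal to an IPv6 entry whose first 32 bits happen to match.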

Fixes: d39d14ffa2 ("net: Add helper function to compare inetpeer addresses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 10:58:24 -07:00
Jakub Kicinski 82e896d992 docs: net: page_pool: use kdoc to avoid duplicating the information
All struct members of the driver-facing APIs are documented twice,
in the code and under Documentation. This is a bit tedious.

I also get the feeling that a lot of developers will read the header
when coding, rather than the doc. Bring the two a little closer
together by using kdoc for structs and functions.

Using kdoc also gives us links (mentioning a function or struct
in the text gets replaced by a link to its doc).

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230802161821.3621985-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 09:54:24 -07:00
Jakub Kicinski 680ee0456a net: invert the netdevice.h vs xdp.h dependency
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig build size when
xdp.h is touched from 5947 to 662 objects.

Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.

The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since it's embedded in struct netdevice.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Jakub Kicinski 49e47a5b61 net: move struct netdev_rx_queue out of netdevice.h
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.

In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.

We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.

Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
David Howells ce650a1663 udp6: Fix __ip6_append_data()'s handling of MSG_SPLICE_PAGES
__ip6_append_data() can have a similar problem to __ip_append_data()[1] when
asked to splice into a partially-built UDP message that has more than the
frag-limit data and up to the MTU limit, but in the ipv6 case, it errors
out with EINVAL.  This can be triggered with something like:

        pipe(pfd);
        sfd = socket(AF_INET6, SOCK_DGRAM, 0);
        connect(sfd, ...);
        send(sfd, buffer, 8137, MSG_CONFIRM|MSG_MORE);
        write(pfd[1], buffer, 8);
        splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);

where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).

The problem is that the calculation of the amount to copy in
__ip6_append_data() goes negative in two places, but a check has been put
in to give an error in this case.

This happens because when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), the terms in:

        copy = datalen - transhdrlen - fraggap - pagedlen;

then mostly cancel when pagedlen is substituted for, leaving just -fraggap.

Fix this by:

 (1) Insert a note about the dodgy calculation of 'copy'.

 (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
     equation, so that 'offset' isn't regressed and 'length' isn't
     increased, which will mean that length and thus copy should match the
     amount left in the iterator.

 (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
     we're asked to splice more than is in the iterator.  It might be
     better to not give the warning or even just give a 'short' write.

 (4) If MSG_SPLICE_PAGES, override the copy<0 check.

[!] Note that this should also affect MSG_ZEROCOPY, but that will return
-EINVAL for the range of send sizes that requires the skbuff to be split.
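
Steps (2) and (4) above can be pictured with a small user-space sketch. It is
not the kernel change itself: fixup_copy() and its splice_pages argument are
hypothetical names standing in for the MSG_SPLICE_PAGES test, and only the
"clamp instead of failing" decision is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: stores the value of 'copy' to use in *out and
 * returns 0, or returns -EINVAL when a negative value is not tolerated.
 */
static int fixup_copy(int copy, bool splice_pages, int *out)
{
    assert(out != NULL);

    if (copy < 0) {
        if (!splice_pages)
            return -EINVAL;  /* the existing copy < 0 check still applies */
        copy = 0;            /* splice case: clamp rather than fail */
    }
    *out = copy;
    return 0;
}
```

With copy = -24, the splice case yields 0 and continues, while any other
caller still sees -EINVAL.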

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ [1]
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1580952.1690961810@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-03 14:56:19 +02:00
Yue Haibing c956910d5a tipc: Remove unused function declarations
Commit d50ccc2d39 ("tipc: add 128-bit node identifier") declared but never
implemented tipc_node_id2hash().
Also commit 5c216e1d28 ("tipc: Allow run-time alteration of default link settings")
never implemented tipc_media_set_priority() and tipc_media_set_window(),
commit cad2929dc4 ("tipc: update a binding service via broadcast") only declared
tipc_named_bcast().

Since commit be07f05639 ("tipc: simplify the finalize work queue")
tipc_sched_net_finalize() is removed and declaration is unused.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230802034659.39840-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-03 12:51:45 +02:00
Jiri Slaby 8a76d8b075 net: nfc: remove casts from tty->disc_data
tty->disc_data is 'void *', so there is no need to cast from that.
Therefore remove the casts and assign the pointer directly.

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: linux-can@vger.kernel.org
Cc: netdev@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Max Staudt <max@enpas.org>
Link: https://lore.kernel.org/r/20230801062237.2687-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03 09:51:21 +02:00
David Howells 0f71c9caf2 udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
__ip_append_data() can get into an infinite loop when asked to splice into
a partially-built UDP message that has more than the frag-limit data and up
to the MTU limit.  Something like:

        pipe(pfd);
        sfd = socket(AF_INET, SOCK_DGRAM, 0);
        connect(sfd, ...);
        send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE);
        write(pfd[1], buffer, 8);
        splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);

where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).

The problem is that the calculation of the amount to copy in
__ip_append_data() goes negative in two places, and, in the second place,
this gets subtracted from the length remaining, thereby increasing it.

This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), because the terms in:

        copy = datalen - transhdrlen - fraggap - pagedlen;

then mostly cancel when pagedlen is substituted for, leaving just -fraggap.
This causes:

        length -= copy + transhdrlen;

to increase the length to more than the amount of data in msg->msg_iter,
which causes skb_splice_from_iter() to be unable to fill the request and it
returns less than 'copied' - which means that length never gets to 0 and we
never exit the loop.
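
The arithmetic can be checked in isolation. The sketch below is a simplified
user-space model, not the kernel function: compute_copy() and
remaining_after() are hypothetical helpers, and pagedlen is assumed to be
datalen - transhdrlen, which is the substitution the text above refers to.

```c
#include <assert.h>

/* Simplified model of the paged-data case; names are illustrative only. */
struct append_step {
    int pagedlen;
    int copy;
};

static struct append_step compute_copy(int datalen, int transhdrlen,
                                       int fraggap)
{
    struct append_step s;

    assert(datalen >= transhdrlen);
    /* Assumed: everything past the transport header is paged. */
    s.pagedlen = datalen - transhdrlen;
    /* The expression from the message; it reduces to -fraggap. */
    s.copy = datalen - transhdrlen - fraggap - s.pagedlen;
    return s;
}

/* Mirrors "length -= copy + transhdrlen" and returns the new length. */
static int remaining_after(int length, int copy, int transhdrlen)
{
    return length - (copy + transhdrlen);
}
```

For example, with fraggap = 24 and transhdrlen = 0, copy comes out as -24, so
a remaining length of 8 becomes 32 instead of shrinking.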

Fix this by:

 (1) Insert a note about the dodgy calculation of 'copy'.

 (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
     equation, so that 'offset' isn't regressed and 'length' isn't
     increased, which will mean that length and thus copy should match the
     amount left in the iterator.

 (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
     we're asked to splice more than is in the iterator.  It might be
     better to not give the warning or even just give a 'short' write.

[!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY
avoids the problem by simply assuming that everything asked for got copied,
not just the amount that was in the iterator.  This is a potential bug for
the future.

Fixes: 7ac7c98785 ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1420063.1690904933@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:19:32 -07:00
Vladimir Oltean fd770e856e net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers
It is desirable that the new .ndo_hwtstamp_set() API gives more
uniformity, less overhead and future flexibility w.r.t. the PHY
timestamping behavior.

Currently there are some drivers which allow PHY timestamping through
the procedure mentioned in Documentation/networking/timestamping.rst.
They don't do anything locally if phy_has_hwtstamp() is set, except for
lan966x which installs PTP packet traps.

Centralize that behavior in a new dev_set_hwtstamp_phylib() code
function, which calls either phy_mii_ioctl() for the phylib PHY,
or .ndo_hwtstamp_set() of the netdev, based on a single policy
(currently simplistic: phy_has_hwtstamp()).

Any driver converted to .ndo_hwtstamp_set() will automatically opt into
the centralized phylib timestamping policy. Unconverted drivers still
get to choose whether they let the PHY handle timestamping or not.

Netdev drivers with integrated PHY drivers that don't use phylib
presumably don't set dev->phydev, and those will always see
HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping
policy will remain 100% up to them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:06 -07:00
Vladimir Oltean 70ef7d87f6 net: transfer rtnl_lock() requirement from ethtool_set_ethtool_phy_ops() to caller
phy_init() and phy_exit() will have to do more stuff under rtnl_lock()
in a future change. Since rtnl_unlock() -> netdev_run_todo() does a lot
of stuff under the hood, it's a pity to lock and unlock the rtnetlink
mutex twice in a row.

Change the calling convention such that the only caller of
ethtool_set_ethtool_phy_ops(), phy_device.c, provides a context where
the rtnl_mutex is already acquired.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-11-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:06 -07:00
Maxim Georgiev 65c9fde15a net: vlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()
8021q is one of the stackable net devices which pass the hardware
timestamping ops to the real device through ndo_eth_ioctl(). This
prevents converting any device driver to the new hwtimestamping API
without regressions.

Remove that limitation in the vlan driver by using the newly introduced
helpers for timestamping through lower devices, that handle both the new
and the old driver API.

Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:05 -07:00
Maxim Georgiev e47d01fea6 net: add hwtstamping helpers for stackable net devices
The stackable net devices with hwtstamping support (vlan, macvlan,
bonding) only pass the hwtstamping ops to the lower (real) device.

These drivers are the first that need to be converted to the new
timestamping API, because if they aren't prepared to handle that,
then no real device driver can be converted to the new API either.

After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and
bond_eth_ioctl() have in common, here we propose two generic
implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which
can be called by those 3 drivers, with "dev" being their lower device.

These helpers cover both cases, when the lower driver is converted to
the new API or unconverted.
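
A rough user-space sketch of that dispatch follows. All demo_* names are
hypothetical stand-ins rather than the kernel helpers, and the legacy path is
reduced to a single callback; the point is only that the upper device calls
one helper, which prefers the new callback and falls back to the old one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_hwtstamp_cfg {
    int tx_type;
    int rx_filter;
};

struct demo_lower_dev {
    /* New-style callback; NULL while the driver is unconverted. */
    int (*hwtstamp_set)(struct demo_lower_dev *dev,
                        struct demo_hwtstamp_cfg *cfg);
    /* Legacy ioctl-style callback. */
    int (*eth_ioctl)(struct demo_lower_dev *dev, void *data);
    int used_legacy;  /* written by the stubs so the path is observable */
};

static int demo_converted_set(struct demo_lower_dev *dev,
                              struct demo_hwtstamp_cfg *cfg)
{
    (void)cfg;
    dev->used_legacy = 0;
    return 0;
}

static int demo_legacy_ioctl(struct demo_lower_dev *dev, void *data)
{
    (void)data;
    dev->used_legacy = 1;
    return 0;
}

/* Called by the stacked device with "dev" being its lower device. */
static int demo_hwtstamp_set_lower(struct demo_lower_dev *dev,
                                   struct demo_hwtstamp_cfg *cfg)
{
    assert(dev != NULL && cfg != NULL);

    if (dev->hwtstamp_set)
        return dev->hwtstamp_set(dev, cfg);
    if (dev->eth_ioctl)
        return dev->eth_ioctl(dev, cfg);
    return -EOPNOTSUPP;
}
```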

We need some hacks in case of an unconverted driver, namely to stuff
some pointers in struct kernel_hwtstamp_config which shouldn't have
been there (since the new API isn't supposed to need it). These will
be removed once all drivers have been converted to the new API.

Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:05 -07:00
Maxim Georgiev 66f7223039 net: add NDOs for configuring hardware timestamping
Current hardware timestamping API for NICs requires implementing
.ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP.

That API has some boilerplate such as request parameter translation
between user and kernel address spaces, handling possible translation
failures correctly, etc. Since it is the same all across the board, it
would be desirable to handle it through generic code.

Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which
implement that boilerplate and allow drivers to just act upon requests.
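
The split between generic boilerplate and driver logic might look roughly
like the user-space sketch below. It is an illustration under assumptions:
the struct layouts, copy_checked() (standing in for the user/kernel copy
routines) and demo_drv_set() are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts; the real structures differ. */
struct demo_user_cfg   { int flags; int tx_type; int rx_filter; };
struct demo_kernel_cfg { int flags; int tx_type; int rx_filter; };

typedef int (*demo_hwtstamp_set_fn)(struct demo_kernel_cfg *cfg);

/* Stand-in for the user/kernel copy: fails on a NULL pointer. */
static int copy_checked(void *dst, const void *src, size_t len)
{
    if (dst == NULL || src == NULL)
        return -EFAULT;
    memcpy(dst, src, len);
    return 0;
}

/* Generic layer: translate, call the driver, translate back. */
static int demo_generic_hwtstamp_set(struct demo_user_cfg *uptr,
                                     demo_hwtstamp_set_fn drv_set)
{
    struct demo_user_cfg ucfg;
    struct demo_kernel_cfg kcfg;
    int err;

    assert(drv_set != NULL);

    err = copy_checked(&ucfg, uptr, sizeof(ucfg));
    if (err)
        return err;

    kcfg.flags = ucfg.flags;
    kcfg.tx_type = ucfg.tx_type;
    kcfg.rx_filter = ucfg.rx_filter;

    err = drv_set(&kcfg);  /* the driver only acts upon the request */
    if (err)
        return err;

    ucfg.flags = kcfg.flags;
    ucfg.tx_type = kcfg.tx_type;
    ucfg.rx_filter = kcfg.rx_filter;
    return copy_checked(uptr, &ucfg, sizeof(ucfg));
}

/* Example driver callback: accepts the request but adjusts rx_filter. */
static int demo_drv_set(struct demo_kernel_cfg *cfg)
{
    cfg->rx_filter = 1;
    return 0;
}
```

The driver callback never touches a user pointer; copy failures are handled
once, in the generic layer.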

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:05 -07:00
Eric Dumazet ae6db08f8b net/packet: change packet_alloc_skb() to allow bigger paged allocations
packet_alloc_skb() is currently calling sock_alloc_send_pskb()
forcing order-0 page allocations.

Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max size by 8x.

Also add logic to increase the linear part if needed.
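
The 8x figure follows from PAGE_ALLOC_COSTLY_ORDER being 3, since an order-3
allocation spans 2^3 pages. A small worked example, assuming 4 KiB pages and
17 fragments per skb (both configuration-dependent; the DEMO_* constants and
max_paged_bytes() are illustrative, not kernel symbols):

```c
#include <assert.h>

#define DEMO_PAGE_SIZE     4096UL  /* assumed page size */
#define DEMO_MAX_FRAGS     17UL    /* assumed fragment limit per skb */
#define DEMO_COSTLY_ORDER  3       /* value of PAGE_ALLOC_COSTLY_ORDER */

/* Upper bound on paged data when every fragment is 2^order pages. */
static unsigned long max_paged_bytes(int order)
{
    assert(order >= 0 && order < 16);
    return DEMO_MAX_FRAGS * (DEMO_PAGE_SIZE << order);
}
```

Under these assumptions, order 0 allows 69632 bytes of paged data and order 3
allows 557056 bytes.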

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 18:44:55 -07:00
Eric Dumazet 09c2c90705 net: allow alloc_skb_with_frags() to allocate bigger packets
Refactor alloc_skb_with_frags() to allow bigger packet allocations.

Instead of assuming that only order-0 allocations will be attempted,
use the caller supplied max order.

v2: try harder to use high-order pages, per Willem feedback.

Link: https://lore.kernel.org/netdev/CANn89iJQfmc_KeUr3TeXvsLQwo3ZymyoCr7Y6AnHrkWSuz0yAg@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 18:44:55 -07:00
Leon Hwang bf4ea1d0b2 bpf, xdp: Add tracepoint to xdp attaching failure
When an error happens in dev_xdp_attach(), there should be a way to report
the error message to users, as the netlink approach does.

To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is
an appropriate way to notify users of the error message.

Hence, bpf libraries are able to retrieve the error message through this
tracepoint, and then report it to users.

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-02 14:21:12 -07:00
Yue Haibing e12f2a6d1b netlabel: Remove unused declaration netlbl_cipsov4_doi_free()
Since commit b1edeb1023 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
this declaration is unused and can be removed.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20230801143453.24452-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 12:28:22 -07:00
Yue Haibing 2fca1b5ef8 ila: Remove unnecessary file net/ila.h
Commit 642c2c9558 ("ila: xlat changes") removed ila_xlat_outgoing()
and ila_xlat_incoming() functions, then this file became unnecessary.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230801143129.40652-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 12:28:16 -07:00
Yue Haibing 30e0191b16 ip6mr: Fix skb_under_panic in ip6mr_cache_report()
skbuff: skb_under_panic: text:ffffffff88771f69 len:56 put:-4
 head:ffff88805f86a800 data:ffff887f5f86a850 tail:0x88 end:0x2c0 dev:pim6reg
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:192!
 invalid opcode: 0000 [#1] PREEMPT SMP KASAN
 CPU: 2 PID: 22968 Comm: kworker/2:11 Not tainted 6.5.0-rc3-00044-g0a8db05b571a #236
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 Workqueue: ipv6_addrconf addrconf_dad_work
 RIP: 0010:skb_panic+0x152/0x1d0
 Call Trace:
  <TASK>
  skb_push+0xc4/0xe0
  ip6mr_cache_report+0xd69/0x19b0
  reg_vif_xmit+0x406/0x690
  dev_hard_start_xmit+0x17e/0x6e0
  __dev_queue_xmit+0x2d6a/0x3d20
  vlan_dev_hard_start_xmit+0x3ab/0x5c0
  dev_hard_start_xmit+0x17e/0x6e0
  __dev_queue_xmit+0x2d6a/0x3d20
  neigh_connected_output+0x3ed/0x570
  ip6_finish_output2+0x5b5/0x1950
  ip6_finish_output+0x693/0x11c0
  ip6_output+0x24b/0x880
  NF_HOOK.constprop.0+0xfd/0x530
  ndisc_send_skb+0x9db/0x1400
  ndisc_send_rs+0x12a/0x6c0
  addrconf_dad_completed+0x3c9/0xea0
  addrconf_dad_work+0x849/0x1420
  process_one_work+0xa22/0x16e0
  worker_thread+0x679/0x10c0
  ret_from_fork+0x28/0x60
  ret_from_fork_asm+0x11/0x20

When a vlan device is set up on dev pim6reg, a DAD NS packet may be sent via reg_vif_xmit().
reg_vif_xmit()
    ip6mr_cache_report()
        skb_push(skb, -skb_network_offset(pkt));//skb_network_offset(pkt) is 4
And skb_push() is declared as:
	void *skb_push(struct sk_buff *skb, unsigned int len);
		skb->data -= len;
		//0xffff88805f86a84c - 0xfffffffc = 0xffff887f5f86a850
skb->data is set to 0xffff887f5f86a850, which is an invalid memory address, causing skb_push() to fail.
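
The wrap-around can be reproduced in user space. This is only a sketch of the
integer conversion, with pointers modelled as 64-bit integers; push_data()
and push_by_offset() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Models "skb->data -= len" with an unsigned int length. */
static uint64_t push_data(uint64_t data, unsigned int len)
{
    /* len is zero-extended to 64 bits before the subtraction. */
    return data - len;
}

/* Models a caller that passes the negated network offset. */
static uint64_t push_by_offset(uint64_t data, int network_offset)
{
    /* The example assumes a 32-bit unsigned int, as on x86-64 Linux. */
    assert(sizeof(unsigned int) == 4);
    /* A positive offset negated becomes a huge unsigned length. */
    return push_data(data, (unsigned int)-network_offset);
}
```

With data = 0xffff88805f86a84c and an offset of 4, the length passed is
0xfffffffc and the result is 0xffff887f5f86a850, matching the value in the
panic message.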

Fixes: 14fb64e1f4 ("[IPV6] MROUTE: Support PIM-SM (SSM).")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-02 10:35:21 +01:00
Ratheesh Kannoth c8915d7329 tc: flower: Enable offload support IPSEC SPI field.
This patch enables offload for TC classifier
flower rules which match against the SPI field.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-02 10:09:32 +01:00
Ratheesh Kannoth 4c13eda757 tc: flower: support for SPI
Add tc flower support for classifying ESP/AH
packets by matching the SPI field.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-02 10:09:31 +01:00
Ratheesh Kannoth a57c34a80c net: flow_dissector: Add IPSEC dissector
Add support for dissecting the IPsec SPI field (which is
32 bits in size) for ESP and AH packets.
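
As an illustration of where the SPI sits in each header, here is a
user-space sketch, not the flow dissector code. It assumes the standard
layouts (ESP: SPI is the first 32-bit field; AH: SPI follows the next-header,
payload-length and reserved fields). ipsec_spi() and the DEMO_* constants are
hypothetical, and the value is returned in host byte order for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PROTO_ESP 50  /* IP protocol number for ESP */
#define DEMO_PROTO_AH  51  /* IP protocol number for AH */

/*
 * Stores the SPI in *spi and returns 0, or returns -1 if the protocol
 * is neither ESP nor AH or the header is too short.
 */
static int ipsec_spi(uint8_t proto, const uint8_t *hdr, size_t len,
                     uint32_t *spi)
{
    size_t off;

    assert(hdr != NULL && spi != NULL);

    if (proto == DEMO_PROTO_ESP)
        off = 0;  /* ESP: SPI is the first field */
    else if (proto == DEMO_PROTO_AH)
        off = 4;  /* AH: skip next header, payload len, reserved */
    else
        return -1;

    if (len < off + 4)
        return -1;

    /* Assemble the 32-bit big-endian value byte by byte. */
    *spi = ((uint32_t)hdr[off] << 24) | ((uint32_t)hdr[off + 1] << 16) |
           ((uint32_t)hdr[off + 2] << 8) | (uint32_t)hdr[off + 3];
    return 0;
}
```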

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-02 10:09:31 +01:00
Ilya Dryomov e6e2843230 libceph: fix potential hang in ceph_osdc_notify()
If the cluster becomes unavailable, ceph_osdc_notify() may hang even
with osd_request_timeout option set because linger_notify_finish_wait()
waits for MWatchNotify NOTIFY_COMPLETE message with no associated OSD
request in flight -- it's completely asynchronous.

Introduce an additional timeout, derived from the specified notify
timeout.  While at it, switch both waits to killable which is more
correct.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
2023-08-02 09:07:34 +02:00
Lin Ma 31d49ba033 net: dcb: choose correct policy to parse DCB_ATTR_BCN
dcbnl_bcn_setcfg() uses the wrong policy to parse tb[DCB_ATTR_BCN],
a bug introduced in commit 859ee3c438 ("DCB: Add support for DCB
BCN"). See the comments in the code below:

static int dcbnl_bcn_setcfg(...)
{
  ...
  ret = nla_parse_nested_deprecated(..., dcbnl_pfc_up_nest, .. )
  // !!! dcbnl_pfc_up_nest for attributes
  //  DCB_PFC_UP_ATTR_0 to DCB_PFC_UP_ATTR_ALL in enum dcbnl_pfc_up_attrs
  ...
  for (i = DCB_BCN_ATTR_RP_0; i <= DCB_BCN_ATTR_RP_7; i++) {
  // !!! DCB_BCN_ATTR_RP_0 to DCB_BCN_ATTR_RP_7 in enum dcbnl_bcn_attrs
    ...
    value_byte = nla_get_u8(data[i]);
    ...
  }
  ...
  for (i = DCB_BCN_ATTR_BCNA_0; i <= DCB_BCN_ATTR_RI; i++) {
  // !!! DCB_BCN_ATTR_BCNA_0 to DCB_BCN_ATTR_RI in enum dcbnl_bcn_attrs
  ...
    value_int = nla_get_u32(data[i]);
  ...
  }
  ...
}

That is, nla_parse_nested_deprecated() uses the dcbnl_pfc_up_nest
policy, so the nlattrs are parsed as dcbnl_pfc_up_attrs attributes. But
the code that follows fetches each nlattr as a dcbnl_bcn_attrs attribute.
Comparing with the nla_policy for dcbnl_bcn_attrs, we can see that only
the beginning of the two policies is the same.

static const struct nla_policy dcbnl_pfc_up_nest[...] = {
        [DCB_PFC_UP_ATTR_0]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_1]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_2]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_3]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_4]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_5]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_6]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_7]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_ALL] = {.type = NLA_FLAG},
};

static const struct nla_policy dcbnl_bcn_nest[...] = {
        [DCB_BCN_ATTR_RP_0]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_1]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_2]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_3]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_4]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_5]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_6]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_7]         = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_ALL]       = {.type = NLA_FLAG},
        // from here is somewhat different
        [DCB_BCN_ATTR_BCNA_0]       = {.type = NLA_U32},
        ...
        [DCB_BCN_ATTR_ALL]          = {.type = NLA_FLAG},
};

Therefore, the current code is buggy and this
nla_parse_nested_deprecated could overflow the dcbnl_pfc_up_nest and use
the adjacent nla_policy to parse attributes from DCB_BCN_ATTR_BCNA_0.

Hence use the correct policy dcbnl_bcn_nest to parse the nested
tb[DCB_ATTR_BCN] TLV.
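
The mismatch can be modelled with two shortened tables in user space. This is
a sketch, not the netlink parser: the demo_* names are hypothetical, the BCN
table is truncated to two NLA_U32-style entries, and the bounds check in
demo_policy_type() is there to keep the example well-defined. A parser given
the larger attribute range together with the smaller table would instead
index past the end of that table, which is the overflow described above.

```c
#include <assert.h>
#include <stddef.h>

enum demo_type { T_UNSPEC, T_U8, T_FLAG, T_U32 };

struct demo_policy {
    const enum demo_type *types;
    size_t max_attr;  /* highest valid index into types[] */
};

/* Shortened stand-in for dcbnl_pfc_up_nest: indices 0..9. */
static const enum demo_type demo_pfc_types[] = {
    T_UNSPEC, T_U8, T_U8, T_U8, T_U8, T_U8, T_U8, T_U8, T_U8, T_FLAG,
};

/* Shortened stand-in for dcbnl_bcn_nest: indices 0..11. */
static const enum demo_type demo_bcn_types[] = {
    T_UNSPEC, T_U8, T_U8, T_U8, T_U8, T_U8, T_U8, T_U8, T_U8, T_FLAG,
    T_U32, T_U32,
};

static const struct demo_policy demo_pfc_policy = { demo_pfc_types, 9 };
static const struct demo_policy demo_bcn_policy = { demo_bcn_types, 11 };

/* Expected type for attr, or T_UNSPEC if the table does not cover it. */
static enum demo_type demo_policy_type(const struct demo_policy *p,
                                       size_t attr)
{
    assert(p != NULL && p->types != NULL);

    if (attr > p->max_attr)
        return T_UNSPEC;
    return p->types[attr];
}
```

Attributes 1 to 9 validate identically under both tables, which is why the
bug is easy to miss; attribute 10 is only described by the BCN table.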

Fixes: 859ee3c438 ("DCB: Add support for DCB BCN")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230801013248.87240-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-01 21:07:46 -07:00
Jakub Kicinski ceaac91dcd net: make sure we never create ifindex = 0
Instead of allocating from 1 use proper xa_init flag,
to protect ourselves from IDs wrapping back to 0.

Fixes: 759ab1edb5 ("net: store netdevs in an xarray")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/all/20230728162350.2a6d4979@hermes.local/
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230731171159.988962-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-01 15:01:29 -07:00
Leon Romanovsky f3ec2b5d87 xfrm: don't skip free of empty state in acquire policy
In the destruction flow, assigning NULL to xso->dev caused the
xfrm_dev_state_free() call, made from the xfrm_state_put(to_put)
routine, to be skipped.

Instead of open-coding variants of xfrm_dev_state_delete() and
xfrm_dev_state_free(), let's use them directly.

Fixes: f8a70afafc ("xfrm: add TX datapath support for IPsec packet offload mode")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-08-01 12:04:43 +02:00
Leon Romanovsky 982c3aca8b xfrm: delete offloaded policy
The policy memory was released, but the HW driver data was not. Add
call to xfrm_dev_policy_delete(), so drivers will have a chance
to release their resources.

Fixes: 919e43fad5 ("xfrm: add an interface to offload policy")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-08-01 12:04:43 +02:00
Christian Marangi de9db136dc net: dsa: tag_qca: return early if dev is not found
Currently checksum is recalculated and dsa tag stripped even if we later
don't find the dev.

To improve the code, exit early if we don't find the dev and skip additional
operations on the skb since it will be freed anyway.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20230730074113.21889-1-ansuelsmth@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 12:02:42 +02:00
Pedro Tammela e20e75017c net/sched: sch_qfq: warn about class in use while deleting
Add extack to warn that delete was rejected because
the class is still in use

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 10:47:25 +02:00
Pedro Tammela 7118f56e04 net/sched: sch_htb: warn about class in use while deleting
Add extack to warn that delete was rejected because
the class is still in use

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 10:47:24 +02:00
Pedro Tammela 8e4553ef3e net/sched: sch_hfsc: warn about class in use while deleting
Add extack to warn that delete was rejected because
the class is still in use

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 10:47:24 +02:00
Pedro Tammela daf8d9181b net/sched: sch_drr: warn about class in use while deleting
Add extack to warn that delete was rejected because
the class is still in use

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 10:47:24 +02:00
Pedro Tammela 8798481b66 net/sched: wrap open coded Qdics class filter counter
The 'filter_cnt' counter is used to control a Qdisc class lifetime.
Each filter referencing this class by its id will eventually
increment/decrement this counter in their respective
'add/update/delete' routines.
As these operations are always serialized under rtnl lock, we don't
need an atomic type like 'refcount_t'.

It also means that we lose the overflow/underflow checks already
present in refcount_t, which are valuable to hunt down bugs
where the unsigned counter wraps around as it aids automated tools
like syzkaller to scream in such situations.

Wrap the open coded increment/decrement into helper functions and
add overflow checks to the operations.
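
A user-space sketch of such helpers, under assumptions: the demo_* names are
hypothetical, the GCC/Clang __builtin_*_overflow intrinsics stand in for the
kernel's overflow-checking macros, and a failed check is reported through the
return value instead of a kernel warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_class {
    unsigned int filter_cnt;  /* plain counter, serialized by the caller */
};

/* Increment; returns false and leaves the counter alone on overflow. */
static bool demo_class_get(struct demo_class *cl)
{
    unsigned int res;

    assert(cl != NULL);
    if (__builtin_add_overflow(cl->filter_cnt, 1u, &res))
        return false;
    cl->filter_cnt = res;
    return true;
}

/* Decrement; returns false and leaves the counter alone on underflow. */
static bool demo_class_put(struct demo_class *cl)
{
    unsigned int res;

    assert(cl != NULL);
    if (__builtin_sub_overflow(cl->filter_cnt, 1u, &res))
        return false;
    cl->filter_cnt = res;
    return true;
}

/* A class with filters still bound to it must not be deleted. */
static bool demo_class_in_use(const struct demo_class *cl)
{
    assert(cl != NULL);
    return cl->filter_cnt > 0;
}
```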

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 10:47:24 +02:00
Tomas Glozar 13d2618b48 bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire
Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC
allocation later in sk_psock_init_link on PREEMPT_RT kernels, since
GFP_ATOMIC might sleep on RT (see bpf: Make BPF and PREEMPT_RT co-exist
patchset notes for details).

This causes calling bpf_map_update_elem on BPF_MAP_TYPE_SOCKMAP maps to
BUG (sleeping function called from invalid context) on RT kernels.

preempt_disable was introduced together with lock_sk and rcu_read_lock
in commit 99ba2b5aba ("bpf: sockhash, disallow bpf_tcp_close and update
in parallel"), probably to match disabled migration of BPF programs, and
is no longer necessary.

Remove preempt_disable to fix BUG in sock_map_update_common on RT.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/all/20200224140131.461979697@linutronix.de/
Fixes: 99ba2b5aba ("bpf: sockhash, disallow bpf_tcp_close and update in parallel")
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230728064411.305576-1-tglozar@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01 09:24:34 +02:00
Yue Haibing 2f48401dd0 net/hsr: Remove unused function declarations
commit f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
introduced these but never implemented them.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230729123456.36340-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 20:11:47 -07:00
valis b80b829e9e net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
When route4_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.

This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.

Fix this by no longer copying the tcf_result struct from the old filter.
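
The sequence can be modelled in user space. This is a simplified sketch with
hypothetical demo_* types, not the classifier code: the copy_result flag
selects between the old behaviour (copying the result) and the fixed one, and
the replacement filter simply starts unbound in the fixed case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_class  { unsigned int filter_cnt; };
struct demo_result { struct demo_class *cls; };
struct demo_filter { struct demo_result res; };

static void demo_bind(struct demo_filter *f, struct demo_class *cl)
{
    f->res.cls = cl;
    cl->filter_cnt++;
}

static void demo_unbind(struct demo_filter *f)
{
    if (f->res.cls) {
        f->res.cls->filter_cnt--;
        f->res.cls = NULL;
    }
}

/* Update path: set up the replacement, then unbind the old instance. */
static void demo_change(struct demo_filter *old, struct demo_filter *repl,
                        bool copy_result)
{
    assert(old != NULL && repl != NULL);

    repl->res.cls = NULL;
    if (copy_result)
        repl->res = old->res;  /* pointer copied, count not bumped */
    demo_unbind(old);
}

/* A class with no bound filters may be deleted. */
static bool demo_class_deletable(const struct demo_class *cl)
{
    return cl->filter_cnt == 0;
}
```

With copy_result set, the class ends up deletable while the replacement
filter still points at it, which is the use-after-free window. Without the
copy, no stale pointer survives the update.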

Fixes: 1109c00547 ("net: sched: RCU cls_route")
Reported-by: valis <sec@valis.email>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-4-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 20:10:37 -07:00
valis 76e42ae831 net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free
When fw_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.

This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.

Fix this by no longer copying the tcf_result struct from the old filter.

Fixes: e35a8ee599 ("net: sched: fw use RCU")
Reported-by: valis <sec@valis.email>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-3-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 20:10:36 -07:00
valis 3044b16e7c net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
When u32_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.

This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.

Fix this by no longer copying the tcf_result struct from the old filter.

Fixes: de5df63228 ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Reported-by: valis <sec@valis.email>
Reported-by: M A Ramdhan <ramdhan@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 20:10:36 -07:00
Daniel Xu 81584c23f2 netfilter: bpf: Only define get_proto_defrag_hook() if necessary
Before, we were getting this warning:

  net/netfilter/nf_bpf_link.c:32:1: warning: 'get_proto_defrag_hook' defined but not used [-Wunused-function]

Guard the definition with CONFIG_NF_DEFRAG_IPV[4|6].

Fixes: 91721c2d02 ("netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307291213.fZ0zDmoG-lkp@intel.com/
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b128b6489f0066db32c4772ae4aaee1480495929.1690840454.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-31 17:51:13 -07:00
Yue Haibing 634e449719 vsock: Remove unused function declarations
These have never been implemented since their introduction in
commit d021c34405 ("VSOCK: Introduce VM Sockets")

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230729122036.32988-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 14:41:08 -07:00
Yue Haibing 4cbc32a8a2 net/smc: Remove unused function declarations
Commit f9aab6f2ce ("net/smc: immediate freeing in smc_lgr_cleanup_early()")
left behind the smc_lgr_schedule_free_work_fast() declaration.
And since commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
smc_ib_modify_qp_reset() is not used anymore.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20230729121929.17180-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 14:40:26 -07:00
Lorenz Bauer 74bdfab4fd net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
There are already INDIRECT_CALLABLE_DECLARE in the hashtable
headers, no need to declare them again.

Fixes: 0f495f7617 ("net: remove duplicate reuseport_lookup functions")
Suggested-by: Martin Lau <martin.lau@linux.dev>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230731-indir-call-v1-1-4cd0aeaee64f@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 13:53:10 -07:00
Jiri Slaby 659705d0a6 Bluetooth: rfcomm: remove casts from tty->driver_data
tty->driver_data is 'void *', so there is no need to cast from that.
Therefore remove the casts and assign the pointer directly.

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: linux-bluetooth@vger.kernel.org
Link: https://lore.kernel.org/r/20230731080244.2698-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-31 17:16:05 +02:00
Kuniyuki Iwashima 8936bf53a0 net: Use sockaddr_storage for getsockopt(SO_PEERNAME).
Commit df8fc4e934 ("kbuild: Enable -fstrict-flex-arrays=3") started
applying strict rules to standard string functions.

It does not work well with conventional socket code around each protocol-
specific sockaddr_XXX struct, which is cast from sockaddr_storage and has
a bigger size than fortified functions expect.  See these commits:

 commit 06d4c8a808 ("af_unix: Fix fortify_panic() in unix_bind_bsd().")
 commit ecb4534b6a ("af_unix: Terminate sun_path when bind()ing pathname socket.")
 commit a0ade8404c ("af_packet: Fix warning of fortified memcpy() in packet_getname().")

We must cast the protocol-specific address back to sockaddr_storage
to call such functions.

However, in the case of getsockopt(SO_PEERNAME), the rationale is a bit
unclear as the buffer is defined by char[128] which is the same size as
sockaddr_storage.

Let's use sockaddr_storage explicitly.
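
As a user-space illustration of the point above (not the kernel change
itself), the sketch below declares the buffer as struct sockaddr_storage
rather than a raw char[128] and copies a protocol-specific address into
it. The helper names fill_peername() and peername_family_demo() are
hypothetical, and the 128-byte size is what Linux headers define.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Store an IPv4 peer address in a sockaddr_storage buffer.
 * Returns the number of bytes actually used by the address. */
size_t fill_peername(struct sockaddr_storage *buf)
{
	struct sockaddr_in peer;

	assert(buf != NULL);
	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(443);

	/* The destination is typed as the full storage struct, so its
	 * size is visible to the compiler instead of being implied. */
	memset(buf, 0, sizeof(*buf));
	memcpy(buf, &peer, sizeof(peer));
	return sizeof(peer);
}

/* Hypothetical demo wrapper: returns the address family read back
 * through the generic ss_family field. */
int peername_family_demo(void)
{
	struct sockaddr_storage ss;

	fill_peername(&ss);
	return ss.ss_family;
}
```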

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-31 09:14:16 +01:00
Kuniyuki Iwashima e739718444 net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.
syzkaller found a zero division error [0] in div_s64_rem() called from
get_cycle_time_elapsed(), where sched->cycle_time is the divisor.

We have tests in parse_taprio_schedule() so that cycle_time will never
be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed().

The problem is that the types of the divisor differ; cycle_time is
s64, but the divisor argument of div_s64_rem() is s32.

syzkaller fed the input below, and 0x100000000 is truncated to 0 when
cast to s32.

  @TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000}

We use s64 for cycle_time to cast it to ktime_t, so let's keep it and
set max for cycle_time.

While at it, we prevent overflow in setup_txtime() and add another
test in parse_taprio_schedule() to check if cycle_time overflows.

Also, we add a new tdc test case for this issue.
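
The truncation can be reproduced in user space. This is a minimal
sketch, assuming a compiler that keeps the low 32 bits when converting
int64_t to int32_t (GCC and Clang do); the function names are
hypothetical and are not taken from sch_taprio.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Models passing an s64 cycle_time to a parameter declared as s32. */
int32_t divisor_as_s32(int64_t cycle_time)
{
	return (int32_t)cycle_time; /* only the low 32 bits survive */
}

/* Hypothetical range check in the spirit of the fix: reject zero,
 * negative values, and anything that does not fit in an int. */
int cycle_time_is_valid(int64_t cycle_time)
{
	return cycle_time > 0 && cycle_time <= INT_MAX;
}

/* Remainder of elapsed time within one cycle; safe only because the
 * divisor has been validated before it is narrowed. */
int64_t cycle_elapsed(int64_t elapsed, int64_t cycle_time)
{
	assert(cycle_time_is_valid(cycle_time));
	return elapsed % divisor_as_s32(cycle_time);
}
```

With the value from the report, divisor_as_s32(0x100000000) yields 0
even though the 64-bit value is non-zero, which is why a "not zero"
test on the s64 field alone does not protect the division.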

[0]:
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline]
RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline]
RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344
Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10
RSP: 0018:ffffc90000acf260 EFLAGS: 00010206
RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000
RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934
R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800
R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0edf84f0e8 CR3: 000000000d73c002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 get_packet_txtime net/sched/sch_taprio.c:508 [inline]
 taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577
 taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658
 dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732
 __dev_xmit_skb net/core/dev.c:3821 [inline]
 __dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169
 dev_queue_xmit include/linux/netdevice.h:3088 [inline]
 neigh_resolve_output net/core/neighbour.c:1552 [inline]
 neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532
 neigh_output include/net/neighbour.h:544 [inline]
 ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135
 __ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196
 ip6_finish_output net/ipv6/ip6_output.c:207 [inline]
 NF_HOOK_COND include/linux/netfilter.h:292 [inline]
 ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228
 dst_output include/net/dst.h:458 [inline]
 NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303
 ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508
 ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666
 addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175
 process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597
 worker_thread+0x60f/0x1240 kernel/workqueue.c:2748
 kthread+0x2fe/0x3f0 kernel/kthread.c:389
 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308
 </TASK>
Modules linked in:

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Co-developed-by: Eric Dumazet <edumazet@google.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-31 09:12:27 +01:00
Ratheesh Kannoth 2b3082c6ef net: flow_dissector: Use 64bits for used_keys
As 32bits of dissector->used_keys are exhausted,
increase the size to 64bits.

This is base change for ESP/AH flow dissector patch.
Please find patch and discussions at
https://lore.kernel.org/netdev/ZMDNjD46BvZ5zp5I@corigine.com/T/#t

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Tested-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-31 09:11:24 +01:00
Lin Ma 5e2424708d xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH
The previous commit 4e484b3e96 ("xfrm: rate limit SA mapping change
message to user space") added one additional attribute named
XFRMA_MTIMER_THRESH and described its type at compat_policy
(net/xfrm/xfrm_compat.c).

However, the author forgot to also describe the nla_policy at
xfrma_policy (net/xfrm/xfrm_user.c). Hence, this supposedly NLA_U32 (4
bytes) value can be faked as empty (0 bytes) by a malicious user, which
leads to a 4-byte out-of-bounds read and a heap information leak when
parsing nlattrs.

To exploit this, a malicious user can spray the SLUB objects and then
leverage this 4-byte OOB read to leak the heap data into
x->mapping_maxage (see xfrm_update_ae_params(...)), and leak it to
userspace via copy_to_user_state_extra(...).

The above bug is assigned CVE-2023-3773. To fix it, this commit just
completes the nla_policy description for XFRMA_MTIMER_THRESH, which
enforces the length check and avoids such OOB read.
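
The kind of length check that a policy entry provides can be sketched
in user space as follows. This is a toy model only: struct toy_attr and
both functions are hypothetical and are not the kernel's nlattr or
nla_policy API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy attribute: a claimed payload length plus a pointer to the data. */
struct toy_attr {
	uint16_t payload_len;
	const unsigned char *payload;
};

/* Returns 0 and stores the value, or -1 if the payload is too short
 * to hold a u32. Without this check, a zero-length attribute would
 * make the memcpy() read 4 bytes that do not belong to it. */
int toy_attr_get_u32(const struct toy_attr *attr, uint32_t *value)
{
	assert(attr != NULL && value != NULL);
	if (attr->payload_len < sizeof(*value))
		return -1;
	memcpy(value, attr->payload, sizeof(*value));
	return 0;
}

/* Hypothetical demo wrapper: builds an attribute with the claimed
 * length and reports whether reading it as a u32 is accepted. */
int toy_attr_demo(uint16_t claimed_len)
{
	static const unsigned char buf[4] = { 1, 0, 0, 0 };
	struct toy_attr attr = { claimed_len, buf };
	uint32_t value = 0;

	return toy_attr_get_u32(&attr, &value);
}
```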

Fixes: 4e484b3e96 ("xfrm: rate limit SA mapping change message to user space")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-07-31 08:20:08 +02:00
Lin Ma 00374d9b6d xfrm: add NULL check in xfrm_update_ae_params
Normally, x->replay_esn and x->preplay_esn should be allocated by
xfrm_alloc_replay_state_esn(...) in xfrm_state_construct(...), so it is
safe for xfrm_update_ae_params(...) to update them. However, the current
implementation of xfrm_new_ae(...) allows a malicious user to trigger a
NULL pointer dereference and crash the kernel as shown below.

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 8253067 P4D 8253067 PUD 8e0e067 PMD 0
Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 PID: 98 Comm: poc.npd Not tainted 6.4.0-rc7-00072-gdad9774deaf1 #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.o4
RIP: 0010:memcpy_orig+0xad/0x140
Code: e8 4c 89 5f e0 48 8d 7f e0 73 d2 83 c2 20 48 29 d6 48 29 d7 83 fa 10 72 34 4c 8b 06 4c 8b 4e 08 c
RSP: 0018:ffff888008f57658 EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff888008bd0000 RCX: ffffffff8238e571
RDX: 0000000000000018 RSI: ffff888007f64844 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888008f57818
R13: ffff888007f64aa4 R14: 0000000000000000 R15: 0000000000000000
FS:  00000000014013c0(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000054d8000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ? __die+0x1f/0x70
 ? page_fault_oops+0x1e8/0x500
 ? __pfx_is_prefetch.constprop.0+0x10/0x10
 ? __pfx_page_fault_oops+0x10/0x10
 ? _raw_spin_unlock_irqrestore+0x11/0x40
 ? fixup_exception+0x36/0x460
 ? _raw_spin_unlock_irqrestore+0x11/0x40
 ? exc_page_fault+0x5e/0xc0
 ? asm_exc_page_fault+0x26/0x30
 ? xfrm_update_ae_params+0xd1/0x260
 ? memcpy_orig+0xad/0x140
 ? __pfx__raw_spin_lock_bh+0x10/0x10
 xfrm_update_ae_params+0xe7/0x260
 xfrm_new_ae+0x298/0x4e0
 ? __pfx_xfrm_new_ae+0x10/0x10
 ? __pfx_xfrm_new_ae+0x10/0x10
 xfrm_user_rcv_msg+0x25a/0x410
 ? __pfx_xfrm_user_rcv_msg+0x10/0x10
 ? __alloc_skb+0xcf/0x210
 ? stack_trace_save+0x90/0xd0
 ? filter_irq_stacks+0x1c/0x70
 ? __stack_depot_save+0x39/0x4e0
 ? __kasan_slab_free+0x10a/0x190
 ? kmem_cache_free+0x9c/0x340
 ? netlink_recvmsg+0x23c/0x660
 ? sock_recvmsg+0xeb/0xf0
 ? __sys_recvfrom+0x13c/0x1f0
 ? __x64_sys_recvfrom+0x71/0x90
 ? do_syscall_64+0x3f/0x90
 ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
 ? copyout+0x3e/0x50
 netlink_rcv_skb+0xd6/0x210
 ? __pfx_xfrm_user_rcv_msg+0x10/0x10
 ? __pfx_netlink_rcv_skb+0x10/0x10
 ? __pfx_sock_has_perm+0x10/0x10
 ? mutex_lock+0x8d/0xe0
 ? __pfx_mutex_lock+0x10/0x10
 xfrm_netlink_rcv+0x44/0x50
 netlink_unicast+0x36f/0x4c0
 ? __pfx_netlink_unicast+0x10/0x10
 ? netlink_recvmsg+0x500/0x660
 netlink_sendmsg+0x3b7/0x700

This NULL pointer dereference bug is assigned CVE-2023-3772. This commit
adds an additional NULL check in xfrm_update_ae_params to fix it.

Fixes: d8647b79c3 ("xfrm: Add user interface for esn and big anti-replay windows")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-07-31 08:06:34 +02:00
Eric Dumazet 8bf43be799 net: annotate data-races around sk->sk_priority
sk_getsockopt() runs locklessly. This means sk->sk_priority
can be read while other threads are changing its value.

Other reads also happen without socket lock being held.

Add missing annotations where needed.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet e5f0d2dd3c net: add missing data-race annotation for sk_ll_usec
In a prior commit I forgot that sk_getsockopt() reads
sk->sk_ll_usec without holding a lock.

Fixes: 0dbffbb533 ("net: annotate data race around sk_ll_usec")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet 11695c6e96 net: add missing data-race annotations around sk->sk_peek_off
sk_getsockopt() runs locklessly, thus we need to annotate the read
of sk->sk_peek_off.

While we are at it, add corresponding annotations to sk_set_peek_off()
and unix_set_peek_off().

Fixes: b9bb53f383 ("sock: convert sk_peek_offset functions to WRITE_ONCE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet 3c5b4d69c3 net: annotate data-races around sk->sk_mark
sk->sk_mark is often read while another thread could change the value.

Fixes: 4a19ec5800 ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet b4b5532530 net: add missing READ_ONCE(sk->sk_rcvbuf) annotation
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_rcvbuf locklessly.

Fixes: ebb3b78db7 ("tcp: annotate sk->sk_rcvbuf lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet 74bc084327 net: add missing READ_ONCE(sk->sk_sndbuf) annotation
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_sndbuf locklessly.

Fixes: e292f05e0d ("tcp: annotate sk->sk_sndbuf lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet 285975dd67 net: annotate data-races around sk->sk_{rcv|snd}timeo
sk_getsockopt() runs without locks, we must add annotations
to sk->sk_rcvtimeo and sk->sk_sndtimeo.

In the future we might allow fetching these fields before
we lock the socket in TCP fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet e6d12bdb43 net: add missing READ_ONCE(sk->sk_rcvlowat) annotation
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_rcvlowat locklessly.

Fixes: eac66402d1 ("net: annotate sk->sk_rcvlowat lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet ea7f45ef77 net: annotate data-races around sk->sk_max_pacing_rate
sk_getsockopt() runs locklessly. This means sk->sk_max_pacing_rate
can be read while other threads are changing its value.

Fixes: 62748f32d5 ("net: introduce SO_MAX_PACING_RATE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet c76a032889 net: annotate data-race around sk->sk_txrehash
sk_getsockopt() runs locklessly. This means sk->sk_txrehash
can be read while other threads are changing its value.

Other locations were handled in commit cb6cd2cec7
("tcp: Change SYN ACK retransmit behaviour to account for rehash")

Fixes: 26859240e4 ("txhash: Add socket option to control TX hash rethink behavior")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:41 +01:00
Eric Dumazet fe11fdcb42 net: annotate data-races around sk->sk_reserved_mem
sk_getsockopt() runs locklessly. This means sk->sk_reserved_mem
can be read while other threads are changing its value.

Add missing annotations where they are needed.

Fixes: 2bb2f5fb21 ("net: add new socket option SO_RESERVE_MEM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 18:13:40 +01:00
Richard Gobert 7938cd1543 net: gro: fix misuse of CB in udp socket lookup
This patch fixes a misuse of IP{6}CB(skb) in GRO when calling
`udp6_lib_lookup2` while handling UDP tunnels. `udp6_lib_lookup2` fetches
the device from CB. The fix changes it to fetch the device from `skb->dev`.
The l3mdev case requires special attention since it has a master and a
slave device.

Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29 17:10:27 +01:00
Jamal Hadi Salim e68409db99 net: sched: cls_u32: Fix match key mis-addressing
A match entry is uniquely identified with an "address" or "path" in the
form of: hashtable ID(12b):bucketid(8b):nodeid(12b).

When creating table match entries, the hash table id, bucket id and
node id (match entry id) must each be either specified by the user or
filled in with a reasonable in-kernel default. The in-kernel default for
a table id is 0x800 (the omnipresent root table); for bucketid it is 0x0.
Prior to this fix there was no default for the nodeid, i.e. the code
assumed that the user passed a correct nodeid, and if the user passed a
nodeid of 0 (as Mingi Cho did) then that is what was used. But a nodeid
of 0 is reserved for identifying the table. This is not a problem until
we dump. The dump code notices that the nodeid is zero, assumes it is
referencing a table, and therefore interprets the entry as table struct
tc_u_hnode instead of what was created, i.e. match entry struct
tc_u_knode.

Mingi did the equivalent of:
tc filter add dev dummy0 parent 10: prio 1 handle 0x1000 \
protocol ip u32 match ip src 10.0.0.1/32 classid 10:1 action ok

This essentially specifies a table id of 0, a bucketid of 1 and a nodeid of zero:
Tableid 0 is remapped to the default of 0x800.
Bucketid 1 is ignored and defaults to 0x00.
Nodeid was assumed to be what Mingi passed - 0x000.
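
The 12:8:12 split of a handle described above can be sketched as
follows. The struct and function names are hypothetical; only the bit
layout comes from the description in this message.

```c
#include <assert.h>
#include <stdint.h>

/* The three components of a u32 filter handle. */
struct u32_handle {
	uint32_t htid;   /* hash table id, top 12 bits */
	uint32_t bucket; /* bucket id, middle 8 bits */
	uint32_t node;   /* node (match entry) id, low 12 bits */
};

/* Split a 32-bit handle into table id, bucket id and node id. */
struct u32_handle u32_handle_split(uint32_t handle)
{
	struct u32_handle h;

	h.htid = handle >> 20;
	h.bucket = (handle >> 12) & 0xff;
	h.node = handle & 0xfff;
	return h;
}

/* Inverse of the split; each component must fit in its field. */
uint32_t u32_handle_join(uint32_t htid, uint32_t bucket, uint32_t node)
{
	assert(htid <= 0xfff && bucket <= 0xff && node <= 0xfff);
	return (htid << 20) | (bucket << 12) | node;
}
```

Splitting the handle 0x1000 from the command above gives table id 0,
bucket 1 and node 0, which is the reserved node value that the dump
code treats as a table.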

dumping before fix shows:
~$ tc filter ls dev dummy0 parent 10:
filter protocol ip pref 1 u32 chain 0
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor -30591

Note that the last line reports a table instead of a match entry
(you can tell this because it says "ht divisor...").
As a result of reporting the wrong data type (misinterpreting struct
tc_u_knode as struct tc_u_hnode) the divisor is reported with a value
of -30591. Mingi identified this as part of the heap address
(physmap_base is 0xffff8880 (-30591 - 1)).

The fix is to ensure that when table entry matches are added and no
nodeid is specified (i.e nodeid == 0) then we get the next available
nodeid from the table's pool.

After the fix, this is what the dump shows:
$ tc filter ls dev dummy0 parent 10:
filter protocol ip pref 1 u32 chain 0
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 1 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 10:1 not_in_hw
  match 0a000001/ffffffff at 12
	action order 1: gact action pass
	 random type none pass val 0
	 index 1 ref 1 bind 1

Reported-by: Mingi Cho <mgcho.minic@gmail.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230726135151.416917-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 18:05:04 -07:00
Daniel Xu 91721c2d02 netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link is active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.

We also take care to avoid any issues w.r.t. module unloading -- while
defrag is active on a link, the module is prevented from unloading.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 16:52:08 -07:00
Daniel Xu 9abddac583 netfilter: defrag: Add glue hooks for enabling/disabling defrag
We want to be able to enable/disable IP packet defrag from core
bpf/netfilter code. In other words, execute code from core that could
possibly be built as a module.

To help avoid symbol resolution errors, use glue hooks that the modules
will register callbacks with during module init.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/f6a8824052441b72afe5285acedbd634bd3384c1.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 16:52:08 -07:00
Jakub Kicinski 05191d8896 Merge branch 'in-kernel-support-for-the-tls-alert-protocol'
Chuck Lever says:

====================
In-kernel support for the TLS Alert protocol

IMO the kernel doesn't need user space (ie, tlshd) to handle the TLS
Alert protocol. Instead, a set of small helper functions can be used
to handle sending and receiving TLS Alerts for in-kernel TLS
consumers.
====================

Merged on top of a tag in case it's needed in the NFS tree.

Link: https://lore.kernel.org/r/169047923706.5241.1181144206068116926.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:08:02 -07:00
Chuck Lever b470985c76 net/handshake: Trace events for TLS Alert helpers
Add observability for the new TLS Alert infrastructure.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047947409.5241.14548832149596892717.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 39067dda1d SUNRPC: Use new helpers to handle TLS Alerts
Use the helpers to parse the level and description fields in
incoming alerts. "Warning" alerts are discarded, and "fatal"
alerts mean the session is no longer valid.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047944747.5241.1974889594004407123.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 39d0e38dcc net/handshake: Add helpers for parsing incoming TLS Alerts
Kernel TLS consumers can replace common TLS Alert parsing code with
these helpers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047942074.5241.13791647439480672048.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 5dd5ad682c SUNRPC: Send TLS Closure alerts before closing a TCP socket
Before closing a TCP connection, the TLS protocol wants peers to
send session close Alert notifications. Add those in both the RPC
client and server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047939404.5241.14392506226409865832.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 35b1b538d4 net/handshake: Add API for sending TLS Closure alerts
This helper sends an alert only if a TLS session was established.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047936730.5241.618595693821012638.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Chuck Lever 6a7eccef47 net/tls: Move TLS protocol elements to a separate header
Kernel TLS consumers will need definitions of various parts of the
TLS protocol, but often do not need the function declarations and
other infrastructure provided in <net/tls.h>.

Break out existing standardized protocol elements into a separate
header, and make room for a few more elements in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047931374.5241.7713175865185969309.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 14:07:59 -07:00
Patrick Rohr 5027d54a9c net: change accept_ra_min_rtr_lft to affect all RA lifetimes
accept_ra_min_rtr_lft only considered the lifetime of the default route
and discarded entire RAs accordingly.

This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and
applies the value to individual RA sections; in particular, router
lifetime, PIO preferred lifetime, and RIO lifetime. If any of those
lifetimes are lower than the configured value, the specific RA section
is ignored.

In order for the sysctl to be useful to Android, it should really apply
to all lifetimes in the RA, since that is what determines the minimum
frequency at which RAs must be processed by the kernel. Android uses
hardware offloads to drop RAs for a fraction of the minimum of all
lifetimes present in the RA (some networks have very frequent RAs (5s)
with high lifetimes (2h)). Despite this, we have encountered networks
that set the router lifetime to 30s which results in very frequent CPU
wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the
WiFi firmware) entirely on such networks, it seems better to ignore the
misconfigured routers while still processing RAs from other IPv6 routers
on the same network (i.e. to support IoT applications).

The previous implementation dropped the entire RA based on router
lifetime. This turned out to be hard to expand to the other lifetimes
present in the RA in a consistent manner; dropping the entire RA based
on RIO/PIO lifetimes would essentially require parsing the whole thing
twice.

Fixes: 1671bcfd76 ("net: add sysctl accept_ra_min_rtr_lft")
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230726230701.919212-1-prohr@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 13:30:51 -07:00
Jakub Kicinski 84e00d9bd4 net: convert some netlink netdev iterators to depend on the xarray
Reap the benefits of easier iteration thanks to the xarray.
Convert just the genetlink ones, those are easier to test.

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 11:35:58 -07:00
Jakub Kicinski 759ab1edb5 net: store netdevs in an xarray
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks" so we need to save the position
after each chunk, so we know where to restart from. Because
netdevs are stored in a hash table we remember which bucket
we were in and how many devices we dumped.

Since we don't hold any locks across the "chunks" - devices may
come and go while we're dumping. If that happens we may miss
a device (if device is deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right..

To illustrate let's look at an example:

               System state:
  start:       # [A, B, C]
  del:  B      # [A, C]

with the hash table we may dump [A, B], missing C completely even
tho it existed both before and after the "del B".

Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:

               System state:
  start:       # [A, B]
  add:  C      # [A, C, B]
  del:  B      # [A, C]

we may generate a dump of [A], if C got an index between A and B.
System has never been in such state. But I'm 90% sure that's perfectly
fine, important part is that we can't _miss_ devices which exist before
and after. User space which wants to mirror kernel's state subscribes
to notifications and does periodic dumps so it will know that C exists
from the notification about its creation or from the next dump
(next dump is _guaranteed_ to include C, if it doesn't get removed).
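
The difference between the two resume strategies can be modelled in a
few lines of user-space C. This is a toy sketch, not kernel code: plain
sorted arrays stand in for the hash bucket and the xarray, and all
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Resume by "how many entries were already dumped" (hash-bucket style). */
size_t dump_by_count(const int *devs, size_t ndevs, size_t skip,
		     size_t chunk, int *out)
{
	size_t n = 0;

	for (size_t i = skip; i < ndevs && n < chunk; i++)
		out[n++] = devs[i];
	return n;
}

/* Resume at the first ifindex greater than the last one dumped
 * (ordered-index style). devs must be sorted in ascending order. */
size_t dump_by_index(const int *devs, size_t ndevs, int last_ifindex,
		     size_t chunk, int *out)
{
	size_t n = 0;

	for (size_t i = 0; i < ndevs && n < chunk; i++)
		if (devs[i] > last_ifindex)
			out[n++] = devs[i];
	return n;
}

/* Scenario from the first example: dump two of [A, B, C], delete B,
 * then dump the rest. Returns 1 if C (ifindex 3) is in the second chunk. */
int second_chunk_sees_c(int use_index)
{
	const int before[] = { 1, 2, 3 }; /* A, B, C */
	const int after[] = { 1, 3 };     /* B deleted between chunks */
	int out[2];
	size_t n;

	n = use_index ? dump_by_index(before, 3, 0, 2, out)
		      : dump_by_count(before, 3, 0, 2, out);
	assert(n == 2 && out[0] == 1 && out[1] == 2); /* first chunk: [A, B] */

	n = use_index ? dump_by_index(after, 2, out[1], 2, out)
		      : dump_by_count(after, 2, 2, 2, out);
	return n == 1 && out[0] == 3;
}
```

Resuming by count skips past C once B is gone, while resuming by
ifindex picks up at the first index above the last one reported.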

To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices and microbenchmarking 1M lookups
on Skylake I get the following results (not counting loopback
to number of devs):

 #devs | hash |  xa  | delta
    2  | 18.3 | 20.1 | + 9.8%
   16  | 18.3 | 20.1 | + 9.5%
   64  | 18.3 | 26.3 | +43.8%
  128  | 20.4 | 26.3 | +28.6%
  256  | 20.0 | 26.4 | +32.1%
 1024  | 26.6 | 26.7 | + 0.2%
 8192  |541.3 | 33.5 | -93.8%

No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order, if the pattern is more
random xa starts to win at 512 devices already. But that's a lot
of devices, in practice.

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 11:35:58 -07:00
Linus Torvalds e62e26d3e9 A patch to reduce the potential for erroneous RBD exclusive lock
blocklisting (fencing) with a couple of prerequisites and a fixup to
 prevent metrics from being sent to the MDS even just once after that
 has been disabled by the user.  All marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTD7MgTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3SBB/4nHdfSQwy0z2+PM766sUKxSlmaRw8X
 4AJyGAIGj5BnHHhtluwLpEYfrh3wfyCRaNYgS64jdsudbUBxPKIIWn2lEFxtyWbC
 w0R2uEc+NGJLJOYfJ+lBP06Q2r6qk7N6OGNy6qLaN+v6xJ8WPw7H3fJBLVhnPgMq
 7lkACRN+0P5Xt6ZJ57kbWWiFQ+vjv7bbDa0P9zMl6uCgoYsIpvrskqygx+gHbdsq
 IcnpsHu3F0ycYAJT5eJ5GcCcThvwbNjWdbJy1fERah7U/LNcX/S3To9V5LPmydOQ
 tYAWMlC/1a99fr+jTYF0Pu5GLUdK0UMKeX04ZN3SKON4pORurpypw3n8
 =rw72
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to reduce the potential for erroneous RBD exclusive lock
  blocklisting (fencing) with a couple of prerequisites and a fixup to
  prevent metrics from being sent to the MDS even just once after that
  has been disabled by the user. All marked for stable"

* tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client:
  rbd: retrieve and check lock owner twice before blocklisting
  rbd: harden get_lock_owner_info() a bit
  rbd: make get_lock_owner_info() return a single locker or NULL
  ceph: never send metrics if disable_send_metrics is set
2023-07-28 10:47:24 -07:00
Linus Torvalds 28d79b746c Misc set of fixes for 9p in 6.5
Most of these clean up warnings we've gotten out of compilation tools, but
 several of them were from inspection while hunting down a couple of
 regressions.
 
 The most important one to pull is 75b396821c
 (fs/9p: remove unnecessary and overrestrictive check)
 which caused a regression for some folks by restricting mmap
 in any case where writeback caches weren't enabled.
 
 Most of the other bugs caught via inspection were type mismatches.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmTDH8cACgkQiP/V+0pf
 /5hSJQ//b59cDliC7Knf9B2Of1UsLJ2wYIbxWVYKLwYarKFn3tmtO5dPtWZrQzjB
 Kz6fif5z1c0WdjNFLifs/XNqUq5znX/TY8bV/NmOg8VlaoJqmUQSSYnNQOWZCFKT
 zwxC6BO6gPNNIkJN2xQ8oOq11Qon/nbZbuN9P2VDcT5Yr2KmFx6FHRcrBNRYAm3E
 UzFdjkLrLef3VrvegJNGM3Wv2HqyNBA6QhifZBjDkydtDPMd9fRNns7Q60AARR9K
 aqXV6SihE/Ox7sSmVNjTzYF67eq5Xjt+sSzo2SdfOaZxVIa6wf0UXQuFqmHts6Zs
 QUCdXS5YbQAwQfdkm22rnTIxAwsbEpFOGGweUvMBXzZbl/sq/PK4Nt6DpCS8ZFi3
 81Z5Ey+Q4yaxwdirP521M4ao2Ae2Fzg12bqDTNssZdOYGcXBqBfWiR5IfMbbkgWq
 WzCVI3V/LshQ75pXQyS4BtW/29C2nN7g3jLrF3Q5OTe7XmHMCZFvtP4lKvY0piQy
 ++XoDs1LCJWSZebfkNa05L5nhQ1mYhwiZutHTtF3ejTTiJvcJXQ4xHYLzjOON+4i
 blLTpgWLO0rIRAmX0I8GwPi6q0xL4rFP4XGGz/LDppRQkRa13vtzFtMvuldPyEq7
 g4pcLkI3SPbL982qYbg8UO+GjO/Q9M/DafXQVlUyDw04TQbT8Jg=
 =TAC1
 -----END PGP SIGNATURE-----

Merge tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p fixes from Eric Van Hensbergen:
 "Misc set of fixes for 9p.

  Most of these clean up warnings we've gotten out of compilation tools,
  but several of them were from inspection while hunting down a couple
  of regressions.

  The most important one is 75b396821c ("fs/9p: remove unnecessary and
  overrestrictive check") which caused a regression for some folks by
  restricting mmap in any case where writeback caches weren't enabled.

  Most of the other bugs caught via inspection were type mismatches"

* tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Remove unused extern declaration
  9p: remove dead stores (variable set again without being read)
  9p: virtio: skip incrementing unused variable
  9p: virtio: make sure 'offs' is initialized in zc_request
  9p: virtio: fix unlikely null pointer deref in handle_rerror
  9p: fix ignored return value in v9fs_dir_release
  fs/9p: remove unnecessary invalidate_inode_pages2
  fs/9p: fix type mismatch in file cache mode helper
  fs/9p: fix typo in comparison logic for cache mode
  fs/9p: remove unnecessary and overrestrictive check
  fs/9p: Fix a datatype used with V9FS_DIRECT_IO
2023-07-28 10:43:16 -07:00
Remi Pommarel eac27a41ab batman-adv: Do not get eth header before batadv_check_management_packet
If received skb in batadv_v_elp_packet_recv or batadv_v_ogm_packet_recv
is either cloned or non linearized then its data buffer will be
reallocated by batadv_check_management_packet when skb_cow or
skb_linearize get called. Thus getting ethernet header address inside
skb data buffer before batadv_check_management_packet had any chance to
reallocate it could lead to the following kernel panic:

  Unable to handle kernel paging request at virtual address ffffff8020ab069a
  Mem abort info:
    ESR = 0x96000007
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    FSC = 0x07: level 3 translation fault
  Data abort info:
    ISV = 0, ISS = 0x00000007
    CM = 0, WnR = 0
  swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000040f45000
  [ffffff8020ab069a] pgd=180000007fffa003, p4d=180000007fffa003, pud=180000007fffa003, pmd=180000007fefe003, pte=0068000020ab0706
  Internal error: Oops: 96000007 [#1] SMP
  Modules linked in: ahci_mvebu libahci_platform libahci dvb_usb_af9035 dvb_usb_dib0700 dib0070 dib7000m dibx000_common ath11k_pci ath10k_pci ath10k_core mwl8k_new nf_nat_sip nf_conntrack_sip xhci_plat_hcd xhci_hcd nf_nat_pptp nf_conntrack_pptp at24 sbsa_gwdt
  CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.15.42-00066-g3242268d425c-dirty #550
  Hardware name: A8k (DT)
  pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : batadv_is_my_mac+0x60/0xc0
  lr : batadv_v_ogm_packet_recv+0x98/0x5d0
  sp : ffffff8000183820
  x29: ffffff8000183820 x28: 0000000000000001 x27: ffffff8014f9af00
  x26: 0000000000000000 x25: 0000000000000543 x24: 0000000000000003
  x23: ffffff8020ab0580 x22: 0000000000000110 x21: ffffff80168ae880
  x20: 0000000000000000 x19: ffffff800b561000 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 00dc098924ae0032
  x14: 0f0405433e0054b0 x13: ffffffff00000080 x12: 0000004000000001
  x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
  x8 : 0000000000000000 x7 : ffffffc076dae000 x6 : ffffff8000183700
  x5 : ffffffc00955e698 x4 : ffffff80168ae000 x3 : ffffff80059cf000
  x2 : ffffff800b561000 x1 : ffffff8020ab0696 x0 : ffffff80168ae880
  Call trace:
   batadv_is_my_mac+0x60/0xc0
   batadv_v_ogm_packet_recv+0x98/0x5d0
   batadv_batman_skb_recv+0x1b8/0x244
   __netif_receive_skb_core.isra.0+0x440/0xc74
   __netif_receive_skb_one_core+0x14/0x20
   netif_receive_skb+0x68/0x140
   br_pass_frame_up+0x70/0x80
   br_handle_frame_finish+0x108/0x284
   br_handle_frame+0x190/0x250
   __netif_receive_skb_core.isra.0+0x240/0xc74
   __netif_receive_skb_list_core+0x6c/0x90
   netif_receive_skb_list_internal+0x1f4/0x310
   napi_complete_done+0x64/0x1d0
   gro_cell_poll+0x7c/0xa0
   __napi_poll+0x34/0x174
   net_rx_action+0xf8/0x2a0
   _stext+0x12c/0x2ac
   run_ksoftirqd+0x4c/0x7c
   smpboot_thread_fn+0x120/0x210
   kthread+0x140/0x150
   ret_from_fork+0x10/0x20
  Code: f9403844 eb03009f 54fffee1 f94

Thus ethernet header address should only be fetched after
batadv_check_management_packet has been called.
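
The ordering issue can be illustrated with a small userspace sketch. Everything here is hypothetical (struct pkt, pkt_make_private() and pkt_hdr_after_check() are illustration names, not batman-adv code); the malloc/free copy merely stands in for the reallocation that skb_cow or skb_linearize may perform:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a buffer that a check routine may
 * move to a new allocation. */
struct pkt {
	unsigned char *data;
	size_t len;
};

/* Stand-in for the check routine: copies the data into a fresh buffer
 * and frees the old one. Returns 0 on success, -1 on allocation failure. */
static int pkt_make_private(struct pkt *p)
{
	unsigned char *copy = malloc(p->len);

	if (!copy)
		return -1;
	memcpy(copy, p->data, p->len);
	free(p->data);		/* any pointer into the old buffer now dangles */
	p->data = copy;
	return 0;
}

/* Correct ordering: run the check first, then derive the header pointer. */
static const unsigned char *pkt_hdr_after_check(struct pkt *p)
{
	if (pkt_make_private(p) != 0)
		return NULL;
	return p->data;		/* fetched only after the possible reallocation */
}
```

A header pointer saved before pkt_make_private() would refer to freed memory afterwards, which is the pattern the trace above points at.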

Fixes: 0da0035942 ("batman-adv: OGMv2 - add basic infrastructure")
Cc: stable@vger.kernel.org
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-07-28 15:39:38 +02:00
Hangbin Liu 7f6c40391a IPv6: add extack info for IPv6 address add/delete
Add extack info for IPv6 address add/delete, which would be useful for
users to understand the problem without having to read kernel code.

Suggested-by: Beniamino Galvani <bgalvani@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-28 11:01:56 +01:00
Joe Damato 801b27e880 net: ethtool: Unify ETHTOOL_{G,S}RXFH rxnfc copy
ETHTOOL_GRXFH correctly copies in the full struct ethtool_rxnfc when
FLOW_RSS is set; ETHTOOL_SRXFH needs a similar code path to handle the
FLOW_RSS case so that ethtool can set the flow hash for custom RSS
contexts (if supported by the driver).

The copy code from ETHTOOL_GRXFH has been pulled out in to a helper so
that it can be called in both ETHTOOL_{G,S}RXFH code paths.

Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-28 09:35:53 +01:00
Rob Herring 3d40aed862 net: Explicitly include correct DT includes
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it was merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.

Acked-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230727014944.3972546-1-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 20:33:16 -07:00
Jakub Kicinski 5908a4c47c netfilter net-next pull request 2023-07-27
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTCcgkNHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2AJmMD/9IPWnzSNLUgoAhSo0h2OkCKl2iIdRnkrPrruhE
 Su8bD8ohmU100iN1DMXT2a7C9o0BTog4EB7WtF21z+06dUhROiZizrSt8bTk/rRi
 0+Sm9xlDAdl3CZcU8fnVjwf6PLYgUv5zVjcQc4Ggf15MwEIdpviKCps2bbBtrozF
 PJEK6+UwTU6+z4GSTc957nhFHstEcwktyxoaAote98CD78G2YCQT5yVbfctHgRm0
 9qovT8S/zZmqHvqvUfrqJd+N5V/+40O7ZuFls93kYxK9Bttx9wRwEqALPldxXudU
 o0kG4QZ8NAwiIVsGqPwKu/cKi9PF0z/PUXYgVdnkKK+XofBDHbHyfR+BJO1ejOdX
 +ea9AoQ6lD6NVmvX01+lF9OI4D1zgc6pLGyjSsyVgv3x0iKJeZ8QOgb0DTGFiG1U
 MnFIeckedrh/dt3NXLG/blZvuAzhofHqEhH/DlvbI/QBtN2zEgIMJKxRfBAMs3OO
 WAIlaHASQFVbyrHOr/X3FoNDTsvZyrTppo9WwJVTj9F41lYXzWoiBY+nVj2brGDR
 SMW1M13sufRBQlk0aTpPYPvcS5FhsMf6ggxygi2rNxX5/AdFE02nnEU9ybpHAqcy
 NiZ8kCxJ2J9+aCj7yvJ7QQcAD7l2tAIeAZCKSlKteigqTI0PWoTUc0IYPT85URLm
 cy/l4A==
 =fgLz
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-23-07-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter updates for net-next

1.  silence a harmless warning for CONFIG_NF_CONNTRACK_PROCFS=n builds,
 from Zhu Wang.

2, 3:
Allow NLA_POLICY_MASK to be used with BE16/BE32 types, and replace a few
manual checks with nla_policy based one in nf_tables, from myself.

4: cleanup in ctnetlink to validate while parsing rather than
   using two steps, from Lin Ma.

5: refactor boyer-moore textsearch by moving a small chunk to
   a helper function, from Jeremy Sowden.

* tag 'nf-next-23-07-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  lib/ts_bm: add helper to reduce indentation and improve readability
  netfilter: conntrack: validate cta_ip via parsing
  netfilter: nf_tables: use NLA_POLICY_MASK to test for valid flag options
  netlink: allow be16 and be32 types in all uint policy checks
  nf_conntrack: fix -Wunused-const-variable=
====================

Link: https://lore.kernel.org/r/20230727133604.8275-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 20:25:43 -07:00
Jakub Kicinski bb85e12f8f Merge branch 'net-tls-fixes-for-nvme-over-tls'
Hannes Reinecke says:

====================
net/tls: fixes for NVMe-over-TLS

here are some small fixes to get NVMe-over-TLS up and running.
The first set are just minor modifications to have MSG_EOR handled
for TLS, but the second set implements the ->read_sock() callback
for tls_sw.
The ->read_sock() callbacks return -EIO when encountering any TLS
Alert message, but as that's the default behaviour anyway I guess
we can get away with it.
====================

Applied on top of the tag in case Sagi gets convinced to pull it.

Link: https://lore.kernel.org/r/20230726191556.41714-1-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 20:11:48 -07:00
Eric Dumazet 4d50e50045 net: flower: fix stack-out-of-bounds in fl_set_key_cfm()
Typical misuse of

	nla_parse_nested(array, XXX_MAX, ...);

array must be declared as

	struct nlattr *array[XXX_MAX + 1];

v2: Based on feedback from Ido Schimmel and Zahari Doychev,
I also changed TCA_FLOWER_KEY_CFM_OPT_MAX and cfm_opt_policy
definitions.
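
As a rough userspace sketch of why the extra slot is needed: attribute types run from 0 to MAX inclusive, so a parser that clears and fills tb[0..MAX] touches MAX + 1 pointers. All names below (demo_attr, demo_parse, DEMO_MAX) are hypothetical and only mimic the shape of the netlink helpers:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute numbering: valid types are 0..DEMO_MAX inclusive. */
enum { DEMO_UNSPEC, DEMO_A, DEMO_B, DEMO_COUNT };
#define DEMO_MAX (DEMO_COUNT - 1)

struct demo_attr {
	int type;
	int value;
};

/* Mimics the parser contract: clears and fills tb[0..maxtype],
 * i.e. maxtype + 1 slots. */
static void demo_parse(const struct demo_attr **tb, int maxtype,
		       const struct demo_attr *attrs, size_t n)
{
	size_t i;

	memset(tb, 0, sizeof(*tb) * (size_t)(maxtype + 1));
	for (i = 0; i < n; i++)
		if (attrs[i].type > 0 && attrs[i].type <= maxtype)
			tb[attrs[i].type] = &attrs[i];
}

/* Returns the value of the DEMO_B attribute, or -1 if it is absent. */
static int demo_lookup_b(const struct demo_attr *attrs, size_t n)
{
	const struct demo_attr *tb[DEMO_MAX + 1];	/* not tb[DEMO_MAX] */

	demo_parse(tb, DEMO_MAX, attrs, n);
	return tb[DEMO_B] ? tb[DEMO_B]->value : -1;
}
```

Declaring the table as tb[DEMO_MAX] would let the memset run one pointer past the end of the array, which matches the shape of the report below: a 32-byte write into a 24-byte stack object.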

syzbot reported:

BUG: KASAN: stack-out-of-bounds in __nla_validate_parse+0x136/0x2bd0 lib/nlattr.c:588
Write of size 32 at addr ffffc90003a0ee20 by task syz-executor296/5014

CPU: 0 PID: 5014 Comm: syz-executor296 Not tainted 6.5.0-rc2-syzkaller-00307-gd192f5382581 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x163/0x540 mm/kasan/report.c:475
kasan_report+0x175/0x1b0 mm/kasan/report.c:588
kasan_check_range+0x27e/0x290 mm/kasan/generic.c:187
__asan_memset+0x23/0x40 mm/kasan/shadow.c:84
__nla_validate_parse+0x136/0x2bd0 lib/nlattr.c:588
__nla_parse+0x40/0x50 lib/nlattr.c:700
nla_parse_nested include/net/netlink.h:1262 [inline]
fl_set_key_cfm+0x1e3/0x440 net/sched/cls_flower.c:1718
fl_set_key+0x2168/0x6620 net/sched/cls_flower.c:1884
fl_tmplt_create+0x1fe/0x510 net/sched/cls_flower.c:2666
tc_chain_tmplt_add net/sched/cls_api.c:2959 [inline]
tc_ctl_chain+0x131d/0x1ac0 net/sched/cls_api.c:3068
rtnetlink_rcv_msg+0x82b/0xf50 net/core/rtnetlink.c:6424
netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2549
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x7c3/0x990 net/netlink/af_netlink.c:1365
netlink_sendmsg+0xa2a/0xd60 net/netlink/af_netlink.c:1914
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x592/0x890 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2577
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f54c6150759
Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffe06c30578 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f54c619902d RCX: 00007f54c6150759
RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
RBP: 00007ffe06c30590 R08: 0000000000000000 R09: 00007ffe06c305f0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f54c61c35f0
R13: 00007ffe06c30778 R14: 0000000000000001 R15: 0000000000000001
</TASK>

The buggy address belongs to stack of task syz-executor296/5014
and is located at offset 32 in frame:
fl_set_key_cfm+0x0/0x440 net/sched/cls_flower.c:374

This frame has 1 object:
[32, 56) 'nla_cfm_opt'

The buggy address belongs to the virtual mapping at
[ffffc90003a08000, ffffc90003a11000) created by:
copy_process+0x5c8/0x4290 kernel/fork.c:2330

Fixes: 7cfffd5fed ("net: flower: add support for matching cfm fields")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Zahari Doychev <zdoychev@maxlinear.com>
Link: https://lore.kernel.org/r/20230726145815.943910-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 20:01:29 -07:00
Hannes Reinecke 662fbcec32 net/tls: implement ->read_sock()
Implement ->read_sock() function for use with nvme-tcp.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Cc: Boris Pismenny <boris.pismenny@gmail.com>
Link: https://lore.kernel.org/r/20230726191556.41714-7-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 19:49:35 -07:00
Hannes Reinecke f9ae3204fb net/tls: split tls_rx_reader_lock
Split tls_rx_reader_{lock,unlock} into an 'acquire/release' and
the actual locking part.
With that we can use the tls_rx_reader_lock in situations where
the socket is already locked.

Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-6-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 19:49:35 -07:00
Hannes Reinecke 11863c6d44 net/tls: Use tcp_read_sock() instead of ops->read_sock()
TLS resets the protocol operations, so the read_sock() callback might
be changed, too.
In this case using sock->ops->read_sock() in tls_strp_read_copyin() will
enter an infinite recursion if the read_sock() callback is calling
tls_rx_rec_wait() which will call into sock->ops->read_sock() via
tls_strp_read_copyin().
But as tls_strp_read_copyin() is supposed to produce data from the
consumed socket and that socket is always a TCP socket we can call
tcp_read_sock() directly without having to deal with callbacks.
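
The recursion can be sketched in userspace with a replaced ops table. The types and functions below (demo_sock, demo_ops, upper_read, copyin, transport_read) are hypothetical stand-ins, not the TLS code, and the depth guard exists only so that the sketch terminates instead of overflowing the stack:

```c
#include <stddef.h>

struct demo_sock;

struct demo_ops {
	int (*read_sock)(struct demo_sock *sk);
};

struct demo_sock {
	const struct demo_ops *ops;
	int call_transport_directly;	/* 1: bypass ops, as the patch does */
	int depth;			/* re-entry counter used by the guard */
};

/* Stand-in for the underlying transport reader (tcp_read_sock() in the
 * commit message); it never goes through the ops table. */
static int transport_read(struct demo_sock *sk)
{
	(void)sk;
	return 1;
}

static int upper_read(struct demo_sock *sk);

/* Ops table after the upper layer has replaced the callbacks. */
static const struct demo_ops upper_ops = { .read_sock = upper_read };

/* Stand-in for the copy-in step that pulls data from the socket. */
static int copyin(struct demo_sock *sk)
{
	if (sk->call_transport_directly)
		return transport_read(sk);
	return sk->ops->read_sock(sk);	/* re-enters upper_read() */
}

/* Upper-layer read_sock(): needs data from below, so it calls copyin(). */
static int upper_read(struct demo_sock *sk)
{
	int ret;

	if (++sk->depth > 8) {		/* guard: report runaway re-entry */
		sk->depth--;
		return -1;
	}
	ret = copyin(sk);
	sk->depth--;
	return ret;
}
```

With call_transport_directly left at 0, upper_read() keeps re-entering itself until the guard trips and returns -1; with it set to 1, the call completes in a single step and returns 1.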

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-5-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 19:49:35 -07:00
Hannes Reinecke c004b0e00c net/tls: handle MSG_EOR for tls_device TX flow
tls_push_data() already handles MSG_MORE, but bails out on MSG_EOR.
Seeing that MSG_EOR is basically the opposite of MSG_MORE
this patch adds handling MSG_EOR by treating it as the
absence of MSG_MORE.
Consequently we should return an error when both are set.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-3-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 19:49:34 -07:00
Hannes Reinecke e22e358bbe net/tls: handle MSG_EOR for tls_sw TX flow
tls_sw_sendmsg() already handles MSG_MORE, but bails
out on MSG_EOR.
Seeing that MSG_EOR is basically the opposite of
MSG_MORE this patch adds handling MSG_EOR by treating
it as the negation of MSG_MORE.
And erroring out if MSG_EOR is specified with MSG_MORE.
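
A minimal sketch of the flag handling described above, assuming a hypothetical helper demo_more_expected() and assuming -EINVAL as the error code (the message only says that an error is returned):

```c
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical helper, not the kernel code: returns 1 if more data is
 * expected for the current record, 0 if the record ends here, and
 * -EINVAL (an assumed error code) if the flags contradict each other. */
static int demo_more_expected(int flags)
{
	/* MSG_EOR and MSG_MORE are opposites, so both at once is an error. */
	if ((flags & MSG_MORE) && (flags & MSG_EOR))
		return -EINVAL;

	/* MSG_EOR alone behaves like the absence of MSG_MORE. */
	return (flags & MSG_MORE) ? 1 : 0;
}
```

Under these assumptions MSG_EOR on its own gives the same result as passing no flags at all, and only the combination with MSG_MORE is rejected.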

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-2-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 19:49:34 -07:00
Russell King (Oracle) 9945c1fb03 net: dsa: fix older DSA drivers using phylink
Older DSA drivers that do not provide a dsa_ops adjust_link method end
up using phylink. Unfortunately, a recent phylink change that requires
its supported_interfaces bitmap to be filled breaks these drivers
because the bitmap remains empty.

Rather than fixing each driver individually, fix it in the core code so
we have a sensible set of defaults.

Reported-by: Sergei Antonov <saproj@gmail.com>
Fixes: de5c9bf40c ("net: phylink: require supported_interfaces to be filled")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com> # dsa_loop
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1qOflM-001AEz-D3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 17:19:46 -07:00
YueHaibing d4a80cc69a dccp: Remove unused declaration dccp_feat_initialise_sysctls()
This is never used, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230726143239.9904-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 17:16:26 -07:00
Lin Ma d73ef2d69c rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length
There are 9 ndo_bridge_setlink handlers in total in the current kernel,
which are 1) bnxt_bridge_setlink, 2) be_ndo_bridge_setlink 3)
i40e_ndo_bridge_setlink 4) ice_bridge_setlink 5)
ixgbe_ndo_bridge_setlink 6) mlx5e_bridge_setlink 7)
nfp_net_bridge_setlink 8) qeth_l2_bridge_setlink 9) br_setlink.

By investigating the code, we find that 1-7 parse and use nlattr
IFLA_BRIDGE_MODE but 3 and 4 forget to do the nla_len check. This can
lead to an out-of-attribute read and allow a malformed nlattr (e.g.,
length 0) to be viewed as a 2 byte integer.

To avoid such issues, including in any future ndo_bridge_setlink
handlers, this patch adds the nla_len check in rtnl_bridge_setlink and
returns an error early if the length mismatches. To make this work, the
break is removed from the parsing for IFLA_BRIDGE_FLAGS so that
nla_for_each_nested iterates over every attribute.
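
The idea of checking the payload length before reading a fixed-size value can be sketched in userspace as follows. The TLV layout and all names (demo_tlv_hdr, DEMO_TYPE_MODE, demo_get_mode) are hypothetical simplifications, not the netlink structures or the rtnetlink code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical TLV header; len counts the header plus the payload. */
struct demo_tlv_hdr {
	uint16_t len;
	uint16_t type;
};

#define DEMO_HDRLEN	sizeof(struct demo_tlv_hdr)
#define DEMO_TYPE_MODE	1

/* Walks every attribute in buf. Returns 0 and stores the 2-byte mode on
 * success, or -1 if an attribute is malformed or no mode was found. */
static int demo_get_mode(const unsigned char *buf, size_t buflen,
			 uint16_t *mode)
{
	size_t off = 0;
	int found = 0;

	while (off + DEMO_HDRLEN <= buflen) {
		struct demo_tlv_hdr h;

		memcpy(&h, buf + off, DEMO_HDRLEN);
		if (h.len < DEMO_HDRLEN || h.len > buflen - off)
			return -1;
		if (h.type == DEMO_TYPE_MODE) {
			/* Length check first: a zero-length payload must
			 * never be read as a 2-byte integer. */
			if (h.len - DEMO_HDRLEN != sizeof(uint16_t))
				return -1;
			memcpy(mode, buf + off + DEMO_HDRLEN, sizeof(*mode));
			found = 1;
		}
		/* No break here, so every attribute gets validated. */
		off += ((size_t)h.len + 3u) & ~(size_t)3u;	/* 4-byte alignment */
	}
	return found ? 0 : -1;
}
```

A header-only attribute (len equal to the header size) is rejected before its payload is touched, which is the case the message describes as a malformed nlattr of length 0.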

Fixes: b1edc14a3f ("ice: Implement ice_bridge_getlink and ice_bridge_setlink")
Fixes: 51616018dd ("i40e: Add support for getlink, setlink ndo ops")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230726075314.1059224-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 17:14:01 -07:00
YueHaibing 4d66f235c7 bridge: Remove unused declaration br_multicast_set_hash_max()
Since commit 19e3a9c90c ("net: bridge: convert multicast to generic rhashtable")
this is not used, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20230726143141.11704-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 17:11:29 -07:00
Patrick Rohr ef27ba5c84 net: remove comment in ndisc_router_discovery
Removes superfluous (and misplaced) comment from ndisc_router_discovery.

Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230726184742.342825-1-prohr@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 16:56:23 -07:00
Jakub Kicinski 014acf2668 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 15:22:46 -07:00
Lin Ma bcc29b7f5a bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
The nla_for_each_nested parsing in function bpf_sk_storage_diag_alloc
does not check the length of the nested attribute. This can lead to an
out-of-attribute read and allow a malformed nlattr (e.g., length 0) to
be viewed as a 4 byte integer.

This patch adds an additional check when the nlattr is getting counted.
This makes sure the later nla_get_u32 can access the attributes with
the correct length.

Fixes: 1ed4d92458 ("bpf: INET_DIAG support in bpf_sk_storage")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230725023330.422856-1-linma@zju.edu.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-27 10:07:56 -07:00
Arseniy Krasnov a75f501de8 virtio/vsock: support MSG_PEEK for SOCK_SEQPACKET
This adds support for the MSG_PEEK flag on SOCK_SEQPACKET sockets.
The difference from SOCK_STREAM is that this callback returns either the
length of the message or an error.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-27 15:51:48 +02:00