OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
David S. Miller	5fc43ce03b	ipsec-2023-08-15 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmTbRoQACgkQrB3Eaf9P W7eY+A/8DJtqwFs1uAahS9jCX1bxf3vKKUPkEKu3IfcZVv2WcMVDI7XRLnBb93PC GR1RQskCErXMrVv7mmaBuk/uZAcAFQUkzna3MmyAw4lFfJSWORD6rzGiFeDVMsvx 7gczhYC6aPwhiyAqk6eoNTKxLaZ0zfGiW9ZKWdZXuTjp+ijksa56gEdKsPwMQIht FE4+CHia0dxFK0bUZMLHc4ixQbqKkHj/qVxB8k8zQnDgmCavjlEAnc+PAOX+SNxm uju4gDV/9qXYOkHTwRD9/aPcvCofTlD9XynSHkMC24yLS6Ir4A1mFUZywNSiwcgX //WxymD1N93inuHGzVluhm6Jy+4hTaS5p1y+H86L2TfC9b5SOrNYtj3yLB3aqDgq 1+4t4cVAtpk7uLfPYKzreDJH+CoxQDC8x+0dlzQUGnV11eIJ2RA0brJhFqHjOlbD SAQtBwkPqlAXnrdDr2pUhyrlrwAGXux8T5u5tF3NSS3FEwh7akRBfU2HV6vPEE80 qPIxHSbA9d0j+tOjbkHIYEv9fMHHFC/aFLZMYOew016TKGBJth4g+DJqJiEEDZZh iEIC62lrMV2qsyW5PdYdxGesZaAC/4koFCTBgkBYyIC/4gm5E74Ygu5B2A3xx89p H6MlF3Miofsf9aSh1vw4cyb69mPaP1XG95OGTqc0qpV2XhhjpE0= =UVTC -----END PGP SIGNATURE----- Merge tag 'ipsec-2023-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix a slab-out-of-bounds read in xfrm_address_filter. From Lin Ma. 2) Fix the pfkey sadb_x_filter validation. From Lin Ma. 3) Use the correct nla_policy structure for XFRMA_SEC_CTX. From Lin Ma. 4) Fix warnings triggerable by bad packets in the encap functions. From Herbert Xu. 5) Fix some slab-use-after-free in decode_session6. From Zhengchao Shao. 6) Fix a possible NULL piointer dereference in xfrm_update_ae_params. Lin Ma. 7) Add a forgotten nla_policy for XFRMA_MTIMER_THRESH. From Lin Ma. 8) Don't leak offloaded policies. From Leon Romanovsky. 9) Delete also the offloading part of an acquire state. From Leon Romanovsky. Please pull or let me know if there are problems.	2023-08-16 08:57:41 +01:00
Jakub Kicinski	956db0a13b	net: warn about attempts to register negative ifindex Since the xarray changes we mix returning valid ifindex and negative errno in a single int returned from dev_index_reserve(). This depends on the fact that ifindexes can't be negative. Otherwise we may insert into the xarray and return a very large negative value. This in turn may break ERR_PTR(). OvS is susceptible to this problem and lacking validation (fix posted separately for net). Reject negative ifindex explicitly. Add a warning because the input validation is better handled by the caller. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 19:18:34 -07:00
Jakub Kicinski	a552bfa16b	net: openvswitch: reject negative ifindex Recent changes in net-next (commit `759ab1edb5` ("net: store netdevs in an xarray")) refactored the handling of pre-assigned ifindexes and let syzbot surface a latent problem in ovs. ovs does not validate ifindex, making it possible to create netdev ports with negative ifindex values. It's easy to repro with YNL: $ ./cli.py --spec netlink/specs/ovs_datapath.yaml \ --do new \ --json '{"upcall-pid": 1, "name":"my-dp"}' $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' $ ip link show -65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff ... Validate the inputs. Now the second command correctly returns: $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' lib.ynl.NlError: Netlink error: Numerical result out of range nl_len = 108 (92) nl_flags = 0x300 nl_type = 2 error: -34 extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'} Accept 0 since it used to be silently ignored. Fixes: `54c4ef34c4` ("openvswitch: allow specifying ifindex of new interfaces") Reported-by: syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 19:07:52 -07:00
Ido Schimmel	db1428f66a	nexthop: Do not increment dump sentinel at the end of the dump The nexthop and nexthop bucket dump callbacks previously returned a positive return code even when the dump was complete, prompting the core netlink code to invoke the callback again, until returning zero. Zero was only returned by these callbacks when no information was filled in the provided skb, which was achieved by incrementing the dump sentinel at the end of the dump beyond the ID of the last nexthop. This is no longer necessary as when the dump is complete these callbacks return zero. Remove the unnecessary increment. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 18:54:53 -07:00
Ido Schimmel	23ab9324fd	nexthop: Simplify nexthop bucket dump Before commit `f10d3d9df4` ("nexthop: Make nexthop bucket dump more efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return code for each resilient nexthop group whose buckets it dumped, regardless if it encountered an error or not. This meant that the sentinel ('dd->ctx->nh.idx') used by the function that walked the different nexthops could not be used as a sentinel for the bucket dump, as otherwise buckets from the same group would be dumped over and over again. This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx') that was incremented by rtm_dump_nexthop_bucket_nh() after successfully dumping all the buckets from a given group. After the previously mentioned commit this sentinel is no longer necessary since the function no longer returns a non-zero return code when successfully dumping all the buckets from a given group. Remove this sentinel and simplify the code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 18:54:52 -07:00
Andrea Mayer	7458575a07	seg6: add NEXT-C-SID support for SRv6 End.X behavior The NEXT-C-SID mechanism described in [1] offers the possibility of encoding several SRv6 segments within a single 128 bit SID address. Such a SID address is called a Compressed SID (C-SID) container. In this way, the length of the SID List can be drastically reduced. A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address logically structured in three main blocks: i) Locator-Block; ii) Locator-Node Function; iii) Argument. C-SID container +------------------------------------------------------------------+ \| Locator-Block \|Loc-Node\| Argument \| \| \|Function\| \| +------------------------------------------------------------------+ <--------- B -----------> <- NF -> <------------- A ---------------> (i) The Locator-Block can be any IPv6 prefix available to the provider; (ii) The Locator-Node Function represents the node and the function to be triggered when a packet is received on the node; (iii) The Argument carries the remaining C-SIDs in the current C-SID container. This patch leverages the NEXT-C-SID mechanism previously introduced in the Linux SRv6 subsystem [2] to support SID compression capabilities in the SRv6 End.X behavior [3]. An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior but it is capable of processing the compressed SID List encoded in C-SID containers. An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support user-provided Locator-Block and Locator-Node Function lengths. In this implementation, such lengths must be evenly divisible by 8 (i.e. must be byte-aligned), otherwise the kernel informs the user about invalid values with a meaningful error code and message through netlink_ext_ack. If Locator-Block and/or Locator-Node Function lengths are not provided by the user during configuration of an SRv6 End.X behavior instance with NEXT-C-SID flavor, the kernel will choose their default values i.e., 32-bit Locator-Block and 16-bit Locator-Node Function. [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression [2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/ [3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 18:51:47 -07:00
Joel Granados	c899710fe7	networking: Update to register_net_sysctl_sz Move from register_net_sysctl to register_net_sysctl_sz for all the networking related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. An additional size function was added to the following files in order to calculate the size of an array that is defined in another file: include/net/ipv6.h net/ipv6/icmp.c net/ipv6/route.c net/ipv6/sysctl_net_ipv6.c Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:18 -07:00
Joel Granados	385a5dc9e5	netfilter: Update to register_net_sysctl_sz Move from register_net_sysctl to register_net_sysctl_sz for all the netfilter related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	7737e46d9d	ax.25: Update to register_net_sysctl_sz Move from register_net_sysctl to register_net_sysctl_sz and pass the ARRAY_SIZE of the ctl_table array that was used to create the table variable. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	95d4977876	sysctl: Add size to register_net_sysctl function This commit adds size to the register_net_sysctl indirection function to facilitate the removal of the sentinel elements (last empty markers) from the ctl_table arrays. Though we don't actually remove any sentinels in this commit, register_net_sysctl* now has the capability of forwarding table_size for when that happens. We create a new function register_net_sysctl_sz with an extra size argument. A macro replaces the existing register_net_sysctl. The size in the macro is SIZE_MAX instead of ARRAY_SIZE to avoid compilation errors while we systematically migrate to register_net_sysctl_sz. Will change to ARRAY_SIZE in subsequent commits. Care is taken to add table_size to the stopping criteria in such a way that when we remove the empty sentinel element, it will continue stopping in the last element of the ctl_table array. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	9edbfe92a0	sysctl: Add size to register_sysctl This commit adds table_size to register_sysctl in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do not remove any sentinels in this commit, we set things up by either passing the table_size explicitly or using ARRAY_SIZE on the ctl_table arrays. We replace the register_syctl function with a macro that will add the ARRAY_SIZE to the new register_sysctl_sz function. In this way the callers that are already using an array of ctl_table structs do not change. For the callers that pass a ctl_table array pointer, we pass the table_size to register_sysctl_sz instead of the macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	bff97cf11b	sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we do make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size does however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Pablo Neira Ayuso	23185c6aed	netfilter: nft_dynset: disallow object maps Do not allow to insert elements from datapath to objects maps. Fixes: `8aeff920dc` ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Pablo Neira Ayuso	02c6c24402	netfilter: nf_tables: GC transaction race with netns dismantle Use maybe_get_net() since GC workqueue might race with netns exit path. Fixes: `5f68718b34` ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Pablo Neira Ayuso	6a33d8b73d	netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path Netlink event path is missing a synchronization point with GC transactions. Add GC sequence number update to netns release path and netlink event path, any GC transaction losing race will be discarded. Fixes: `5f68718b34` ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Sishuai Gong	5310760af1	ipvs: fix racy memcpy in proc_do_sync_threshold When two threads run proc_do_sync_threshold() in parallel, data races could happen between the two memcpy(): Thread-1 Thread-2 memcpy(val, valp, sizeof(val)); memcpy(valp, val, sizeof(val)); This race might mess up the (struct ctl_table *) table->data, so we add a mutex lock to serialize them. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/ Signed-off-by: Sishuai Gong <sishuai.system@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Xin Long	9bfab6d23a	netfilter: set default timeout to 3 secs for sctp shutdown send and recv state In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300 msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state. As Paolo Valerio noticed, this might cause unwanted expiration of the ct entry. In my test, with 1s tc netem delay set on the NAT path, after the SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is sent back from the peer, the sctp ct entry has expired and been deleted, and then the SHUTDOWN_ACK has to be dropped. Also, it is confusing these two sysctl options always show 0 due to all timeout values using sec as unit: net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0 net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0 This patch fixes it by also using 3 secs for sctp shutdown send and recv state in sctp conntrack, which is also RTO.initial value in SCTP protocol. Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV was probably used for a rare scenario where SHUTDOWN is sent on 1st path but SHUTDOWN_ACK is replied on 2nd path, then a new connection started immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV to CLOSE when receiving INIT in the ORIGINAL direction. Fixes: `9fb9cbb108` ("[NETFILTER]: Add nf_conntrack subsystem.") Reported-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Florian Westphal	7845914f45	netfilter: nf_tables: don't fail inserts if duplicate has expired nftables selftests fail: run-tests.sh testcases/sets/0044interval_overlap_0 Expected: 0-2 . 0-3, got: W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1 Insertion must ignore duplicate but expired entries. Moreover, there is a strange asymmetry in nft_pipapo_activate: It refetches the current element, whereas the other ->activate callbacks (bitmap, hash, rhash, rbtree) use elem->priv. Same for .remove: other set implementations take elem->priv, nft_pipapo_remove fetches elem->priv, then does a relookup, remove this. I suspect this was the reason for the change that prompted the removal of the expired check in pipapo_get() in the first place, but skipping exired elements there makes no sense to me, this helper is used for normal get requests, insertions (duplicate check) and deactivate callback. In first two cases expired elements must be skipped. For ->deactivate(), this gets called for DELSETELEM, so it seems to me that expired elements should be skipped as well, i.e. delete request should fail with -ENOENT error. Fixes: `24138933b9` ("netfilter: nf_tables: don't skip expired elements during walk") Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Florian Westphal	90e5b3462e	netfilter: nf_tables: deactivate catchall elements in next generation When flushing, individual set elements are disabled in the next generation via the ->flush callback. Catchall elements are not disabled. This is incorrect and may lead to double-deactivations of catchall elements which then results in memory leaks: WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730 CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60 RIP: 0010:nft_map_deactivate+0x549/0x730 [..] ? nft_map_deactivate+0x549/0x730 nf_tables_delset+0xb66/0xeb0 (the warn is due to nft_use_dec() detecting underflow). Fixes: `aaa31047a6` ("netfilter: nftables: add catch-all set element support") Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:15 +02:00
Florian Westphal	08713cb006	netfilter: nf_tables: fix kdoc warnings after gc rework Jakub Kicinski says: We've got some new kdoc warnings here: net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc' net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc' include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set' Fixes: `5f68718b34` ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: `f6c383b8c3` ("netfilter: nf_tables: adapt set backend to use GC transaction API") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:14 +02:00
Florian Westphal	b9f052dc68	netfilter: nf_tables: fix false-positive lockdep splat ->abort invocation may cause splat on debug kernels: WARNING: suspicious RCU usage net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage! [..] rcu_scheduler_active = 2, debug_locks = 1 1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid [..] lockdep_rcu_suspicious+0x1ad/0x260 nft_pipapo_abort+0x145/0x180 __nf_tables_abort+0x5359/0x63d0 nf_tables_abort+0x24/0x40 nfnetlink_rcv+0x1a0a/0x22c0 netlink_unicast+0x73c/0x900 netlink_sendmsg+0x7f0/0xc20 ____sys_sendmsg+0x48d/0x760 Transaction mutex is held, so parallel updates are not possible. Switch to _protected and check mutex is held for lockdep enabled builds. Fixes: `212ed75dc5` ("netfilter: nf_tables: integrate pipapo into commit protocol") Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-16 00:05:14 +02:00
Jakub Kicinski	f946270d05	ethtool: netlink: always pass genl_info to .prepare_data We had a number of bugs in the past because developers forgot to fully test dumps, which pass NULL as info to .prepare_data. .prepare_data implementations would try to access info->extack leading to a null-deref. Now that dumps and notifications can access struct genl_info we can pass it in, and remove the info null checks. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 15:01:03 -07:00
Jakub Kicinski	ec0e5b09b8	ethtool: netlink: simplify arguments to ethnl_default_parse() Pass struct genl_info directly instead of its members. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 15:01:03 -07:00
Jakub Kicinski	0e19d3108a	netdev-genl: use struct genl_info for reply construction Use the just added APIs to make the code simpler. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 15:01:03 -07:00
Jakub Kicinski	5c670a010d	genetlink: add a family pointer to struct genl_info Having family in struct genl_info is quite useful. It cuts down the number of arguments which need to be passed to helpers which already take struct genl_info. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 15:01:03 -07:00
Jakub Kicinski	7288dd2fd4	genetlink: use attrs from struct genl_info Since dumps carry struct genl_info now, use the attrs pointer from genl_info and remove the one in struct genl_dumpit_info. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 15:00:45 -07:00
Jakub Kicinski	9272af109f	genetlink: add struct genl_info to struct genl_dumpit_info Netlink GET implementations must currently juggle struct genl_info and struct netlink_callback, depending on whether they were called from doit or dumpit. Add genl_info to the dump state and populate the fields. This way implementations can simply pass struct genl_info around. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 14:54:44 -07:00
Jakub Kicinski	bffcc6882a	genetlink: remove userhdr from struct genl_info Only three families use info->userhdr today and going forward we discourage using fixed headers in new families. So having the pointer to user header in struct genl_info is an overkill. Compute the header pointer at runtime. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 14:54:44 -07:00
Jakub Kicinski	fde9bd4a4d	genetlink: make genl_info->nlhdr const struct netlink_callback has a const nlh pointer, make the pointer in struct genl_info const as well, to make copying between the two easier. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 14:54:44 -07:00
Jakub Kicinski	84817d8c60	genetlink: push conditional locking into dumpit/done Add helpers which take/release the genl mutex based on family->parallel_ops. Remove the separation between handling of ops in locked and parallel families. Future patches would make the duplicated code grow even more. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-15 14:54:44 -07:00
Jason Xing	e4dd0d3a2f	net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear option enabled. The amount of lost-retransmitted skbs can go up to 1000+ instantly. The root cause is that if the icsk_rto happens to be zero in the 6th round (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero due to the changed calculation method in tcp_retransmit_timer() as follows: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); Above line could be converted to icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0 Therefore, the timer expires so quickly without any doubt. I read through the RFC 6298 and found that the RTO value can be rounded up to a certain value, in Linux, say TCP_RTO_MIN as default, which is regarded as the lower bound in this patch as suggested by Eric. Fixes: `36e31b0af5` ("net: TCP thin linear timeouts") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-15 20:24:04 +01:00
Jeff Layton	c96e2a695e	sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsg svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages, but commit `5df5dd03a8` ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") dropped that part of the handling. Fix it by setting the bv_offset of the first bvec. Fixes: `5df5dd03a8` ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-14 15:02:25 -04:00
Jiri Pirko	0149bca172	netlink: specs: devlink: extend health reporter dump attributes by port index Allow user to pass port index for health reporter dump request. Re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	b03f13cb67	devlink: extend health reporter dump selector by port index Introduce a possibility for devlink object to expose attributes it supports for selection of dumped objects. Use this by health reporter to indicate it supports port index based selection of dump objects. Implement this selection mechanism in devlink_nl_cmd_health_reporter_get_dump_one() Example: $ devlink health pci/0000:08:00.0: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32769: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32770: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98304: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98305: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98306: reporter vnic state healthy error 0 recover 0 $ devlink health show pci/0000:08:00.0 pci/0000:08:00.0: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32769: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32770: reporter vnic state healthy error 0 recover 0 $ devlink health show pci/0000:08:00.0/32768 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 The last command is possible because of this patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	34493336e7	netlink: specs: devlink: extend per-instance dump commands to accept instance attributes Extend per-instance dump command definitions to accept instance attributes. Allow parsing of devlink handle attributes so they could be used for instance selection. Re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	4a1b5aa8b5	devlink: allow user to narrow per-instance dumps by passing handle attrs For SFs, one devlink instance per SF is created. There might be thousands of these on a single host. When a user needs to know port handle for specific SF, he needs to dump all devlink ports on the host which does not scale good. Allow user to pass devlink handle attributes alongside the dump command and dump only objects which are under selected devlink instance. Example: $ devlink port show auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false $ devlink port show auxiliary/mlx5_core.eth.0 auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false $ devlink port show auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	833e479d33	devlink: remove converted commands from small ops As the commands are already defined in split ops, remove them from small ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	ddff283280	devlink: remove duplicate temporary netlink callback prototypes Remove the duplicate temporary netlink callback prototype as the generated ones are already in place. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	7199c86247	netlink: specs: devlink: add commands that do per-instance dump Add the definitions for the commands that do per-instance dump and re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	7d3c6fec61	devlink: pass flags as an arg of dump_one() callback In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the flags as an arg of dump_one() callback. Currently, it is always NLM_F_MULTI. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:25 -07:00
Jiri Pirko	24c8e56d4f	devlink: introduce dumpit callbacks for split ops Introduce dumpit callbacks for generated split ops. Have them as a thin wrapper around iteration function and allow to pass dump_one() function pointer directly without need to store in devlink_cmd structs. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:24 -07:00
Jiri Pirko	8fa995ad1f	devlink: rename doit callbacks for per-instance dump commands Rename netlink doit callback functions for the commands that do implement per-instance dump to match the generated names that are going to be introduce in the follow-up patch. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:24 -07:00
Jiri Pirko	ee6d78ac28	devlink: introduce devlink_nl_pre_doit_port*() helper functions Define port handling helpers what don't rely on internal_flags. Have __devlink_nl_pre_doit() to accept the flags as a function arg and make devlink_nl_pre_doit() a wrapper helper function calling it. Introduce new helpers devlink_nl_pre_doit_port() and devlink_nl_pre_doit_port_optional() to be used by split ops in follow-up patch. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:24 -07:00
Jiri Pirko	41a1d4d139	devlink: parse rate attrs in doit() callbacks No need to give the rate any special treatment in netlink attributes parsing, as unlike for ports, there is only a couple of commands benefiting from that. Remove DEVLINK_NL_FLAG_NEED_RATE, make pre_doit() callback simpler by moving the rate attributes parsing to rate__doit() ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:24 -07:00
Jiri Pirko	63618463cb	devlink: parse linecard attr in doit() callbacks No need to give the linecards any special treatment in netlink attribute parsing, as unlike for ports, there is only a couple of commands benefiting from that. Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler by moving the linecard attribute parsing to linecard_[gs]et_doit() ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-14 11:47:24 -07:00
Sven Eckelmann	6f96d46f9a	batman-adv: Drop per algo GW section class code This code was only used in the past for the sysfs interface. But since this was replace with netlink, it was never executed. The function pointer was only checked to figure out whether the limit 255 (B.A.T.M.A.N. IV) or 2**32-1 (B.A.T.M.A.N. V) should be used as limit. So instead of keeping the function pointer, just store the limits directly in struct batadv_algo_gw_ops. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-14 18:01:21 +02:00
Sven Eckelmann	02e61f06a9	batman-adv: Keep batadv_netlink_notify_* static The batadv_netlink_notify_*() functions are not used by any other source file. Just keep them local to netlink.c to get informed by the compiler when they are not used anymore. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-14 18:01:21 +02:00
Sven Eckelmann	950c92bbaa	batman-adv: Drop unused function batadv_gw_bandwidth_set This function is no longer used since the sysfs support was removed from batman-adv. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-14 18:01:21 +02:00
Vlad Buslov	ace0ab3a4b	Revert "vlan: Fix VLAN 0 memory leak" This reverts commit `718cb09aaa`. The commit triggers multiple syzbot issues, probably due to possibility of manually creating VLAN 0 on netdevice which will cause the code to delete it since it can't distinguish such VLAN from implicit VLAN 0 automatically created for devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. Reported-by: syzbot+662f783a5cdf3add2719@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000090196d0602a6167d@google.com/ Reported-by: syzbot+4b4f06495414e92701d5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000096ae870602a61602@google.com/ Reported-by: syzbot+d810d3cd45ed1848c3f7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000009f0f9c0602a616ce@google.com/ Fixes: `718cb09aaa` ("vlan: Fix VLAN 0 memory leak") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:14:00 +01:00
Adrian Moreno	43d95b30cf	net: openvswitch: add misc error drop reasons Use drop reasons from include/net/dropreason-core.h when a reasonable candidate exists. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:01:06 +01:00
Adrian Moreno	f329d1bc1a	net: openvswitch: add meter drop reason By using an independent drop reason it makes it easy to distinguish between QoS-triggered or flow-triggered drop. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:01:06 +01:00
Eric Garver	e7bc7db9ba	net: openvswitch: add explicit drop action From: Eric Garver <eric@garver.life> This adds an explicit drop action. This is used by OVS to drop packets for which it cannot determine what to do. An explicit action in the kernel allows passing the reason _why_ the packet is being dropped or zero to indicate no particular error happened (i.e: OVS intentionally dropped the packet). Since the error codes coming from userspace mean nothing for the kernel, we squash all of them into only two drop reasons: - OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed - OVS_DROP_EXPLICIT to indicate a zero value was passed (no error) e.g. trace all OVS dropped skbs # perf trace -e skb:kfree_skb --filter="reason >= 0x30000" [..] 106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \ location:0xffffffffc0d9b462, protocol: 2048, reason: 196611) reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT) Also, this patch allows ovs-dpctl.py to add explicit drop actions as: "drop" -> implicit empty-action drop "drop(0)" -> explicit non-error action drop "drop(42)" -> explicit error action drop Signed-off-by: Eric Garver <eric@garver.life> Co-developed-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:01:06 +01:00
Adrian Moreno	ec7bfb5e5a	net: openvswitch: add action error drop reason Add a drop reason for packets that are dropped because an action returns a non-zero error code. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:01:06 +01:00
Adrian Moreno	9d802da40b	net: openvswitch: add last-action drop reason Create a new drop reason subsystem for openvswitch and add the first drop reason to represent last-action drops. Last-action drops happen when a flow has an empty action list or there is no action that consumes the packet (output, userspace, recirc, etc). It is the most common way in which OVS drops packets. Implementation-wise, most of these skb-consuming actions already call "consume_skb" internally and return directly from within the do_execute_actions() loop so with minimal changes we can assume that any skb that exits the loop normally is a packet drop. Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:01:06 +01:00
Kuniyuki Iwashima	e263691773	mptcp: Remove unnecessary test for __mptcp_init_sock() __mptcp_init_sock() always returns 0 because mptcp_init_sock() used to return the value directly. But after commit `18b683bff8` ("mptcp: queue data for mptcp level retransmission"), __mptcp_init_sock() need not return value anymore. Let's remove the unnecessary test for __mptcp_init_sock() and make it return void. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	39880bd808	mptcp: get rid of msk->subflow Such field is now unused just as a flag to control the first subflow deletion at close() time. Introduce a new bit flag for that and finally drop the mentioned field. As an intended side effect, now the first subflow sock is not freed before close() even for passive sockets. The msk has no open/active subflows if the first one is closed and the subflow list is singular, update accordingly the state check in mptcp_stream_accept(). Among other benefits, the subflow removal, reduces the amount of memory used on the client side for each mptcp connection, allows passive sockets to go through successful accept()/disconnect()/connect() and makes return error code consistent for failing both passive and active sockets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	3f326a821b	mptcp: change the mpc check helper to return a sk After the previous patch the __mptcp_nmpc_socket helper is used only to ensure that the MPTCP socket is a suitable status - that is, the mptcp capable handshake is not started yet. Change the return value to the relevant subflow sock, to finally remove the last references to first subflow socket in the MPTCP stack. As a bonus, we can get rid of a few local variables in different functions. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	3aa3624941	mptcp: avoid ssock usage in mptcp_pm_nl_create_listen_socket() This is one of the few remaining spots actually manipulating the first subflow socket. We can leverage the recently introduced inet helpers to get rid of ssock there. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	f0bc514bd5	mptcp: avoid additional indirection in sockopt The mptcp sockopt infrastructure unneedly uses the first subflow socket struct in a few spots. We are going to remove such field soon, so use directly the first subflow sock instead. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	1f6610b92a	mptcp: avoid unneeded indirection in mptcp_stream_accept() We are going to remove the first subflow socket soon, so avoid the additional indirection at accept() time. Instead access directly the first subflow sock, and update mptcp_accept() to operate on it. This allows dropping a duplicated check in mptcp_accept(). No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	5426a4ef64	mptcp: avoid additional indirection in mptcp_poll() We are going to remove the first subflow socket soon, so avoid the additional indirection at poll() time. Instead access directly the first subflow sock. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:14 +01:00
Paolo Abeni	40f56d0c70	mptcp: avoid additional indirection in mptcp_listen() We are going to remove the first subflow socket soon, so avoid the additional indirection via at listen() time. Instead call directly the recently introduced helper on the first subflow sock. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
Paolo Abeni	71a9a874cd	net: factor out __inet_listen_sk() helper The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is inet_listen(). Factor out an helper operating directly on the (locked) struct sock, to allow get rid of the above dependency in the next patch without duplicating the existing code. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
Paolo Abeni	8cf2ebdc00	mptcp: mptcp: avoid additional indirection in mptcp_bind() We are going to remove the first subflow socket soon, so avoid the additional indirection via at bind() time. Instead call directly the recently introduced helpers on the first subflow sock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
Paolo Abeni	e6d360ff87	net: factor out inet{,6}_bind_sk helpers The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is bind(). Factor out the helpers operating directly on the struct sock, to allow get rid of the above dependency in the next patch without duplicating the existing code. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
Paolo Abeni	cfb63e50d3	mptcp: avoid subflow socket usage in mptcp_get_port() We are going to remove the first subflow socket soon, so avoid accessing it in mptcp_get_port(). Instead, access directly the first subflow sock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
Paolo Abeni	ccae357c1c	mptcp: avoid additional __inet_stream_connect() call The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is __inet_stream_connect(). We are going to remove the first subflow socket soon, so avoid the additional indirection via at connect time, calling directly into the sock-level connect() ops. The sk-level connect never return -EINPROGRESS, cleanup the error path accordingly. Additionally, the ssk status on error is always TCP_CLOSE. Avoid unneeded access to the subflow sk state. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
Paolo Abeni	131a627751	mptcp: avoid unneeded mptcp_token_destroy() calls The MPTCP protocol currently clears the msk token both at connect() and listen() time. That is needed to deal with failing connect() calls that can create a new token while leaving the sk in TCP_CLOSE,SS_UNCONNECTED status and thus allowing later connect() and/or listen() calls. Let's deal with such failures explicitly, cleaning the token in a timely manner and avoid the confusing early mptcp_token_destroy(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 07:06:13 +01:00
David S. Miller	3d3829363b	bluetooth-next pull request for net-next: - Add new VID/PID for Mediatek MT7922 - Add support multiple BIS/BIG - Add support for Intel Gale Peak - Add support for Qualcomm WCN3988 - Add support for BT_PKT_STATUS for ISO sockets - Various fixes for experimental ISO support - Load FW v2 for RTL8852C - Add support for NXP AW693 chipset - Add support for Mediatek MT2925 -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmTWigQZHGx1aXoudm9u LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKd8sD/92kBczbO3v+nSNyiYcbVmB x3Z7x1l2ExxHnPdW8xBmEzHlDErYB/KBKYdJWM8y6Bam5z1lnsX7LflXSy+bhZeX iOFYl94Gh/9/ooyYOwwYUKC2fLKWT54PLg1TcJzyfp8uUizQNWAg9QD7vjvxe7lN HXrW6CaA4Oohcq2YXagZV1h6Q/jl3BjcfEe7N0E6YYjeonplsJsv6rYG8Ku5n0Pi 9YhB5IkX5zszTGKBSSWURKvaJjbFd7pr3mYkgLZG2pIMGQcUAFJZ9kL7de9xeBWI TRfgehZZPB2bUac1LxGLcAfONTmzUmo3/trjL1opdxreVCAX565JlaVSJwd0zuQk cBrmtU3Q8peFSOgJRb1Ci5junE8tqjEWzFRIgw7/wL1Ys3mrbbVDDGKqPhwhvjdq grOBf6UGaDpEO797yWWpBl5DLV3klMQDi4v84J0yTdvf4GXF8t8fuZU+zIpknVou BwdeeF33yzqtk01BjomQcLVOrrGOP7+Salc5g7eEVU1jZnaw0MH9aH+o6R2JYtP8 uIiH4QOUJh7NA543F+/wPdZU+OV1E+Io+b34pTZ1oIyM2UT9Dy57Tex/DDKq2UCe 69WV6aVM+FTt2VSMUS2J0XrXkxbI4f6/ABOLht5hHKxT1m6LhOh8mCSTof+UENrr G0sVoCodRrSljSMS/VltTA== =akZ8 -----END PGP SIGNATURE----- Merge tag 'for-net-next-2023-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next bluetooth-next pull request for net-next: - Add new VID/PID for Mediatek MT7922 - Add support multiple BIS/BIG - Add support for Intel Gale Peak - Add support for Qualcomm WCN3988 - Add support for BT_PKT_STATUS for ISO sockets - Various fixes for experimental ISO support - Load FW v2 for RTL8852C - Add support for NXP AW693 chipset - Add support for Mediatek MT2925	2023-08-13 14:53:53 +01:00
Yue Haibing	2b8893b639	net/rds: Remove unused function declarations Commit `39de828179` ("RDS: Main header file") declared but never implemented rds_trans_init() and rds_trans_exit(), remove it. Commit `d37c935905` ("RDS: Move loop-only function to loop.c") removed the implementation rds_message_inc_free() but not the declaration. Since commit `55b7ed0b58` ("RDS: Common RDMA transport code") rds_rdma_conn_connect() is never implemented and used. rds_tcp_map_seq() is never implemented and used since commit `70041088e3` ("RDS: Add TCP transport to RDS"). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:25:42 +01:00
Eric Dumazet	8fe08d70a2	netlink: convert nlk->flags to atomic flags sk_diag_put_flags(), netlink_setsockopt(), netlink_getsockopt() and others use nlk->flags without correct locking. Use set_bit(), clear_bit(), test_bit(), assign_bit() to remove data-races. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:23:19 +01:00
Menglong Dong	031c44b752	net: tcp: refactor the dbg message in tcp_retransmit_timer() The debug message in tcp_retransmit_timer() is slightly wrong, because they could be printed even if we did not receive a new ACK packet from the remote peer. Change it to probing zero-window, as it is a expected case now. The description may be not correct. Adding the duration since the last ACK we received, and the duration of the retransmission, which are useful for debugging. And the message now like this: Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:21:38 +01:00
Menglong Dong	e89688e3e9	net: tcp: fix unexcepted socket die when snd_wnd is 0 In tcp_retransmit_timer(), a window shrunk connection will be regarded as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not right all the time. The retransmits will become zero-window probes in tcp_retransmit_timer() if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to TCP_RTO_MAX sooner or later. However, the timer can be delayed and be triggered after 122877ms, not TCP_RTO_MAX, as I tested. Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true once the RTO come up to TCP_RTO_MAX, and the socket will die. Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout', which is exact the timestamp of the timeout. However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp could already be a long time (minutes or hours) in the past even on the first RTO. So we double check the timeout with the duration of the retransmission. Meanwhile, making "2 * TCP_RTO_MAX" as the timeout to avoid the socket dying too soon. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/ Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:21:37 +01:00
Menglong Dong	800a666141	net: tcp: allow zero-window ACK update the window Fow now, an ACK can update the window in following case, according to the tcp_may_update_window(): 1. the ACK acknowledged new data 2. the ACK has new data 3. the ACK expand the window and the seq of it is valid Now, we allow the ACK update the window if the window is 0, and the seq/ack of it is valid. This is for the case that the receiver replies an zero-window ACK when it is under memory stress and can't queue the new data. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:21:37 +01:00
Menglong Dong	e2142825c1	net: tcp: send zero-window ACK when no memory For now, skb will be dropped when no memory, which makes client keep retrans util timeout and it's not friendly to the users. In this patch, we reply an ACK with zero-window in this case to update the snd_wnd of the sender to 0. Therefore, the sender won't timeout the connection and will probe the zero-window with the retransmits. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-13 12:21:37 +01:00
Jiri Slaby (SUSE)	c70fd7c0e9	tty: rfcomm: convert counts to size_t Unify the type of tty_operations::write() counters with the 'count' parameter. I.e. use size_t for them. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Link: https://lore.kernel.org/r/20230810091510.13006-37-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:47 +02:00
Jiri Slaby (SUSE)	49b8220cee	tty: ldops: unify to u8 Some hooks in struct tty_ldisc_ops still reference buffers by 'unsigned char'. Unify to 'u8' as the rest of the tty layer does. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20230810091510.13006-32-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:47 +02:00
Jiri Slaby (SUSE)	95713967ba	tty: make tty_operations::write()'s count size_t Unify with the rest of the code. Use size_t for counts and ssize_t for retval. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Link: https://lore.kernel.org/r/20230810091510.13006-30-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:46 +02:00
Jiri Slaby (SUSE)	69851e4ab8	tty: propagate u8 data to tty_operations::write() Data are now typed as u8. Propagate this change to tty_operations::write(). Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Vaibhav Gupta <vaibhavgupta40@gmail.com> Cc: Jens Taprogge <jens.taprogge@taprogge.org> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Scott Branden <scott.branden@broadcom.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: David Lin <dtwlin@gmail.com> Cc: Johan Hovold <johan@kernel.org> Cc: Alex Elder <elder@kernel.org> Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Pengutronix Kernel Team <kernel@pengutronix.de> Cc: Fabio Estevam <festevam@gmail.com> Cc: NXP Linux Team <linux-imx@nxp.com> Cc: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Mathias Nyman <mathias.nyman@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Link: https://lore.kernel.org/r/20230810091510.13006-28-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:46 +02:00
Jiri Slaby (SUSE)	892bc209f2	tty: use u8 for flags This makes all those 'char's an explicit 'u8'. This is part of the continuing unification of chars and flags to be consistent u8. This approaches tty_port_default_receive_buf(). Note that we do not change signedness as we compile with -funsigned-char. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: William Hubbs <w.d.hubbs@gmail.com> Cc: Chris Brannon <chris@the-brannons.com> Cc: Kirk Reiser <kirk@reisers.ca> Cc: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Max Staudt <max@enpas.org> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com> Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de> Cc: Jeremy Kerr <jk@codeconstruct.com.au> Cc: Matt Johnston <matt@codeconstruct.com.au> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Acked-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230810091510.13006-18-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:45 +02:00
Jiri Slaby (SUSE)	e8161447bb	tty: make tty_ldisc_ops::buf() hooks operate on size_t Count passed to tty_ldisc_ops::receive_buf*(), ::lookahead_buf(), and returned from ::receive_buf2() is expected to be size_t. So set it to size_t to unify with the rest of the code. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: William Hubbs <w.d.hubbs@gmail.com> Cc: Chris Brannon <chris@the-brannons.com> Cc: Kirk Reiser <kirk@reisers.ca> Cc: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Max Staudt <max@enpas.org> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com> Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de> Cc: Jeremy Kerr <jk@codeconstruct.com.au> Cc: Matt Johnston <matt@codeconstruct.com.au> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Acked-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230810091510.13006-16-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:45 +02:00
Jiri Slaby (SUSE)	6e5710e71d	tty: remove dummy tty_ldisc_ops::poll() implementations tty_ldisc_ops::poll() is optional and needs not be provided. It is equal to returning 0. So remove all those from the code. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20230810091510.13006-4-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 21:12:44 +02:00
Pauli Virtanen	b5793de3cf	Bluetooth: hci_conn: avoid checking uninitialized CIG/CIS ids The CIS/CIG ids of ISO connections are defined only when the connection is unicast. Fix the lookup functions to check for unicast first. Ensure CIG/CIS IDs have valid value also in state BT_OPEN. Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:57:54 -07:00
Pauli Virtanen	66dee21524	Bluetooth: hci_event: drop only unbound CIS if Set CIG Parameters fails When user tries to connect a new CIS when its CIG is not configurable, that connection shall fail, but pre-existing connections shall not be affected. However, currently hci_cc_le_set_cig_params deletes all CIS of the CIG on error so it doesn't work, even though controller shall not change CIG/CIS configuration if the command fails. Fix by failing on command error only the connections that are not yet bound, so that we keep the previous CIS configuration like the controller does. Fixes: `26afbd826e` ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:57:33 -07:00
Ziyang Xuan	3cd43dd15f	Bluetooth: Remove unnecessary NULL check before vfree() Remove unnecessary NULL check which causes coccinelle warning: net/bluetooth/coredump.c:104:2-7: WARNING: NULL check before some freeing functions is not needed. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:56:54 -07:00
Manish Mandlik	a2bcd2b632	Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_add_adv_monitor() KSAN reports use-after-free in hci_add_adv_monitor(). While adding an adv monitor, hci_add_adv_monitor() calls -> msft_add_monitor_pattern() calls -> msft_add_monitor_sync() calls -> msft_le_monitor_advertisement_cb() calls in an error case -> hci_free_adv_monitor() which frees the *moniter. This is referenced by bt_dev_dbg() in hci_add_adv_monitor(). Fix the bt_dev_dbg() by using handle instead of monitor->handle. Fixes: `b747a83690` ("Bluetooth: hci_sync: Refactor add Adv Monitor") Signed-off-by: Manish Mandlik <mmandlik@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:56:35 -07:00
Min Li	3673952cf0	Bluetooth: Fix potential use-after-free when clear keys Similar to commit `c5d2b6fa26` ("Bluetooth: Fix use-after-free in hci_remove_ltk/hci_remove_irk"). We can not access k after kfree_rcu() call. Fixes: `d7d41682ef` ("Bluetooth: Fix Suspicious RCU usage warnings") Signed-off-by: Min Li <lm0963hack@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:56:16 -07:00
Luiz Augusto von Dentz	a1f6c3aef1	Bluetooth: hci_sync: Introduce PTR_UINT/UINT_PTR macros This introduces PTR_UINT/UINT_PTR macros and replace the use of PTR_ERR/ERR_PTR. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:55:48 -07:00
Luiz Augusto von Dentz	a091289218	Bluetooth: hci_conn: Fix hci_le_set_cig_params When running with concurrent task only one CIS was being assigned so this attempts to rework the way the PDU is constructed so it is handled later at the callback instead of in place. Fixes: `26afbd826e` ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:55:28 -07:00
Luiz Augusto von Dentz	f88670161e	Bluetooth: hci_core: Make hci_is_le_conn_scanning public This moves hci_is_le_conn_scanning to hci_core.h so it can be used by different files without having to duplicate its code. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:54:59 -07:00
Luiz Augusto von Dentz	f2f84a70f9	Bluetooth: hci_conn: Fix not allowing valid CIS ID Only the number of CIS shall be limited to 0x1f, the CIS ID in the other hand is up to 0xef. Fixes: `26afbd826e` ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:54:39 -07:00
Luiz Augusto von Dentz	16e3b64291	Bluetooth: hci_conn: Fix modifying handle while aborting This introduces hci_conn_set_handle which takes care of verifying the conditions where the hci_conn handle can be modified, including when hci_conn_abort has been called and also checks that the handles is valid as well. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:54:10 -07:00
Luiz Augusto von Dentz	b7f923b1ef	Bluetooth: ISO: Fix not checking for valid CIG/CIS IDs Valid range of CIG/CIS are 0x00 to 0xEF, so this checks they are properly checked before attempting to use HCI_OP_LE_SET_CIG_PARAMS. Fixes: `ccf74f2390` ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:53:50 -07:00
Luiz Augusto von Dentz	5af1f84ed1	Bluetooth: hci_sync: Fix UAF on hci_abort_conn_sync Connections may be cleanup while waiting for the commands to complete so this attempts to check if the connection handle remains valid in case of errors that would lead to call hci_conn_failed: BUG: KASAN: slab-use-after-free in hci_conn_failed+0x1f/0x160 Read of size 8 at addr ffff888001376958 by task kworker/u3:0/52 CPU: 0 PID: 52 Comm: kworker/u3:0 Not tainted 6.5.0-rc1-00527-g2dfe76d58d3a #5615 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 Workqueue: hci0 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x1d/0x70 print_report+0xce/0x620 ? __virt_addr_valid+0xd4/0x150 ? hci_conn_failed+0x1f/0x160 kasan_report+0xd1/0x100 ? hci_conn_failed+0x1f/0x160 hci_conn_failed+0x1f/0x160 hci_abort_conn_sync+0x237/0x360 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:53:30 -07:00
Luiz Augusto von Dentz	094e363962	Bluetooth: hci_sync: Fix handling of HCI_OP_CREATE_CONN_CANCEL When sending HCI_OP_CREATE_CONN_CANCEL it shall Wait for HCI_EV_CONN_COMPLETE, not HCI_EV_CMD_STATUS, when the reason is anything but HCI_ERROR_REMOTE_POWER_OFF. This reason is used when suspending or powering off, where we don't want to wait for the peer's response. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:53:10 -07:00
Pauli Virtanen	2889bdd0a9	Bluetooth: hci_sync: delete CIS in BT_OPEN/CONNECT/BOUND when aborting Dropped CIS that are in state BT_OPEN/BT_BOUND, and in state BT_CONNECT with HCI_CONN_CREATE_CIS unset, should be cleaned up immediately. Closing CIS ISO sockets should result to the hci_conn be deleted, so that potentially pending CIG removal can run. hci_abort_conn cannot refer to them by handle, since their handle is still unset if Set CIG Parameters has not yet completed. This fixes CIS not being terminated if the socket is shut down immediately after connection, so that the hci_abort_conn runs before Set CIG Parameters completes. See new BlueZ test "ISO Connect Close - Success" Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:52:51 -07:00
Pauli Virtanen	69997d50ec	Bluetooth: ISO: handle bound CIS cleanup via hci_conn Calling hci_conn_del in __iso_sock_close is invalid. It needs hdev->lock, but it cannot be acquired there due to lock ordering. Fix this by doing cleanup via hci_conn_drop. Return hci_conn with refcount 1 from hci_bind_cis and hci_connect_cis, so that the iso_conn always holds one reference. This also fixes refcounting when error handling. Since hci_conn_abort shall handle termination of connections in any state properly, we can handle BT_CONNECT socket state in the same way as BT_CONNECTED. Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:52:32 -07:00
Yue Haibing	90005880a6	Bluetooth: Remove unused declaration amp_read_loc_info() This is introduced in commit `903e454110` but was never implemented. Fixes: `903e454110` ("Bluetooth: AMP: Use HCI cmd to Read Loc AMP Assoc") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:52:13 -07:00
Luiz Augusto von Dentz	0731c5ab4d	Bluetooth: ISO: Add support for BT_PKT_STATUS This adds support for BT_PKT_STATUS socketopt by setting BT_SK_PKT_STATUS. Then upon receiving an ISO packet the code would attempt to store the Packet_Status_Flag to hci_skb_pkt_status which is then forward to userspace in the form of BT_SCM_PKT_STATUS whenever BT_PKT_STATUS has been enabled/set. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:49:45 -07:00
Luiz Augusto von Dentz	3f19ffb2f9	Bluetooth: af_bluetooth: Make BT_PKT_STATUS generic This makes the handling of BT_PKT_STATUS more generic so it can be reused by sockets other than SCO like BT_DEFER_SETUP, etc. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:49:16 -07:00
Ying Hsu	573ebae162	Bluetooth: Fix hci_suspend_sync crash If hci_unregister_dev() frees the hci_dev object but hci_suspend_notifier may still be accessing it, it can cause the program to crash. Here's the call trace: <4>[102152.653246] Call Trace: <4>[102152.653254] hci_suspend_sync+0x109/0x301 [bluetooth] <4>[102152.653259] hci_suspend_dev+0x78/0xcd [bluetooth] <4>[102152.653263] hci_suspend_notifier+0x42/0x7a [bluetooth] <4>[102152.653268] notifier_call_chain+0x43/0x6b <4>[102152.653271] __blocking_notifier_call_chain+0x48/0x69 <4>[102152.653273] __pm_notifier_call_chain+0x22/0x39 <4>[102152.653276] pm_suspend+0x287/0x57c <4>[102152.653278] state_store+0xae/0xe5 <4>[102152.653281] kernfs_fop_write+0x109/0x173 <4>[102152.653284] __vfs_write+0x16f/0x1a2 <4>[102152.653287] ? selinux_file_permission+0xca/0x16f <4>[102152.653289] ? security_file_permission+0x36/0x109 <4>[102152.653291] vfs_write+0x114/0x21d <4>[102152.653293] __x64_sys_write+0x7b/0xdb <4>[102152.653296] do_syscall_64+0x59/0x194 <4>[102152.653299] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 This patch holds the reference count of the hci_dev object while processing it in hci_suspend_notifier to avoid potential crash caused by the race condition. Signed-off-by: Ying Hsu <yinghsu@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:48:20 -07:00
Christophe JAILLET	82eae9dc43	Bluetooth: hci_debugfs: Use kstrtobool() instead of strtobool() strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:47:44 -07:00
Luiz Augusto von Dentz	112b5090c2	Bluetooth: MGMT: Fix always using HCI_MAX_AD_LENGTH HCI_MAX_AD_LENGTH shall only be used if the controller doesn't support extended advertising, otherwise HCI_MAX_EXT_AD_LENGTH shall be used instead. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:45:26 -07:00
Douglas Anderson	6f55eea116	Bluetooth: hci_sync: Don't double print name in add/remove adv_monitor The hci_add_adv_monitor() hci_remove_adv_monitor() functions call bt_dev_dbg() to print some debug statements. The bt_dev_dbg() macro automatically adds in the device's name. That means that we shouldn't include the name in the bt_dev_dbg() calls. Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:45:07 -07:00
Dan Carpenter	528b2acf43	Bluetooth: msft: Fix error code in msft_cancel_address_filter_sync() Return negative -EIO instead of positive EIO. Fixes: 926df8962f3f ("Bluetooth: msft: Extended monitor tracking by address filter") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:44:12 -07:00
Iulia Tanasescu	f777d88278	Bluetooth: ISO: Notify user space about failed bis connections Some use cases require the user to be informed if BIG synchronization fails. This commit makes it so that even if the BIG sync established event arrives with error status, a new hconn is added for each BIS, and the iso layer is notified about the failed connections. Unsuccesful bis connections will be marked using the HCI_CONN_BIG_SYNC_FAILED flag. From the iso layer, the POLLERR event is triggered on the newly allocated bis sockets, before adding them to the accept list of the parent socket. From user space, a new fd for each failed bis connection will be obtained by calling accept. The user should check for the POLLERR event on the new socket, to determine if the connection was successful or not. The HCI_CONN_BIG_SYNC flag has been added to mark whether the BIG sync has been successfully established. This flag is checked at bis cleanup, so the HCI LE BIG Terminate Sync command is only issued if needed. The BT_SK_BIG_SYNC flag indicates if BIG create sync has been called for a listening socket, to avoid issuing the command everytime a BIGInfo advertising report is received. Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:43:44 -07:00
Luiz Augusto von Dentz	9f78191cc9	Bluetooth: hci_conn: Always allocate unique handles This attempts to always allocate a unique handle for connections so they can be properly aborted by the likes of hci_abort_conn, so this uses the invalid range as a pool of unset handles that way if userspace is trying to create multiple connections at once each will be given a unique handle which will be considered unset. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:43:02 -07:00
Luiz Augusto von Dentz	04a51d6169	Bluetooth: hci_sync: Fix not handling ISO_LINK in hci_abort_conn_sync ISO_LINK connections where not being handled properly on hci_abort_conn_sync which sometimes resulted in sending the wrong commands, or in case of having the reject command being sent by the socket code (iso.c) which is sort of a layer violation. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:42:43 -07:00
Luiz Augusto von Dentz	a13f316e90	Bluetooth: hci_conn: Consolidate code for aborting connections This consolidates code for aborting connections using hci_cmd_sync_queue so it is synchronized with other threads, but because of the fact that some commands may block the cmd_sync_queue while waiting specific events this attempt to cancel those requests by using hci_cmd_sync_cancel. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:42:15 -07:00
Claudia Draghicescu	c33362a528	Bluetooth: hci_sync: Enable events for BIS capable devices In the case of a Synchronized Receiver capable device, enable at start-up the events for PA reports, PA Sync Established and Big Info Adv reports. Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:40:25 -07:00
Hilda Wu	9e14606d8f	Bluetooth: msft: Extended monitor tracking by address filter Since limited tracking device per condition, this feature is to support tracking multiple devices concurrently. When a pattern monitor detects the device, this feature issues an address monitor for tracking that device. Let pattern monitor can keep monitor new devices. This feature adds an address filter when receiving a LE monitor device event which monitor handle is for a pattern, and the controller started monitoring the device. And this feature also has cancelled the monitor advertisement from address filters when receiving a LE monitor device event when the controller stopped monitoring the device specified by an address and monitor handle. Below is an example to know the feature adds the address filter. //Add MSFT pattern monitor < HCI Command: Vendor (0x3f\|0x00f0) plen 14 #142 [hci0] 55.552420 03 b8 a4 03 ff 01 01 06 09 05 5f 52 45 46 .........._REF > HCI Event: Command Complete (0x0e) plen 6 #143 [hci0] 55.653960 Vendor (0x3f\|0x00f0) ncmd 2 Status: Success (0x00) 03 00 //Got event from the pattern monitor > HCI Event: Vendor (0xff) plen 18 #148 [hci0] 58.384953 23 79 54 33 77 88 97 68 02 00 fb c1 29 eb 27 b8 #yT3w..h....).'. 00 01 .. //Add MSFT address monitor (Sample address: B8:27:EB:29:C1:FB) < HCI Command: Vendor (0x3f\|0x00f0) plen 13 #149 [hci0] 58.385067 03 b8 a4 03 ff 04 00 fb c1 29 eb 27 b8 .........).'. //Report to userspace about found device (ADV Monitor Device Found) @ MGMT Event: Unknown (0x002f) plen 38 {0x0003} [hci0] 58.680042 01 00 fb c1 29 eb 27 b8 01 ce 00 00 00 00 16 00 ....).'......... 0a 09 4b 45 59 42 44 5f 52 45 46 02 01 06 03 19 ..KEYBD_REF..... c1 03 03 03 12 18 ...... //Got event from address monitor > HCI Event: Vendor (0xff) plen 18 #152 [hci0] 58.672956 23 79 54 33 77 88 97 68 02 00 fb c1 29 eb 27 b8 #yT3w..h....).'. 01 01 Signed-off-by: Alex Lu <alex_lu@realsil.com.cn> Signed-off-by: Hilda Wu <hildawu@realtek.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:39:58 -07:00
Iulia Tanasescu	6a42e9bfd1	Bluetooth: ISO: Support multiple BIGs This adds support for creating multiple BIGs. According to spec, each BIG shall have an unique handle, and each BIG should be associated with a different advertising handle. Otherwise, the LE Create BIG command will fail, with error code Command Disallowed (for reusing a BIG handle), or Unknown Advertising Identifier (for reusing an advertising handle). The btmon snippet below shows an exercise for creating two BIGs for the same controller, by opening two isotest instances with the following command: tools/isotest -i hci0 -s 00:00:00:00:00:00 < HCI Command: LE Create Broadcast Isochronous Group (0x08\|0x0068) plen 31 Handle: 0x00 Advertising Handle: 0x01 Number of BIS: 1 SDU Interval: 10000 us (0x002710) Maximum SDU size: 40 Maximum Latency: 10 ms (0x000a) RTN: 0x02 PHY: LE 2M (0x02) Packing: Sequential (0x00) Framing: Unframed (0x00) Encryption: 0x00 Broadcast Code: 00000000000000000000000000000000 > HCI Event: Command Status (0x0f) plen 4 LE Create Broadcast Isochronous Group (0x08\|0x0068) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 21 LE Broadcast Isochronous Group Complete (0x1b) Status: Success (0x00) Handle: 0x00 BIG Synchronization Delay: 912 us (0x000390) Transport Latency: 912 us (0x000390) PHY: LE 2M (0x02) NSE: 3 BN: 1 PTO: 1 IRC: 3 Maximum PDU: 40 ISO Interval: 10.00 msec (0x0008) Connection Handle #0: 10 < HCI Command: LE Create Broadcast Isochronous Group (0x08\|0x0068) Handle: 0x01 Advertising Handle: 0x02 Number of BIS: 1 SDU Interval: 10000 us (0x002710) Maximum SDU size: 40 Maximum Latency: 10 ms (0x000a) RTN: 0x02 PHY: LE 2M (0x02) Packing: Sequential (0x00) Framing: Unframed (0x00) Encryption: 0x00 Broadcast Code: 00000000000000000000000000000000 > HCI Event: Command Status (0x0f) plen 4 LE Create Broadcast Isochronous Group (0x08\|0x0068) ncmd 1 Status: Success (0x00) Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:38:12 -07:00
Luiz Augusto von Dentz	69ae506506	Bluetooth: hci_sock: Forward credentials to monitor This stores scm_creds into hci_skb_cb so they can be properly forwarded to the likes of btmon which is then able to print information about the process who is originating the traffic: bluetoothd[35]: @ MGMT Command: Rea.. (0x0001) plen 0 {0x0001} @ MGMT Event: Command Complete (0x0001) plen 6 {0x0001} Read Management Version Information (0x0001) plen 3 bluetoothd[35]: < ACL Data T.. flags 0x00 dlen 41 ATT: Write Command (0x52) len 36 Handle: 0x0043 Type: ASE Control Point (0x2bc6) Data: 020203000110270000022800020a00409c0001000110270000022800020a00409c00 Opcode: QoS Configuration (0x02) Number of ASE(s): 2 ASE: #0 ASE ID: 0x03 CIG ID: 0x00 CIS ID: 0x01 SDU Interval: 10000 usec Framing: Unframed (0x00) PHY: 0x02 LE 2M PHY (0x02) Max SDU: 40 RTN: 2 Max Transport Latency: 10 Presentation Delay: 40000 us ASE: #1 ASE ID: 0x01 CIG ID: 0x00 CIS ID: 0x01 SDU Interval: 10000 usec Framing: Unframed (0x00) PHY: 0x02 LE 2M PHY (0x02) Max SDU: 40 RTN: 2 Max Transport Latency: 10 Presentation Delay: 40000 us Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:37:42 -07:00
Luiz Augusto von Dentz	464c702fb9	Bluetooth: Init sk_peer_* on bt_sock_alloc This makes sure peer information is always available via sock when using bt_sock_alloc. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:37:22 -07:00
Luiz Augusto von Dentz	6bfa273e53	Bluetooth: Consolidate code around sk_alloc into a helper function This consolidates code around sk_alloc into bt_sock_alloc which does take care of common initialization. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:36:50 -07:00
Pauli Virtanen	7f74563e61	Bluetooth: ISO: do not emit new LE Create CIS if previous is pending LE Create CIS command shall not be sent before all CIS Established events from its previous invocation have been processed. Currently it is sent via hci_sync but that only waits for the first event, but there can be multiple. Make it wait for all events, and simplify the CIS creation as follows: Add new flag HCI_CONN_CREATE_CIS, which is set if Create CIS has been sent for the connection but it is not yet completed. Make BT_CONNECT state to mean the connection wants Create CIS. On events after which new Create CIS may need to be sent, send it if possible and some connections need it. These events are: hci_connect_cis, iso_connect_cfm, hci_cs_le_create_cis, hci_le_cis_estabilished_evt. The Create CIS status/completion events shall queue new Create CIS only if at least one of the connections transitions away from BT_CONNECT, so that we don't loop if controller is sending bogus events. This fixes sending multiple CIS Create for the same CIS in the "ISO AC 6(i) - Success" BlueZ test case: < HCI Command: LE Create Co.. (0x08\|0x0064) plen 9 #129 [hci0] Number of CIS: 2 CIS Handle: 257 ACL Handle: 42 CIS Handle: 258 ACL Handle: 42 > HCI Event: Command Status (0x0f) plen 4 #130 [hci0] LE Create Connected Isochronous Stream (0x08\|0x0064) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 29 #131 [hci0] LE Connected Isochronous Stream Established (0x19) Status: Success (0x00) Connection Handle: 257 ... < HCI Command: LE Setup Is.. (0x08\|0x006e) plen 13 #132 [hci0] ... > HCI Event: Command Complete (0x0e) plen 6 #133 [hci0] LE Setup Isochronous Data Path (0x08\|0x006e) ncmd 1 ... < HCI Command: LE Create Co.. (0x08\|0x0064) plen 5 #134 [hci0] Number of CIS: 1 CIS Handle: 258 ACL Handle: 42 > HCI Event: Command Status (0x0f) plen 4 #135 [hci0] LE Create Connected Isochronous Stream (0x08\|0x0064) ncmd 1 Status: ACL Connection Already Exists (0x0b) > HCI Event: LE Meta Event (0x3e) plen 29 #136 [hci0] LE Connected Isochronous Stream Established (0x19) Status: Success (0x00) Connection Handle: 258 ... Fixes: `c09b80be6f` ("Bluetooth: hci_conn: Fix not waiting for HCI_EVT_LE_CIS_ESTABLISHED") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:36:01 -07:00
Iulia Tanasescu	a0bfde167b	Bluetooth: ISO: Add support for connecting multiple BISes It is required for some configurations to have multiple BISes as part of the same BIG. Similar to the flow implemented for unicast, DEFER_SETUP will also be used to bind multiple BISes for the same BIG, before starting Periodic Advertising and creating the BIG. The user will have to open a new socket for each BIS. By setting the BT_DEFER_SETUP socket option and calling connect, a new connection will be added for the BIG and advertising handle set by the socket QoS parameters. Since all BISes will be bound for the same BIG and advertising handle, the socket QoS options and base parameters should match for all connections. By calling connect on a socket that does not have the BT_DEFER_SETUP option set, periodic advertising will be started and the BIG will be created, with a BIS for each previously bound connection. Since a BIG cannot be reconfigured with additional BISes after creation, no more connections can be bound for the BIG after the start periodic advertising and create BIG commands have been queued. The bis_cleanup function has also been updated, so that the advertising set and the BIG will not be terminated unless there are no more bound or connected BISes. The HCI_CONN_BIG_CREATED connection flag has been added to indicate that the BIG has been successfully created. This flag is checked at bis_cleanup, so that the BIG is only terminated if the HCI_LE_Create_BIG_Complete has been received. This implementation has been tested on hardware, using the "isotest" tool with an additional command line option, to specify the number of BISes to create as part of the desired BIG: tools/isotest -i hci0 -s 00:00:00:00:00:00 -N 2 -G 1 -T 1 The btmon log shows that a BIG containing 2 BISes has been created: < HCI Command: LE Create Broadcast Isochronous Group (0x08\|0x0068) plen 31 Handle: 0x01 Advertising Handle: 0x01 Number of BIS: 2 SDU Interval: 10000 us (0x002710) Maximum SDU size: 40 Maximum Latency: 10 ms (0x000a) RTN: 0x02 PHY: LE 2M (0x02) Packing: Sequential (0x00) Framing: Unframed (0x00) Encryption: 0x00 Broadcast Code: 00000000000000000000000000000000 > HCI Event: Command Status (0x0f) plen 4 LE Create Broadcast Isochronous Group (0x08\|0x0068) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 23 LE Broadcast Isochronous Group Complete (0x1b) Status: Success (0x00) Handle: 0x01 BIG Synchronization Delay: 1974 us (0x0007b6) Transport Latency: 1974 us (0x0007b6) PHY: LE 2M (0x02) NSE: 3 BN: 1 PTO: 1 IRC: 3 Maximum PDU: 40 ISO Interval: 10.00 msec (0x0008) Connection Handle #0: 10 Connection Handle #1: 11 < HCI Command: LE Setup Isochronous Data Path (0x08\|0x006e) plen 13 Handle: 10 Data Path Direction: Input (Host to Controller) (0x00) Data Path: HCI (0x00) Coding Format: Transparent (0x03) Company Codec ID: Ericsson Technology Licensing (0) Vendor Codec ID: 0 Controller Delay: 0 us (0x000000) Codec Configuration Length: 0 Codec Configuration: > HCI Event: Command Complete (0x0e) plen 6 LE Setup Isochronous Data Path (0x08\|0x006e) ncmd 1 Status: Success (0x00) Handle: 10 < HCI Command: LE Setup Isochronous Data Path (0x08\|0x006e) plen 13 Handle: 11 Data Path Direction: Input (Host to Controller) (0x00) Data Path: HCI (0x00) Coding Format: Transparent (0x03) Company Codec ID: Ericsson Technology Licensing (0) Vendor Codec ID: 0 Controller Delay: 0 us (0x000000) Codec Configuration Length: 0 Codec Configuration: > HCI Event: Command Complete (0x0e) plen 6 LE Setup Isochronous Data Path (0x08\|0x006e) ncmd 1 Status: Success (0x00) Handle: 11 < ISO Data TX: Handle 10 flags 0x02 dlen 44 < ISO Data TX: Handle 11 flags 0x02 dlen 44 > HCI Event: Number of Completed Packets (0x13) plen 5 Num handles: 1 Handle: 10 Count: 1 > HCI Event: Number of Completed Packets (0x13) plen 5 Num handles: 1 Handle: 11 Count: 1 Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:35:33 -07:00
Claudia Draghicescu	ae75336131	Bluetooth: Check for ISO support in controller This patch checks for ISO_BROADCASTER and ISO_SYNC_RECEIVER in controller. Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2023-08-11 11:31:23 -07:00
Jakub Kicinski	4d016ae42e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/intel/igc/igc_main.c `06b412589e` ("igc: Add lock to safeguard global Qbv variables") `d3750076d4` ("igc: Add TransmissionOverrun counter") drivers/net/ethernet/microsoft/mana/mana_en.c `a7dfeda6fd` ("net: mana: Fix MANA VF unload when hardware is unresponsive") `a9ca9f9cef` ("page_pool: split types and declarations from page_pool.h") `92272ec410` ("eth: add missing xdp.h includes in drivers") net/mptcp/protocol.h `511b90e392` ("mptcp: fix disconnect vs accept race") `b8dc6d6ce9` ("mptcp: fix rcv buffer auto-tuning") tools/testing/selftests/net/mptcp/mptcp_join.sh `c8c101ae39` ("selftests: mptcp: join: fix 'implicit EP' test") `03668c65d1` ("selftests: mptcp: join: rework detailed report") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 14:10:53 -07:00
Linus Torvalds	25aa0bebba	Including fixes from netfilter, wireless and bpf. Still trending up in size but the good news is that the "current" regressions are resolved, AFAIK. We're getting weirdly many fixes for Wake-on-LAN and suspend/resume handling on embedded this week (most not merged yet), not sure why. But those are all for older bugs. Current release - regressions: - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data over to TCP Current release - new code bugs: - eth: mlx5: correct IDs on VFs internal to the device (IPU) Previous releases - regressions: - phy: at803x: fix WoL support / reporting on AR8032 - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from slaves, leading to BUG_ON() - tun: prevent tun_build_skb() from exceeding the packet size limit - wifi: rtw89: fix 8852AE disconnection caused by RX full flags - eth/PCI: enetc: fix probing after `6fffbc7ae1` ("PCI: Honor firmware's device disabled status"), keep PCI devices around even if they are disabled / not going to be probed to be able to apply quirks on them - eth: prestera: fix handling IPv4 routes with nexthop IDs Previous releases - always broken: - netfilter: re-work garbage collection to avoid races between user-facing API and timeouts - tunnels: fix generating ipv4 PMTU error on non-linear skbs - nexthop: fix infinite nexthop bucket dump when using maximum nexthop ID - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() Misc: - unix: use consistent error code in SO_PEERPIDFD - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for upcoming IETF RFC Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTVMSsACgkQMUZtbf5S Irul3g//RlSANV/MWkiDmHIS5IhqkVWbvjGhFXFfdqZPH4gfgcX9VrsMuxgNM1Xu YXGx+rIu408qNNkVG2hpFMxPerRiqVB/XsH1TxRr0Mi6AMFoKGXS+cGwzSOaoMQj FYlcC6j2SnQ9N4I0qQuKOSOffyvyxrx/l9ozpVXsbGsOic1k6j1Ipwtf3+WP7dEe kkAPUlsQPdCIhMyQdK3X4xI1PGLtAXFgY3VV9bZ7u99l7QBwmconkl3GHq/xnPa8 Uyll005ThyYce0c4EPVcrY1YBXyY0LjOBIRtiTFAk6CMWc0Su8Ug/i4+K2KTq0eh yjqqHkpR//ruLgtAXBLLE9mxma8448vmmex/cSLIBaMAttlnj9n2LvCqvbzNfTZA ssnKO4D3HhoQvHqbeOOW6VzVX7XyhomOvQXihfdLUs9u2tKE3nQoU+QCnrnIUXZO VF5/ubCERRdZDPQ1SSAktimlC0R1qVL7JPMRaQF0aW5xByabbEWwMaNiwkYQOh2o w2KsbhM/vWyd+5JB412LrNsEgK1BV6WjgwzC+27YQ7QD/JxUZBUghL0ps2jgU2Lu d4YdbBOgYz+xyUBPByeYzcac0SIeMkB/UEcaO54ySWU8GcWYLt4KXwydUq/cXlw0 rUDCO5bikMxmygLKtnTSwmwvGbGByEXbGvVKwUwNPqTnR+vPIbM= =NZgp -----END PGP SIGNATURE----- Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, wireless and bpf. Still trending up in size but the good news is that the "current" regressions are resolved, AFAIK. We're getting weirdly many fixes for Wake-on-LAN and suspend/resume handling on embedded this week (most not merged yet), not sure why. But those are all for older bugs. Current release - regressions: - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data over to TCP Current release - new code bugs: - eth: mlx5: correct IDs on VFs internal to the device (IPU) Previous releases - regressions: - phy: at803x: fix WoL support / reporting on AR8032 - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from slaves, leading to BUG_ON() - tun: prevent tun_build_skb() from exceeding the packet size limit - wifi: rtw89: fix 8852AE disconnection caused by RX full flags - eth/PCI: enetc: fix probing after `6fffbc7ae1` ("PCI: Honor firmware's device disabled status"), keep PCI devices around even if they are disabled / not going to be probed to be able to apply quirks on them - eth: prestera: fix handling IPv4 routes with nexthop IDs Previous releases - always broken: - netfilter: re-work garbage collection to avoid races between user-facing API and timeouts - tunnels: fix generating ipv4 PMTU error on non-linear skbs - nexthop: fix infinite nexthop bucket dump when using maximum nexthop ID - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() Misc: - unix: use consistent error code in SO_PEERPIDFD - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for upcoming IETF RFC" * tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) net: hns3: fix strscpy causing content truncation issue net: tls: set MSG_SPLICE_PAGES consistently ibmvnic: Ensure login failure recovery is safe from other resets ibmvnic: Do partial reset on login failure ibmvnic: Handle DMA unmapping of login buffs in release functions ibmvnic: Unmap DMA login rsp buffer on send login fail ibmvnic: Enforce stronger sanity checks on login response net: mana: Fix MANA VF unload when hardware is unresponsive netfilter: nf_tables: remove busy mark and gc batch API netfilter: nft_set_hash: mark set element as dead when deleting from packet path netfilter: nf_tables: adapt set backend to use GC transaction API netfilter: nf_tables: GC transaction API to avoid race with control plane selftests/bpf: Add sockmap test for redirecting partial skb data selftests/bpf: fix a CI failure caused by vsock sockmap test bpf, sockmap: Fix bug that strp_done cannot be called bpf, sockmap: Fix map type error in sock_map_del_link xsk: fix refcount underflow in error path ipv6: adjust ndisc_is_useropt() to also return true for PIO selftests: forwarding: bridge_mdb: Make test more robust selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet ...	2023-08-10 12:37:24 -07:00
Jakub Kicinski	6b486676b4	net: tls: set MSG_SPLICE_PAGES consistently We used to change the flags for the last segment, because non-last segments had the MSG_SENDPAGE_NOTLAST flag set. That flag is no longer a thing so remove the setting. Since flags most likely don't have MSG_SPLICE_PAGES set this avoids passing parts of the sg as splice and parts as non-splice. Before commit under Fixes we'd have called tcp_sendpage() which would add the MSG_SPLICE_PAGES. Why this leads to trouble remains unclear but Tariq reports hitting the WARN_ON(!sendpage_ok()) due to page refcount of 0. Fixes: `e117dcfd64` ("tls: Inline do_tcp_sendpages()") Reported-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/ Tested-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 11:36:57 -07:00
Jakub Kicinski	3e91b0ebd9	netfilter pull request 23-08-10 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmTUhbUACgkQ1V2XiooU IORz1Q//a2fDuMsK5iW1BlF4y0P9aQUSVV//r3DYaoYOspJhsB2yZu4HtL+XQJvY yncwg+ub24yQh5sUNSJnZztQVTN+NPY9Vl2TkXXMx6Wxs2XenmgzZmDdghUDzhTd DuOjIGVEJ2M6XpPAOub89sqL+E0K7J0/q0aIcV0K0/xKo7U/z3vgLv4aZx/ZjPCV daj3gcGpYQ1JJ9pi2se2yh89dzT321U7yYde9ek0TUeKGdCFJkfHkqMurwbcgoJ8 jkx5NOtrp+GLbhd+ME86IUtD+Edm46+bJUxvG0My99CVlak7y5gJh/aPxpAPACuW NhWWy26kivVRWyttLQk0ScZfbO1CIwvaPzQC+QdlFdNA1eWTMhEk6AG2dVaU9CNB V9WKWv59CPaDwPCKhXiPLQ9J+Kds7oyHPXGlV2dDOuSmJ9QbOh/HBQGEm/mI93qX Fr+qqP3A9/juXZ5FdSLT2pJPuVlXdhQdgyHgiunyDPHoL9q7GFn5aQL/BVKE23tc bgMez0GKzBR0waS9cycFSVls1rQN1XUIdoD6SLaRYq9FkKcCx+YGn3LH44Y1feL/ UnLMFlt9xIG4dPbGcGGy4r7mB53JpglHEqJEftvsNcBEd/r/f+4JP+/fa9FJ70uZ GpGmv7Wo5DZT5V8LaMeWDWpJl6G7UcxrFOyDTw27l2OOVNaD2Ic= =KNf7 -----END PGP SIGNATURE----- Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The existing attempt to resolve races between control plane and GC work is error prone, as reported by Bien Pham <phamnnb@sea.com>, some places forgot to call nft_set_elem_mark_busy(), leading to double-deactivation of elements. This series contains the following patches: 1) Do not skip expired elements during walk otherwise elements might never decrement the reference counter on data, leading to memleak. 2) Add a GC transaction API to replace the former attempt to deal with races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT on elements and it creates a GC transaction to remove the expired elements, GC transaction could abort in case of interference with control plane and retried later (GC async). Set backends such as rbtree and pipapo also perform GC from control plane (GC sync), in such case, element deactivation and removal is safe because mutex is held then collected elements are released via call_rcu(). 3) Adapt existing set backends to use the GC transaction API. 4) Update rhash set backend to set on _DEAD bit to report deleted elements from datapath for GC. 5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT. * tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: remove busy mark and gc batch API netfilter: nft_set_hash: mark set element as dead when deleting from packet path netfilter: nf_tables: adapt set backend to use GC transaction API netfilter: nf_tables: GC transaction API to avoid race with control plane netfilter: nf_tables: don't skip expired elements during walk ==================== Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 10:47:08 -07:00
Jakub Kicinski	62d02fca8b	bpf pull-request 2023-08-09 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZNRuIQAKCRBar9k/UBDW 4++9AP9ymOcPOKTKdQwZ6cnq3vkmvN37H6teufTyM8vsCha9NAD+OQE+vg1304RM aETtG6d5Nb+byIHZGJrdUyT7g9jRzgw= =qr/C -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Martin KaFai Lau says: ==================== pull-request: bpf 2023-08-09 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 6 files changed, 102 insertions(+), 8 deletions(-). The main changes are: 1) A bpf sockmap memleak fix and a fix in accessing the programs of a sockmap under the incorrect map type from Xu Kuohai. 2) A refcount underflow fix in xsk from Magnus Karlsson. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add sockmap test for redirecting partial skb data selftests/bpf: fix a CI failure caused by vsock sockmap test bpf, sockmap: Fix bug that strp_done cannot be called bpf, sockmap: Fix map type error in sock_map_del_link xsk: fix refcount underflow in error path ==================== Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 10:41:36 -07:00
Pablo Neira Ayuso	a2dd0233cb	netfilter: nf_tables: remove busy mark and gc batch API Ditch it, it has been replace it by the GC transaction API and it has no clients anymore. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2023-08-10 08:25:27 +02:00
Pablo Neira Ayuso	c92db30304	netfilter: nft_set_hash: mark set element as dead when deleting from packet path Set on the NFT_SET_ELEM_DEAD_BIT flag on this element, instead of performing element removal which might race with an ongoing transaction. Enable gc when dynamic flag is set on since dynset deletion requires garbage collection after this patch. Fixes: `d0a8d877da` ("netfilter: nft_dynset: support for element deletion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2023-08-10 08:25:27 +02:00
Pablo Neira Ayuso	f6c383b8c3	netfilter: nf_tables: adapt set backend to use GC transaction API Use the GC transaction API to replace the old and buggy gc API and the busy mark approach. No set elements are removed from async garbage collection anymore, instead the _DEAD bit is set on so the set element is not visible from lookup path anymore. Async GC enqueues transaction work that might be aborted and retried later. rbtree and pipapo set backends does not set on the _DEAD bit from the sync GC path since this runs in control plane path where mutex is held. In this case, set elements are deactivated, removed and then released via RCU callback, sync GC never fails. Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: `8d8540c4f5` ("netfilter: nft_set_rbtree: add timeout support") Fixes: `9d0982927e` ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2023-08-10 08:25:27 +02:00
Pablo Neira Ayuso	5f68718b34	netfilter: nf_tables: GC transaction API to avoid race with control plane The set types rhashtable and rbtree use a GC worker to reclaim memory. From system work queue, in periodic intervals, a scan of the table is done. The major caveat here is that the nft transaction mutex is not held. This causes a race between control plane and GC when they attempt to delete the same element. We cannot grab the netlink mutex from the work queue, because the control plane has to wait for the GC work queue in case the set is to be removed, so we get following deadlock: cpu 1 cpu2 GC work transaction comes in , lock nft mutex `acquire nft mutex // BLOCKS transaction asks to remove the set set destruction calls cancel_work_sync() cancel_work_sync will now block forever, because it is waiting for the mutex the caller already owns. This patch adds a new API that deals with garbage collection in two steps: 1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT so they are not visible via lookup. Annotate current GC sequence in the GC transaction. Enqueue GC transaction work as soon as it is full. If ruleset is updated, then GC transaction is aborted and retried later. 2) GC work grabs the mutex. If GC sequence has changed then this GC transaction lost race with control plane, abort it as it contains stale references to objects and let GC try again later. If the ruleset is intact, then this GC transaction deactivates and removes the elements and it uses call_rcu() to destroy elements. Note that no elements are removed from GC lockless path, the _DEAD bit is set and pointers are collected. GC catchall does not remove the elements anymore too. There is a new set->dead flag that is set on to abort the GC transaction to deal with set->ops->destroy() path which removes the remaining elements in the set from commit_release, where no mutex is held. To deal with GC when mutex is held, which allows safe deactivate and removal, add sync GC API which releases the set element object via call_rcu(). This is used by rbtree and pipapo backends which also perform garbage collection from control plane path. Since element removal from sets can happen from control plane and element garbage collection/timeout, it is necessary to keep the set structure alive until all elements have been deactivated and destroyed. We cannot do a cancel_work_sync or flush_work in nft_set_destroy because its called with the transaction mutex held, but the aforementioned async work queue might be blocked on the very mutex that nft_set_destroy() callchain is sitting on. This gives us the choice of ABBA deadlock or UaF. To avoid both, add set->refs refcount_t member. The GC API can then increment the set refcount and release it once the elements have been free'd. Set backends are adapted to use the GC transaction API in a follow up patch entitled: ("netfilter: nf_tables: use gc transaction API in set backends") This is joint work with Florian Westphal. Fixes: `cfed7e1b1f` ("netfilter: nf_tables: add set garbage collection helpers") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2023-08-10 08:25:16 +02:00
Xu Kuohai	809e4dc71a	bpf, sockmap: Fix bug that strp_done cannot be called strp_done is only called when psock->progs.stream_parser is not NULL, but stream_parser was set to NULL by sk_psock_stop_strp(), called by sk_psock_drop() earlier. So, strp_done can never be called. Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock. Change the condition for calling strp_done from judging whether stream_parser is set to judging whether this flag is set. This flag is only set once when strp_init() succeeds, and will never be cleared later. Fixes: `c0d95d3380` ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-09 20:29:02 -07:00
Xu Kuohai	7e96ec0e66	bpf, sockmap: Fix map type error in sock_map_del_link sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although both types have member named "progs", the offset of "progs" member in these two types is different, so "progs" should be accessed with the real map type. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-09 20:29:02 -07:00
Magnus Karlsson	85c2c79a07	xsk: fix refcount underflow in error path Fix a refcount underflow problem reported by syzbot that can happen when a system is running out of memory. If xp_alloc_tx_descs() fails, and it can only fail due to not having enough memory, then the error path is triggered. In this error path, the refcount of the pool is decremented as it has incremented before. However, the reference to the pool in the socket was not nulled. This means that when the socket is closed later, the socket teardown logic will think that there is a pool attached to the socket and try to decrease the refcount again, leading to a refcount underflow. I chose this fix as it involved adding just a single line. Another option would have been to move xp_get_pool() and the assignment of xs->pool to after the if-statement and using xs_umem->pool instead of xs->pool in the whole if-statement resulting in somewhat simpler code, but this would have led to much more churn in the code base perhaps making it harder to backport. Fixes: `ba3beec2ec` ("xsk: Fix possible crash when multiple sockets are created") Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-09 20:08:04 -07:00
Vladimir Oltean	665338b2a7	net/sched: taprio: dump class stats for the actual q->qdiscs[] This makes a difference for the software scheduling mode, where dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself, but when we're talking about what Qdisc and stats get reported for a traffic class, the root taprio isn't what comes to mind, but q->qdiscs[] is. To understand the difference, I've attempted to send 100 packets in software mode through class 8001:5, and recorded the stats before and after the change. Here is before: $ tc -s class show dev eth0 class taprio 8001:1 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:2 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:3 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:4 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:5 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:6 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:7 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:8 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 and here is after: class taprio 8001:1 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:2 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:3 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:4 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:5 root Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:6 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:7 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:8 root leaf 800d: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 The most glaring (and expected) difference is that before, all class stats reported the global stats, whereas now, they really report just the counters for that traffic class. Finally, Pedro Tammela points out that there is a tc selftest which checks specifically which handle do the child Qdiscs corresponding to each class have. That's changing here - taprio no longer reports tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0 (the handle of the default pfifo child Qdiscs). Since iproute2 does not print a child Qdisc handle of 0, adjust the test's expected output. Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:59:21 -07:00
Vladimir Oltean	6e0ec800c1	net/sched: taprio: delete misleading comment about preallocating child qdiscs As mentioned in commit `af7b29b1de` ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport between Qdisc_ops :: init() and Qdisc_ops :: attach(). Delete the comment, which is just stolen from mqprio, but there, the usage patterns are a lot different, and this is nothing but confusing. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:59:20 -07:00
Vladimir Oltean	98766add2d	net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf() This is another stab at commit `1461d212ab` ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later reverted in commit `af7b29b1de` ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs""). I believe that the problems that caused the revert were fixed, and thus, this change is identical to the original patch. Its purpose is to properly reject attaching a software taprio child qdisc to a software taprio parent. Because unoffloaded taprio currently reports itself (the root Qdisc) as the return value from qdisc_leaf(), then the process of attaching another taprio as child to a Qdisc class of the root will just result in a Qdisc_ops :: change() call for the root. Whereas that's not we want. We want Qdisc_ops :: init() to be called for the taprio child, in order to give the taprio child a chance to check whether its sch->parent is TC_H_ROOT or not (and reject this configuration). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:59:20 -07:00
Vladimir Oltean	25b0d4e4e4	net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode Normally, Qdiscs have one reference on them held by their owner and one held for each TXQ to which they are attached, however this is not the case with the children of an offloaded taprio. Instead, the taprio qdisc currently lives in the following fragile equilibrium. In the software scheduling case, taprio attaches itself (the root Qdisc) to all TXQs, thus having a refcount of 1 + the number of TX queues. In this mode, the q->qdiscs[] children are not visible directly to the Qdisc API. The lifetime of the Qdiscs from this private array lasts until qdisc_destroy() -> taprio_destroy(). In the fully offloaded case, the root taprio has a refcount of 1, and all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[] are attached to the netdev TXQs directly and thus are visible to the Qdisc API, however taprio loses a reference to them very early - during qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees the q->qdiscs[] array to not leak memory, but interestingly, it does not release a reference on these qdiscs because it doesn't effectively own them - they are created by taprio but owned by the Qdisc core, and will be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when the Qdisc is deleted or when the child Qdisc is replaced with something else. My interest is to change this equilibrium such that taprio also owns a reference on the q->qdiscs[] child Qdiscs for the lifetime of the root Qdisc, including in full offload mode. I want this because I would like taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have insight into q->qdiscs[] for the software scheduling mode - currently they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same as the root taprio. The following set of changes is necessary: - don't free q->qdiscs[] early in taprio_attach(), free it late in taprio_destroy() for consistency with software mode. But: - currently that's not possible, because taprio doesn't own a reference on q->qdiscs[]. So hold that reference - once during the initial attach() and once during subsequent graft() calls when the child is changed. - always keep track of the current child in q->qdiscs[], even for full offload mode, so that we free in taprio_destroy() what we should, and not something stale. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:59:20 -07:00
Vladimir Oltean	09e0c3bbde	net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach() This is a simple code transformation with no intended behavior change, just to make it absolutely clear that q->qdiscs[] is only attached to the child taprio classes in full offload mode. Right now we use the q->qdiscs[] variable in taprio_attach() for software mode too, but that is quite confusing and avoidable. We use it only to reach the netdev TX queue, but we could as well just use netdev_get_tx_queue() for that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:59:20 -07:00
Maciej Żenczykowski	048c796beb	ipv6: adjust ndisc_is_useropt() to also return true for PIO The upcoming (and nearly finalized): https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/ will update the IPv6 RA to include a new flag in the PIO field, which will serve as a hint to perform DHCPv6-PD. As we don't want DHCPv6 related logic inside the kernel, this piece of information needs to be exposed to userspace. The simplest option is to simply expose the entire PIO through the already existing mechanism. Even without this new flag, the already existing PIO R (router address) flag (from RFC6275) cannot AFAICT be handled entirely in kernel, and provides useful information that should be exposed to userspace (the router's global address, for use by Mobile IPv6). Also cc'ing stable@ for inclusion in LTS, as while technically this is not quite a bugfix, and instead more of a feature, it is absolutely trivial and the alternative is manually cherrypicking into all Android Common Kernel trees - and I know Greg will ask for it to be sent in via LTS instead... Cc: Jen Linkova <furry@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> Cc: stable@vger.kernel.org Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20230807102533.1147559-1-maze@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:36:12 -07:00
Nick Desaulniers	fa1891aeb7	net/llc/llc_conn.c: fix 4 instances of -Wmissing-variable-declarations I'm looking to enable -Wmissing-variable-declarations behind W=1. 0day bot spotted the following instances: net/llc/llc_conn.c:44:5: warning: no previous extern declaration for non-static variable 'sysctl_llc2_ack_timeout' [-Wmissing-variable-declarations] 44 \| int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ; \| ^ net/llc/llc_conn.c:44:1: note: declare 'static' if the variable is not intended to be used outside of this translation unit 44 \| int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ; \| ^ net/llc/llc_conn.c:45:5: warning: no previous extern declaration for non-static variable 'sysctl_llc2_p_timeout' [-Wmissing-variable-declarations] 45 \| int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ; \| ^ net/llc/llc_conn.c:45:1: note: declare 'static' if the variable is not intended to be used outside of this translation unit 45 \| int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ; \| ^ net/llc/llc_conn.c:46:5: warning: no previous extern declaration for non-static variable 'sysctl_llc2_rej_timeout' [-Wmissing-variable-declarations] 46 \| int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ; \| ^ net/llc/llc_conn.c:46:1: note: declare 'static' if the variable is not intended to be used outside of this translation unit 46 \| int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ; \| ^ net/llc/llc_conn.c:47:5: warning: no previous extern declaration for non-static variable 'sysctl_llc2_busy_timeout' [-Wmissing-variable-declarations] 47 \| int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ; \| ^ net/llc/llc_conn.c:47:1: note: declare 'static' if the variable is not intended to be used outside of this translation unit 47 \| int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ; \| ^ These symbols are referenced by more than one translation unit, so make include the correct header for their declarations. Finally, sort the list of includes to help keep them tidy. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/llvm/202308081000.tTL1ElTr-lkp@intel.com/ Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230808-llc_static-v1-1-c140c4c297e4@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:34:28 -07:00
Eric Dumazet	1ded5e5a59	net: annotate data-races around sock->ops IPV6_ADDRFORM socket option is evil, because it can change sock->ops while other threads might read it. Same issue for sk->sk_family being set to AF_INET. Adding READ_ONCE() over sock->ops reads is needed for sockets that might be impacted by IPV6_ADDRFORM. Note that mptcp_is_tcpsk() can also overwrite sock->ops. Adding annotations for all sk->sk_family reads will require more patches :/ BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0: do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491 ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012 udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690 sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663 __sys_setsockopt+0x1c3/0x230 net/socket.c:2273 __do_sys_setsockopt net/socket.c:2284 [inline] __se_sys_setsockopt net/socket.c:2281 [inline] __x64_sys_setsockopt+0x66/0x80 net/socket.c:2281 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1: sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x349/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmmsg+0x263/0x500 net/socket.c:2643 __do_sys_sendmmsg net/socket.c:2672 [inline] __se_sys_sendmmsg net/socket.c:2669 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0xffffffff850e32b8 -> 0xffffffff850da890 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023 Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:32:43 -07:00
Jakub Kicinski	15c8795dbf	Just a few small updates: * fix an integer overflow in nl80211 * fix rtw89 8852AE disconnections * fix a buffer overflow in ath12k * fix AP_VLAN configuration lookups * fix allocation failure handling in brcm80211 * update MAINTAINERS for some drivers -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmTTitIACgkQ10qiO8sP aAAF/hAAnyF2Q4rjtfelRRj0ghR5uLxzIItNtkeWG5Z2KyGpbzF94ESMGJ/PnD/9 rcpEhj+KCKB7ZojHRgcleBSOds6yMTj0m9XJ7iMA/QYnV45Gi+cnlIiKyxSmpHBT jSpddG4BLEUGNd8qwghJlK6ApqtVuFRDw3nBXhPEnc9z6ohNHVAOXXjNP2FWAwWA 3Xh4/IVK8ayLlmwyWFOKs1V2dx+rqfcOa/PXt4NK+/sIPrPOwbhgGSJed+QFosI7 btuKjG1uQAXBbL5/zRwFrVnKqUBcqnX3Fk4NJgJDhxhh1ei9hfdxNDFECjjI6mb+ rnPjZMBGv+3u7SgyH0avdUulb5j5tLHZJMMhbDNPgccIL/sxsi6iErUbhbYsmo72 HqHRLw4Cw5OaFFAZZhlmyeUzVDSD67MElqiyV2sBSU6/QQG4BYqCfo9EkuQLQ7g/ TE9zsklzpMIjgBL3ERl8r5LpbJqU7m4mmjncTQrB/o6SDbvmXmzIZoD7HuCM0z7r SVgMcPig6i7taL/UkdzsqI/nmyo3TtRMD6pcxW3UIUJkFBJ+qwJIdCeDj3UNaOtY xfMXnemx0C628Gdtbwrsyd3v5pbE0tWYXbG7vJIqE4cuNc2x5K+lQSyaefaKau+e wamtQ6+hkv6kVYYBYvZ7yA/7Tfi3G3msrh8Oof0DQ93n9uA6EL8= =Q+kn -----END PGP SIGNATURE----- Merge tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just a few small updates: * fix an integer overflow in nl80211 * fix rtw89 8852AE disconnections * fix a buffer overflow in ath12k * fix AP_VLAN configuration lookups * fix allocation failure handling in brcm80211 * update MAINTAINERS for some drivers * tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: ath12k: Fix buffer overflow when scanning with extraie wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() wifi: cfg80211: fix sband iftype data lookup for AP_VLAN wifi: rtw89: fix 8852AE disconnection caused by RX full flags MAINTAINERS: Remove tree entry for rtl8180 MAINTAINERS: Update entry for rtl8187 wifi: brcm80211: handle params_v1 allocation failure ==================== Link: https://lore.kernel.org/r/20230809124818.167432-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 15:04:44 -07:00
Jakub Kicinski	052059b663	nf-next pull request 2023-08-08 -----BEGIN PGP SIGNATURE----- iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTSNyYNHGZ3QHN0cmxl bi5kZQAKCRBwkajZrV/2AOpyD/4pRjBvLgU0O+33x3l/X8z80QW6VMwq6PDUDmNA t0AFnhIx0v6yky0abLzGGV9q2N9SLdNltTsqb0pem00f1TncKR1BlHNttfU+yjS+ qmkuUTJ8HixxkbBRKB9E7kA3IPM2aj1gxd3sji/QHKMT8XtcE5ufoad8/jcM9px2 FHNHJ8Onwl8ohjV4qNeuPe8XWm47pN/FeaxK7jRrKJaCal0P96sT8AGf/Rvx5VNY jCysb2+fIKMzHssbLcRr1UDMJJFqtcQx0alnzwxh4sEPsmgYYR7UGmcDku4pSbtB uJBrjMnLpORpw1l2syuYiiyEy+VRAAIjWAUxb6oTOvDhj1Yj2ki/915b/Hl/jnqa q8EUm6i+B4CuiE8LCj0WLG2gKO7vRjFnDH/Li/qiFMUHzW/HmnLRituxTVTXwopC 1CxFkekNIklxLr+n21dP6f+NJU9hIs1nw+iy0JhLffTV8u6TosZ3Ve5ZInUQ4Bna hZUyksy1s42fi0oGHN1Gi3AWPhUIlH69lKXTOher/9PvC+rL5/l8LFVywqtYxT4t HWJwBtpm58IrsDttgPm6fnJqIgrm5mcW+FJYqP9Td6NjUvyfafhJsdhXJ9Bs1lJV SfJp4iEUkOx4bfUi0rDFb/8NR2fwrCpbbgYx2xdS+tHQ2BkRlAOxYJC9Q8YP2vxH Mjgc9w== =XfrS -----END PGP SIGNATURE----- Merge tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter updates for net-next First 4 Patches, from Yue Haibing, remove unused prototypes in various netfilter headers. Last patch makes nfnetlink_log to always include a packet timestamp, up to now it was only included if the skb had assigned previously. From Maciej Żenczykowski. * tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nfnetlink_log: always add a timestamp netfilter: h323: Remove unused function declarations netfilter: conntrack: Remove unused function declarations netfilter: helper: Remove unused function declarations netfilter: gre: Remove unused function declaration nf_ct_gre_keymap_flush() ==================== Link: https://lore.kernel.org/r/20230808124159.19046-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:52:36 -07:00
Ido Schimmel	8743aeff5b	nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID A netlink dump callback can return a positive number to signal that more information needs to be dumped or zero to signal that the dump is complete. In the second case, the core netlink code will append the NLMSG_DONE message to the skb in order to indicate to user space that the dump is complete. The nexthop bucket dump callback always returns a positive number if nexthop buckets were filled in the provided skb, even if the dump is complete. This means that a dump will span at least two recvmsg() calls as long as nexthop buckets are present. In the last recvmsg() call the dump callback will not fill in any nexthop buckets because the previous call indicated that the dump should restart from the last dumped nexthop ID plus one. # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id 10 group 1 type resilient buckets 2 # strace -e sendto,recvmsg -s 5 ip nexthop bucket sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST\|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? /, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_TRUNC) = 128 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128 id 10 index 0 idle_time 6.66 nhid 1 id 10 index 1 idle_time 6.66 nhid 1 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_TRUNC) = 20 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20 +++ exited with 0 +++ This behavior is both inefficient and buggy. If the last nexthop to be dumped had the maximum ID of 0xffffffff, then the dump will restart from 0 (0xffffffff + 1) and never end: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id $((232-1)) group 1 type resilient buckets 2 # ip nexthop bucket id 4294967295 index 0 idle_time 5.55 nhid 1 id 4294967295 index 1 idle_time 5.55 nhid 1 id 4294967295 index 0 idle_time 5.55 nhid 1 id 4294967295 index 1 idle_time 5.55 nhid 1 [...] Fix by adjusting the dump callback to return zero when the dump is complete. After the fix only one recvmsg() call is made and the NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id $((232-1)) group 1 type resilient buckets 2 # strace -e sendto,recvmsg -s 5 ip nexthop bucket sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST\|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 / NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_TRUNC) = 148 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148 id 4294967295 index 0 idle_time 6.61 nhid 1 id 4294967295 index 1 idle_time 6.61 nhid 1 +++ exited with 0 +++ Note that if the NLMSG_DONE message cannot be appended because of size limitations, then another recvmsg() will be needed, but the core netlink code will not invoke the dump callback and simply reply with a NLMSG_DONE message since it knows that the callback previously returned zero. Add a test that fails before the fix: # ./fib_nexthops.sh -t basic_res [...] TEST: Maximum nexthop ID dump [FAIL] [...] And passes after it: # ./fib_nexthops.sh -t basic_res [...] TEST: Maximum nexthop ID dump [ OK ] [...] Fixes: `8a1bbabb03` ("nexthop: Add netlink handlers for bucket dump") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:45:12 -07:00
Ido Schimmel	f10d3d9df4	nexthop: Make nexthop bucket dump more efficient rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging to a specific resilient nexthop group. The function returns a positive return code (the skb length) upon both success and failure. The above behavior is problematic. When a complete nexthop bucket dump is requested, the function that walks the different nexthops treats the non-zero return code as an error. This causes buckets belonging to different resilient nexthop groups to be dumped using different buffers even if they can all fit in the same buffer: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id 10 group 1 type resilient buckets 1 # ip nexthop add id 20 group 1 type resilient buckets 1 # strace -e recvmsg -s 0 ip nexthop bucket [...] recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64 id 10 index 0 idle_time 10.27 nhid 1 [...] recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64 id 20 index 0 idle_time 6.44 nhid 1 [...] Fix by only returning a non-zero return code when an error occurred and restarting the dump from the bucket index we failed to fill in. This allows buckets belonging to different resilient nexthop groups to be dumped using the same buffer: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id 10 group 1 type resilient buckets 1 # ip nexthop add id 20 group 1 type resilient buckets 1 # strace -e recvmsg -s 0 ip nexthop bucket [...] recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128 id 10 index 0 idle_time 30.21 nhid 1 id 20 index 0 idle_time 26.7 nhid 1 [...] While this change is more of a performance improvement change than an actual bug fix, it is a prerequisite for a subsequent patch that does fix a bug. Fixes: `8a1bbabb03` ("nexthop: Add netlink handlers for bucket dump") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:45:04 -07:00
Ido Schimmel	913f60cacd	nexthop: Fix infinite nexthop dump when using maximum nexthop ID A netlink dump callback can return a positive number to signal that more information needs to be dumped or zero to signal that the dump is complete. In the second case, the core netlink code will append the NLMSG_DONE message to the skb in order to indicate to user space that the dump is complete. The nexthop dump callback always returns a positive number if nexthops were filled in the provided skb, even if the dump is complete. This means that a dump will span at least two recvmsg() calls as long as nexthops are present. In the last recvmsg() call the dump callback will not fill in any nexthops because the previous call indicated that the dump should restart from the last dumped nexthop ID plus one. # ip nexthop add id 1 blackhole # strace -e sendto,recvmsg -s 5 ip nexthop sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST\|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? /, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_TRUNC) = 36 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36 id 1 blackhole recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_TRUNC) = 20 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20 +++ exited with 0 +++ This behavior is both inefficient and buggy. If the last nexthop to be dumped had the maximum ID of 0xffffffff, then the dump will restart from 0 (0xffffffff + 1) and never end: # ip nexthop add id $((232-1)) blackhole # ip nexthop id 4294967295 blackhole id 4294967295 blackhole [...] Fix by adjusting the dump callback to return zero when the dump is complete. After the fix only one recvmsg() call is made and the NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response: # ip nexthop add id $((232-1)) blackhole # strace -e sendto,recvmsg -s 5 ip nexthop sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST\|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 / NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK\|MSG_TRUNC) = 56 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56 id 4294967295 blackhole +++ exited with 0 +++ Note that if the NLMSG_DONE message cannot be appended because of size limitations, then another recvmsg() will be needed, but the core netlink code will not invoke the dump callback and simply reply with a NLMSG_DONE message since it knows that the callback previously returned zero. Add a test that fails before the fix: # ./fib_nexthops.sh -t basic [...] TEST: Maximum nexthop ID dump [FAIL] [...] And passes after it: # ./fib_nexthops.sh -t basic [...] TEST: Maximum nexthop ID dump [ OK ] [...] Fixes: `ab84be7e54` ("net: Initial nexthop code") Reported-by: Petr Machata <petrm@nvidia.com> Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:44:36 -07:00
Vlad Buslov	718cb09aaa	vlan: Fix VLAN 0 memory leak The referenced commit intended to fix memleak of VLAN 0 that is implicitly created on devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. However, it doesn't take into account that the feature can be re-set during the netdevice lifetime which will cause memory leak if feature is disabled during the device deletion as illustrated by [0]. Fix the leak by unconditionally deleting VLAN 0 on NETDEV_DOWN event. [0]: > modprobe 8021q > ip l set dev eth2 up > ethtool -K eth2 rx-vlan-filter off > modprobe -r mlx5_ib > modprobe -r mlx5_core > cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888103dcd900 (size 256): comm "ip", pid 1490, jiffies 4294907305 (age 325.364s) hex dump (first 32 bytes): 00 80 5d 03 81 88 ff ff 00 00 00 00 00 00 00 00 ..]............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000899f3bb9>] kmalloc_trace+0x25/0x80 [<000000002889a7a2>] vlan_vid_add+0xa0/0x210 [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q] [<000000009a0716b1>] notifier_call_chain+0x35/0xb0 [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0 [<0000000053d2b05d>] dev_change_flags+0x4d/0x60 [<00000000982807e9>] do_setlink+0x28d/0x10a0 [<0000000058c1be00>] __rtnl_newlink+0x545/0x980 [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70 [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390 [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100 [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0 [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0 [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60 [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200 [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0 unreferenced object 0xffff88813354fde0 (size 32): comm "ip", pid 1490, jiffies 4294907305 (age 325.364s) hex dump (first 32 bytes): a0 d9 dc 03 81 88 ff ff a0 d9 dc 03 81 88 ff ff ................ 81 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000899f3bb9>] kmalloc_trace+0x25/0x80 [<000000002da64724>] vlan_vid_add+0xdf/0x210 [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q] [<000000009a0716b1>] notifier_call_chain+0x35/0xb0 [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0 [<0000000053d2b05d>] dev_change_flags+0x4d/0x60 [<00000000982807e9>] do_setlink+0x28d/0x10a0 [<0000000058c1be00>] __rtnl_newlink+0x545/0x980 [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70 [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390 [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100 [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0 [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0 [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60 [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200 [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0 Fixes: `efc73f4bbc` ("net: Fix memory leak - vlan_info struct") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20230808093521.1468929-1-vladbu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:44:27 -07:00
Russell King (Oracle)	145622771d	net: dsa: mark parsed interface mode for legacy switch drivers If we successfully parsed an interface mode with a legacy switch driver, populate that mode into phylink's supported interfaces rather than defaulting to the internal and gmii interfaces. This hasn't caused an issue so far, because when the interface doesn't match a supported one, phylink_validate() doesn't clear the supported mask, but instead returns -EINVAL. phylink_parse_fixedlink() doesn't check this return value, and merely relies on the supported ethtool link modes mask being cleared. Therefore, the fixed link settings end up being allowed despite validation failing. Before this causes a problem, arrange for DSA to more accurately populate phylink's supported interfaces mask so validation can correctly succeed. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/E1qTKdM-003Cpx-Eh@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:08:09 -07:00
Jiri Pirko	832140804e	devlink: clear flag on port register error path When xarray insertion fails, clear the flag. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230808082020.1363497-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:06:21 -07:00
Yue Haibing	ca76b386d4	tipc: Remove unused declaration tipc_link_build_bc_sync_msg() Commit `5266698661` ("tipc: let broadcast packet reception use new link receive function") declared but never implemented this. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230807142926.45752-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-09 13:03:14 -07:00
Breno Leitao	8e9fad0e70	io_uring: Add io_uring command support for sockets Enable io_uring commands on network sockets. Create two new SOCKET_URING_OP commands that will operate on sockets. In order to call ioctl on sockets, use the file_operations->io_uring_cmd callbacks, and map it to a uring socket function, which handles the SOCKET_URING_OP accordingly, and calls socket ioctls. This patches was tested by creating a new test case in liburing. Link: https://github.com/leitao/liburing/tree/io_uring_cmd Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 10:46:15 -06:00
Remi Pommarel	421d467dc2	batman-adv: Fix batadv_v_ogm_aggr_send memory leak When batadv_v_ogm_aggr_send is called for an inactive interface, the skb is silently dropped by batadv_v_ogm_send_to_if() but never freed causing the following memory leak: unreferenced object 0xffff00000c164800 (size 512): comm "kworker/u8:1", pid 2648, jiffies 4295122303 (age 97.656s) hex dump (first 32 bytes): 00 80 af 09 00 00 ff ff e1 09 00 00 75 01 60 83 ............u.`. 1f 00 00 00 b8 00 00 00 15 00 05 00 da e3 d3 64 ...............d backtrace: [<0000000007ad20f6>] __kmalloc_track_caller+0x1a8/0x310 [<00000000d1029e55>] kmalloc_reserve.constprop.0+0x70/0x13c [<000000008b9d4183>] __alloc_skb+0xec/0x1fc [<00000000c7af5051>] __netdev_alloc_skb+0x48/0x23c [<00000000642ee5f5>] batadv_v_ogm_aggr_send+0x50/0x36c [<0000000088660bd7>] batadv_v_ogm_aggr_work+0x24/0x40 [<0000000042fc2606>] process_one_work+0x3b0/0x610 [<000000002f2a0b1c>] worker_thread+0xa0/0x690 [<0000000059fae5d4>] kthread+0x1fc/0x210 [<000000000c587d3a>] ret_from_fork+0x10/0x20 Free the skb in that case to fix this leak. Cc: stable@vger.kernel.org Fixes: `0da0035942` ("batman-adv: OGMv2 - add basic infrastructure") Signed-off-by: Remi Pommarel <repk@triplefau.lt> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-09 17:33:03 +02:00
Keith Yeo	6311071a05	wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the number of MBSSID elements in the nested netlink attribute attrs, which can lead to an integer overflow if a user of the nl80211 interface specifies 256 or more elements in the corresponding attribute in userspace. The integer overflow can lead to a heap buffer overflow as num_elems determines the size of the trailing array in elems, and this array is thereafter written to for each element in attrs. Note that this vulnerability only affects devices with the wiphy->mbssid_max_interfaces member set for the wireless physical device struct in the device driver, and can only be triggered by a process with CAP_NET_ADMIN capabilities. Fix this by checking for a maximum of 255 elements in attrs. Cc: stable@vger.kernel.org Fixes: `dc1e3cb8da` ("nl80211: MBSSID and EMA support in AP mode") Signed-off-by: Keith Yeo <keithyjy@gmail.com> Link: https://lore.kernel.org/r/20230731034719.77206-1-keithyjy@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2023-08-09 14:43:35 +02:00
Florian Westphal	24138933b9	netfilter: nf_tables: don't skip expired elements during walk There is an asymmetry between commit/abort and preparation phase if the following conditions are met: 1. set is a verdict map ("1.2.3.4 : jump foo") 2. timeouts are enabled In this case, following sequence is problematic: 1. element E in set S refers to chain C 2. userspace requests removal of set S 3. kernel does a set walk to decrement chain->use count for all elements from preparation phase 4. kernel does another set walk to remove elements from the commit phase (or another walk to do a chain->use increment for all elements from abort phase) If E has already expired in 1), it will be ignored during list walk, so its use count won't have been changed. Then, when set is culled, ->destroy callback will zap the element via nf_tables_set_elem_destroy(), but this function is only safe for elements that have been deactivated earlier from the preparation phase: lack of earlier deactivate removes the element but leaks the chain use count, which results in a WARN splat when the chain gets removed later, plus a leak of the nft_chain structure. Update pipapo_get() not to skip expired elements, otherwise flush command reports bogus ENOENT errors. Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: `8d8540c4f5` ("netfilter: nft_set_rbtree: add timeout support") Fixes: `9d0982927e` ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2023-08-09 14:39:28 +02:00
Gerd Bayer	30c3c4a449	net/smc: Use correct buffer sizes when switching between TCP and SMC Tuning of the effective buffer size through setsockopts was working for SMC traffic only but not for TCP fall-back connections even before commit `0227f058aa` ("net/smc: Unbind r/w buffer size from clcsock and make them tunable"). That change made it apparent that TCP fall-back connections would use net.smc.[rw]mem as buffer size instead of net.ipv4_tcp_[rw]mem. Amend the code that copies attributes between the (TCP) clcsock and the SMC socket and adjust buffer sizes appropriately: - Copy over sk_userlocks so that both sockets agree on whether tuning via setsockopt is active. - When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with setsockopt. Otherwise, use the sysctl value for TCP/IPv4. - Likewise, use either values from setsockopt or from sysctl for SMC (duplicated) on successful SMC connect. In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that is taken care of by the attribute copy. Fixes: `0227f058aa` ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-09 11:20:29 +01:00
Gerd Bayer	833bac7ec3	net/smc: Fix setsockopt and sysctl to specify same buffer size again Commit `0227f058aa` ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") introduced the net.smc.rmem and net.smc.wmem sysctls to specify the size of buffers to be used for SMC type connections. This created a regression for users that specified the buffer size via setsockopt() as the effective buffer size was now doubled. Re-introduce the division by 2 in the SMC buffer create code and level this out by duplicating the net.smc.[rw]mem values used for initializing sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both methods (setsockopt or sysctl) the effective buffer size that they expect. Initialize net.smc.[rw]mem from its own constant of 64kB, respectively. Internal performance tests show that this value is a good compromise between throughput/latency and memory consumption. Also, this decouples it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the module for SMC protocol was loaded. Check that no more than INT_MAX / 2 is assigned to net.smc.[rw]mem, in order to avoid any overflow condition when that is doubled for use in sk_sndbuf or sk_rcvbuf. While at it, drop the confusing sk_buf_size variable from __smc_buf_create and name "compressed" buffer size variables more consistently. Background: Before the commit mentioned above, SMC's buffer allocator in __smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf value as initial value to search for appropriate buffers. If the search resorted to using a bigger buffer when all buffers of the specified size were busy, the duplicate of the used effective buffer size is stored back to sk_rcvbuf/sk_sndbuf. When available, buffers of exactly the size that a user had specified as input to setsockopt() were used, despite setsockopt()'s documentation in "man 7 socket" talking of a mandatory duplication: [...] SO_SNDBUF Sets or gets the maximum socket send buffer in bytes. The kernel doubles this value (to allow space for book‐ keeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/wmem_default file and the maximum allowed value is set by the /proc/sys/net/core/wmem_max file. The minimum (doubled) value for this option is 2048. [...] Fixes: `0227f058aa` ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Co-developed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-09 11:20:28 +01:00
David Rheinsberg	b6f79e826f	net/unix: use consistent error code in SO_PEERPIDFD Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA rather than ESRCH if a socket type does not support remote peer-PID queries. Currently, SO_PEERPIDFD returns ESRCH when the socket in question is not an AF_UNIX socket. This is quite unexpected, given that one would assume ESRCH means the peer process already exited and thus cannot be found. However, in that case the sockopt actually returns EINVAL (via pidfd_prepare()). This is rather inconsistent with other syscalls, which usually return ESRCH if a given PID refers to a non-existant process. This changes SO_PEERPIDFD to return ENODATA instead. This is also what SO_PEERGROUPS returns, and thus keeps a consistent behavior across sockopts. Note that this code is returned in 2 cases: First, if the socket type is not AF_UNIX, and secondly if the socket was not yet connected. In both cases ENODATA seems suitable. Signed-off-by: David Rheinsberg <david@readahead.eu> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Luca Boccassi <bluca@debian.org> Fixes: `7b26952a91` ("net: core: add getsockopt SO_PEERPIDFD") Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-08 15:56:48 -07:00
Hannes Reinecke	ba4a734e1a	net/tls: avoid TCP window full during ->read_sock() When flushing the backlog after decoding a record we don't really know how much data the caller want us to evaluate, so use INT_MAX and 0 as arguments to tls_read_flush_backlog() to ensure we flush at 128k of data. Otherwise we might be reading too much data and trigger a TCP window full. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20230807071022.10091-1-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-08 15:53:49 -07:00
Ziyang Xuan	794529c448	ipv6: exthdrs: Replace opencoded swap() implementation Get a coccinelle warning as follows: net/ipv6/exthdrs.c:800:29-30: WARNING opportunity for swap() Use swap() to replace opencoded implementation. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230807020947.1991716-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-08 15:36:47 -07:00
xu xin	c67180efc5	net/ipv4: return the real errno instead of -EINVAL For now, No matter what error pointer ip_neigh_for_gw() returns, ip_finish_output2() always return -EINVAL, which may mislead the upper users. For exemple, an application uses sendto to send an UDP packet, but when the neighbor table overflows, sendto() will get a value of -EINVAL, and it will cause users to waste a lot of time checking parameters for errors. Return the real errno instead of -EINVAL. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Si Hao <si.hao@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20230807015408.248237-1-xu.xin16@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-08 15:35:51 -07:00
Maciej Żenczykowski	1d85594fd3	netfilter: nfnetlink_log: always add a timestamp Compared to all the other work we're already doing to deliver an skb to userspace this is very cheap - at worse an extra call to ktime_get_real() - and very useful. (and indeed it may even be cheaper if we're running from other hooks) (background: Android occasionally logs packets which caused wake from sleep/suspend and we'd like to have timestamps reliably associated with these events) Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-08-08 13:03:36 +02:00
Andrew Kanner	d14eea09ed	net: core: remove unnecessary frame_sz check in bpf_xdp_adjust_tail() Syzkaller reported the following issue: ======================================= Too BIG xdp->frame_sz = 131072 WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121 ____bpf_xdp_adjust_tail net/core/filter.c:4121 [inline] WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121 bpf_xdp_adjust_tail+0x466/0xa10 net/core/filter.c:4103 ... Call Trace: <TASK> bpf_prog_4add87e5301a4105+0x1a/0x1c __bpf_prog_run include/linux/filter.h:600 [inline] bpf_prog_run_xdp include/linux/filter.h:775 [inline] bpf_prog_run_generic_xdp+0x57e/0x11e0 net/core/dev.c:4721 netif_receive_generic_xdp net/core/dev.c:4807 [inline] do_xdp_generic+0x35c/0x770 net/core/dev.c:4866 tun_get_user+0x2340/0x3ca0 drivers/net/tun.c:1919 tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2043 call_write_iter include/linux/fs.h:1871 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x650/0xe40 fs/read_write.c:584 ksys_write+0x12f/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd xdp->frame_sz > PAGE_SIZE check was introduced in commit `c8741e2bfe` ("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper Dangaard Brouer <jbrouer@redhat.com> noted that after introducing the xdp_init_buff() which all XDP driver use - it's safe to remove this check. The original intend was to catch cases where XDP drivers have not been updated to use xdp.frame_sz, but that is not longer a concern (since xdp_init_buff). Running the initial syzkaller repro it was discovered that the contiguous physical memory allocation is used for both xdp paths in tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can work on higher order pages, as long as this is contiguous physical memory (e.g. a page). Reported-and-tested-by: syzbot+f817490f5bd20541b90a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000774b9205f1d8a80d@google.com/T/ Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a Link: https://lore.kernel.org/all/20230725155403.796-1-andrew.kanner@gmail.com/T/ Fixes: `43b5169d83` ("net, xdp: Introduce xdp_init_buff utility routine") Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20230803190316.2380231-1-andrew.kanner@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-07 19:14:41 -07:00
Alexander Lobakin	4a36d0180c	net: skbuff: always try to recycle PP pages directly when in softirq Commit `8c48eea3ad` ("page_pool: allow caching from safely localized NAPI") allowed direct recycling of skb pages to their PP for some cases, but unfortunately missed a couple of other majors. For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(), which unconditionally passes `false` as @napi_safe. Thus, all pages go through ptr_ring and locks, although most of time we're actually inside the NAPI polling this PP is linked with, so that it would be perfectly safe to recycle pages directly. Let's address such. If @napi_safe is true, we're fine, don't change anything for this path. But if it's false, check whether we are in the softirq context. It will most likely be so and then if ->list_owner is our current CPU, we're good to use direct recycling, even though @napi_safe is false -- concurrent access is excluded. in_softirq() protection is needed mostly due to we can hit this place in the process context (not the hardirq though). For the mentioned xdp-drop-skb-mode case, the improvement I got is 3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and alloc_slow counter doesn't change most of time, which means the MM layer is not even called to allocate any new pages. Suggested-by: Jakub Kicinski <kuba@kernel.org> # in_softirq() Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-7-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-07 13:05:53 -07:00
Jakub Kicinski	ff4e538c8c	page_pool: add a lockdep check for recycling in hardirq Page pool use in hardirq is prohibited, add debug checks to catch misuses. IIRC we previously discussed using DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns that people will have DEBUG_NET enabled in perf testing. I don't think anyone enables lockdep in perf testing, so use lockdep to avoid pushback and arguing :) Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-6-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-07 13:05:53 -07:00
Alexander Lobakin	5b899c33b3	net: skbuff: avoid accessing page_pool if !napi_safe when returning page Currently, pp->p.napi is always read, but the actual variable it gets assigned to is read-only when @napi_safe is true. For the !napi_safe cases, which yet is still a pack, it's an unneeded operation. Moreover, it can lead to premature or even redundant page_pool cacheline access. For example, when page_pool_is_last_frag() returns false (with the recent frag improvements). Thus, read it only when @napi_safe is true. This also allows moving @napi inside the condition block itself. Constify it while we are here, because why not. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-5-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-07 13:05:53 -07:00
Alexander Lobakin	75eaf63ea7	net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h> Currently, touching <net/page_pool/types.h> triggers a rebuild of more than half of the kernel. That's because it's included in <linux/skbuff.h>. And each new include to page_pool/types.h adds more [useless] data for the toolchain to process per each source file from that pile. In commit `6a5bcd84e8` ("page_pool: Allow drivers to hint on SKB recycling"), Matteo included it to be able to call a couple of functions defined there. Then, in commit `57f05bc2ab` ("page_pool: keep pp info as long as page pool owns the page") one of the calls was removed, so only one was left. It's the call to page_pool_return_skb_page() in napi_frag_unref(). The function is external and doesn't have any dependencies. Having very niche page_pool_types.h included only for that looks like an overkill. As %PP_SIGNATURE is not local to page_pool.c (was only in the early submissions), nothing holds this function there. Teleport page_pool_return_skb_page() to skbuff.c, just next to the main consumer, skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't work with skbs at all and the former name tells nothing. The #if guards here are only to not compile and have it in the vmlinux when not needed -- both call sites are already guarded. Now, touching page_pool_types.h only triggers rebuilding of the drivers using it and a couple of core networking files. Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-07 13:05:53 -07:00
Yunsheng Lin	a9ca9f9cef	page_pool: split types and declarations from page_pool.h Split types and pure function declarations from page_pool.h and add them in page_page/types.h, so that C sources can include page_pool.h and headers should generally only include page_pool/types.h as suggested by jakub. Rename page_pool.h to page_pool/helpers.h to have both in one place. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com [Jakub: change microsoft/mana, fix kdoc paths in Documentation] Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-07 13:05:19 -07:00
Yue Haibing	f6ecb68b38	net/tls: Remove unused function declarations Commit `3c4d755915` ("tls: kernel TLS support") declared but never implemented these functions. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-07 08:53:54 +01:00
Vladimir Oltean	c35e927cbe	net: omit ndo_hwtstamp_get() call when possible in dev_set_hwtstamp_phylib() Setting dev->priv_flags & IFF_SEE_ALL_HWTSTAMP_REQUESTS is only legal for drivers which were converted to ndo_hwtstamp_get() and ndo_hwtstamp_set(), and it is only there that we call ndo_hwtstamp_set() for a request that otherwise goes to phylib (for stuff like packet traps, which need to be undone if phylib failed, hence the old_cfg logic). The problem is that we end up calling ndo_hwtstamp_get() when we don't need to (even if the SIOCSHWTSTAMP wasn't intended for phylib, or if it was, but the driver didn't set IFF_SEE_ALL_HWTSTAMP_REQUESTS). For those unnecessary conditions, we share a code path with virtual drivers (vlan, macvlan, bonding) where ndo_hwtstamp_get() is implemented as generic_hwtstamp_get_lower(), and may be resolved through generic_hwtstamp_ioctl_lower() if the lower device is unconverted. I.e. this situation: $ ip link add link eno0 name eno0.100 type vlan id 100 $ hwstamp_ctl -i eno0.100 -t 1 We are unprepared to deal with this, because if ndo_hwtstamp_get() is resolved through a legacy ndo_eth_ioctl(SIOCGHWTSTAMP) lower_dev implementation, that needs a non-NULL old_cfg.ifr pointer, and we don't have it. But we don't even need to deal with it either. In the general case, drivers may not even implement SIOCGHWTSTAMP handling, only SIOCSHWTSTAMP, so it makes sense to completely avoid a SIOCGHWTSTAMP call if we can. The solution is to split the single "if" condition into 3 smaller ones, thus separating the decision to call ndo_hwtstamp_get() from the decision to call ndo_hwtstamp_set(). The third "if" condition is identical to the first one, and both are subsets of the second one. Thus, the "cfg" argument of kernel_hwtstamp_config_changed() is always valid. Reported-by: Eric Dumazet <edumazet@google.com> Closes: https://lore.kernel.org/netdev/CANn89iLOspJsvjPj+y8jikg7erXDomWe8sqHMdfL_2LQSFrPAg@mail.gmail.com/ Fixes: `fd770e856e` ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 13:25:10 +01:00
Jakub Kicinski	6b47808f22	net: tls: avoid discarding data on record close TLS records end with a 16B tag. For TLS device offload we only need to make space for this tag in the stream, the device will generate and replace it with the actual calculated tag. Long time ago the code would just re-reference the head frag which mostly worked but was suboptimal because it prevented TCP from combining the record into a single skb frag. I'm not sure if it was correct as the first frag may be shorter than the tag. The commit under fixes tried to replace that with using the page frag and if the allocation failed rolling back the data, if record was long enough. It achieves better fragment coalescing but is also buggy. We don't roll back the iterator, so unless we're at the end of send we'll skip the data we designated as tag and start the next record as if the rollback never happened. There's also the possibility that the record was constructed with MSG_MORE and the data came from a different syscall and we already told the user space that we "got it". Allocate a single dummy page and use it as fallback. Found by code inspection, and proven by forcing allocation failures. Fixes: `e7b159a48b` ("net/tls: remove the record tail optimization") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:32:18 +01:00
Eric Dumazet	6e97ba552b	tcp: set TCP_DEFER_ACCEPT locklessly rskq_defer_accept field can be read/written without the need of holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:24:56 +01:00
Eric Dumazet	a81722ddd7	tcp: set TCP_LINGER2 locklessly tp->linger2 can be set locklessly as long as readers use READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:24:55 +01:00
Eric Dumazet	84485080cb	tcp: set TCP_KEEPCNT locklessly tp->keepalive_probes can be set locklessly, readers are already taking care of this field being potentially set by other threads. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:24:55 +01:00
Eric Dumazet	6fd70a6b4e	tcp: set TCP_KEEPINTVL locklessly tp->keepalive_intvl can be set locklessly, readers are already taking care of this field being potentially set by other threads. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:24:55 +01:00
Eric Dumazet	d58f2e15aa	tcp: set TCP_USER_TIMEOUT locklessly icsk->icsk_user_timeout can be set locklessly, if all read sides use READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:24:55 +01:00
Eric Dumazet	d44fd4a767	tcp: set TCP_SYNCNT locklessly icsk->icsk_syn_retries can safely be set without locking the socket. We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer() and tcp_write_timeout(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 08:24:55 +01:00
Remi Pommarel	d25ddb7e78	batman-adv: Fix TT global entry leak when client roamed back When a client roamed back to a node before it got time to destroy the pending local entry (i.e. within the same originator interval) the old global one is directly removed from hash table and left as such. But because this entry had an extra reference taken at lookup (i.e using batadv_tt_global_hash_find) there is no way its memory will be reclaimed at any time causing the following memory leak: unreferenced object 0xffff0000073c8000 (size 18560): comm "softirq", pid 0, jiffies 4294907738 (age 228.644s) hex dump (first 32 bytes): 06 31 ac 12 c7 7a 05 00 01 00 00 00 00 00 00 00 .1...z.......... 2c ad be 08 00 80 ff ff 6c b6 be 08 00 80 ff ff ,.......l....... backtrace: [<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300 [<000000000ff2fdbc>] batadv_tt_global_add+0x700/0xe20 [<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790 [<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110 [<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10 [<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0 [<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4 [<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0 [<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90 [<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74 [<000000000f39a009>] __netif_receive_skb+0x48/0xe0 [<00000000f2cd8888>] process_backlog+0x174/0x344 [<00000000507d6564>] __napi_poll+0x58/0x1f4 [<00000000b64ef9eb>] net_rx_action+0x504/0x590 [<00000000056fa5e4>] _stext+0x1b8/0x418 [<00000000878879d6>] run_ksoftirqd+0x74/0xa4 unreferenced object 0xffff00000bae1a80 (size 56): comm "softirq", pid 0, jiffies 4294910888 (age 216.092s) hex dump (first 32 bytes): 00 78 b1 0b 00 00 ff ff 0d 50 00 00 00 00 00 00 .x.......P...... 00 00 00 00 00 00 00 00 50 c8 3c 07 00 00 ff ff ........P.<..... backtrace: [<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300 [<00000000d9aaa49e>] batadv_tt_global_add+0x53c/0xe20 [<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790 [<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110 [<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10 [<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0 [<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4 [<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0 [<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90 [<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74 [<000000000f39a009>] __netif_receive_skb+0x48/0xe0 [<00000000f2cd8888>] process_backlog+0x174/0x344 [<00000000507d6564>] __napi_poll+0x58/0x1f4 [<00000000b64ef9eb>] net_rx_action+0x504/0x590 [<00000000056fa5e4>] _stext+0x1b8/0x418 [<00000000878879d6>] run_ksoftirqd+0x74/0xa4 Releasing the extra reference from batadv_tt_global_hash_find even at roam back when batadv_tt_global_free is called fixes this memory leak. Cc: stable@vger.kernel.org Fixes: `068ee6e204` ("batman-adv: roaming handling mechanism redesign") Signed-off-by: Remi Pommarel <repk@triplefau.lt> Signed-off-by; Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-05 08:02:01 +02:00
Kuniyuki Iwashima	b205153689	tcp: Update stale comment for MD5 in tcp_parse_options(). Since commit `9ea88a1530` ("tcp: md5: check md5 signature without socket lock"), the MD5 option is checked in tcp_v[46]_rcv(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230803224552.69398-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:28:36 -07:00
Kuniyuki Iwashima	d0f2b7a9ca	tcp: Disable header prediction for MD5 flow. TCP socket saves the minimum required header length in tcp_header_len of struct tcp_sock, and later the value is used in __tcp_fast_path_on() to generate a part of TCP header in tcp_sock(sk)->pred_flags. In tcp_rcv_established(), if the incoming packet has the same pattern with pred_flags, we enter the fast path and skip full option parsing. The MD5 option is parsed in tcp_v[46]_rcv(), so we need not parse it again later in tcp_rcv_established() unless other options exist. We add TCPOLEN_MD5SIG_ALIGNED to tcp_header_len in two paths to avoid the slow path. For passive open connections with MD5, we add TCPOLEN_MD5SIG_ALIGNED to tcp_header_len in tcp_create_openreq_child() after 3WHS. On the other hand, we do it in tcp_connect_init() for active open connections. However, the value is overwritten while processing SYN+ACK or crossed SYN in tcp_rcv_synsent_state_process(). These two cases will have the wrong value in pred_flags and never go into the fast path. We could update tcp_header_len in tcp_rcv_synsent_state_process(), but a test with slightly modified netperf which uses MD5 for each flow shows that the slow path is actually a bit faster than the fast path. On c5.4xlarge EC2 instance (16 vCPU, 32 GiB mem) $ for i in {1..10}; do ./super_netperf $(nproc) -H localhost -l 10 -- -m 256 -M 256; done Avg of 10 * `36e68eadd3` : 10.376 Gbps * all fast path : 10.374 Gbps (patch v2, See Link) * all slow path : 10.394 Gbps The header prediction is not worth adding complexity for MD5, so let's disable it for MD5. Link: https://lore.kernel.org/netdev/20230803042214.38309-1-kuniyu@amazon.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230803224552.69398-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:28:36 -07:00
Eric Dumazet	a47e598fbd	dccp: fix data-race around dp->dccps_mss_cache dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket. Same thing in do_dccp_getsockopt(). Add READ_ONCE()/WRITE_ONCE() annotations, and change dccp_sendmsg() to check again dccps_mss_cache after socket is locked. Fixes: `7c657876b6` ("[DCCP]: Initial implementation") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230803163021.2958262-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:27:58 -07:00
Paolo Abeni	511b90e392	mptcp: fix disconnect vs accept race Despite commit `0ad529d9fd` ("mptcp: fix possible divide by zero in recvmsg()"), the mptcp protocol is still prone to a race between disconnect() (or shutdown) and accept. The root cause is that the mentioned commit checks the msk-level flag, but mptcp_stream_accept() does acquire the msk-level lock, as it can rely directly on the first subflow lock. As reported by Christoph than can lead to a race where an msk socket is accepted after that mptcp_subflow_queue_clean() releases the listener socket lock and just before it takes destructive actions leading to the following splat: BUG: kernel NULL pointer dereference, address: 0000000000000012 PGD 5a4ca067 P4D 5a4ca067 PUD 37d4c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 2 PID: 10955 Comm: syz-executor.5 Not tainted 6.5.0-rc1-gdc7b257ee5dd #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:mptcp_stream_accept+0x1ee/0x2f0 include/net/inet_sock.h:330 Code: 0a 09 00 48 8b 1b 4c 39 e3 74 07 e8 bc 7c 7f fe eb a1 e8 b5 7c 7f fe 4c 8b 6c 24 08 eb 05 e8 a9 7c 7f fe 49 8b 85 d8 09 00 00 <0f> b6 40 12 88 44 24 07 0f b6 6c 24 07 bf 07 00 00 00 89 ee e8 89 RSP: 0018:ffffc90000d07dc0 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888037e8d020 RCX: ffff88803b093300 RDX: 0000000000000000 RSI: ffffffff833822c5 RDI: ffffffff8333896a RBP: 0000607f82031520 R08: ffff88803b093300 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000003e83 R12: ffff888037e8d020 R13: ffff888037e8c680 R14: ffff888009af7900 R15: ffff888009af6880 FS: 00007fc26d708640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000012 CR3: 0000000066bc5001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> do_accept+0x1ae/0x260 net/socket.c:1872 __sys_accept4+0x9b/0x110 net/socket.c:1913 __do_sys_accept4 net/socket.c:1954 [inline] __se_sys_accept4 net/socket.c:1951 [inline] __x64_sys_accept4+0x20/0x30 net/socket.c:1951 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Address the issue by temporary removing the pending request socket from the accept queue, so that racing accept() can't touch them. After depleting the msk - the ssk still exists, as plain TCP sockets, re-insert them into the accept queue, so that later inet_csk_listen_stop() will complete the tcp socket disposal. Fixes: `2a6a870e44` ("mptcp: stops worker on unaccepted sockets at listener close") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/423 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-4-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:26:27 -07:00
Paolo Abeni	ff18f9ef30	mptcp: avoid bogus reset on fallback close Since the blamed commit, the MPTCP protocol unconditionally sends TCP resets on all the subflows on disconnect(). That fits full-blown MPTCP sockets - to implement the fastclose mechanism - but causes unexpected corruption of the data stream, caught as sporadic self-tests failures. Fixes: `d21f834855` ("mptcp: use fastclose on more edge scenarios") Cc: stable@vger.kernel.org Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/419 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-3-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:26:27 -07:00
Florian Westphal	6a7ac3d205	tunnels: fix kasan splat when generating ipv4 pmtu error If we try to emit an icmp error in response to a nonliner skb, we get BUG: KASAN: slab-out-of-bounds in ip_compute_csum+0x134/0x220 Read of size 4 at addr ffff88811c50db00 by task iperf3/1691 CPU: 2 PID: 1691 Comm: iperf3 Not tainted 6.5.0-rc3+ #309 [..] kasan_report+0x105/0x140 ip_compute_csum+0x134/0x220 iptunnel_pmtud_build_icmp+0x554/0x1020 skb_tunnel_check_pmtu+0x513/0xb80 vxlan_xmit_one+0x139e/0x2ef0 vxlan_xmit+0x1867/0x2760 dev_hard_start_xmit+0x1ee/0x4f0 br_dev_queue_push_xmit+0x4d1/0x660 [..] ip_compute_csum() cannot deal with nonlinear skbs, so avoid it. After this change, splat is gone and iperf3 is no longer stuck. Fixes: `4cb47a8644` ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230803152653.29535-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:24:52 -07:00
Eric Dumazet	8a98961777	net/packet: annotate data-races around tp->status Another syzbot report [1] is about tp->status lockless reads from __packet_get_status() [1] BUG: KCSAN: data-race in __packet_rcv_has_room / __packet_set_status write to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 0: __packet_set_status+0x78/0xa0 net/packet/af_packet.c:407 tpacket_rcv+0x18bb/0x1a60 net/packet/af_packet.c:2483 deliver_skb net/core/dev.c:2173 [inline] __netif_receive_skb_core+0x408/0x1e80 net/core/dev.c:5337 __netif_receive_skb_one_core net/core/dev.c:5491 [inline] __netif_receive_skb+0x57/0x1b0 net/core/dev.c:5607 process_backlog+0x21f/0x380 net/core/dev.c:5935 __napi_poll+0x60/0x3b0 net/core/dev.c:6498 napi_poll net/core/dev.c:6565 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6698 __do_softirq+0xc1/0x265 kernel/softirq.c:571 invoke_softirq kernel/softirq.c:445 [inline] __irq_exit_rcu+0x57/0xa0 kernel/softirq.c:650 sysvec_apic_timer_interrupt+0x6d/0x80 arch/x86/kernel/apic/apic.c:1106 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645 smpboot_thread_fn+0x33c/0x4a0 kernel/smpboot.c:112 kthread+0x1d7/0x210 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 read to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 1: __packet_get_status net/packet/af_packet.c:436 [inline] packet_lookup_frame net/packet/af_packet.c:524 [inline] __tpacket_has_room net/packet/af_packet.c:1255 [inline] __packet_rcv_has_room+0x3f9/0x450 net/packet/af_packet.c:1298 tpacket_rcv+0x275/0x1a60 net/packet/af_packet.c:2285 deliver_skb net/core/dev.c:2173 [inline] dev_queue_xmit_nit+0x38a/0x5e0 net/core/dev.c:2243 xmit_one net/core/dev.c:3574 [inline] dev_hard_start_xmit+0xcf/0x3f0 net/core/dev.c:3594 __dev_queue_xmit+0xefb/0x1d10 net/core/dev.c:4244 dev_queue_xmit include/linux/netdevice.h:3088 [inline] can_send+0x4eb/0x5d0 net/can/af_can.c:276 bcm_can_tx+0x314/0x410 net/can/bcm.c:302 bcm_tx_timeout_handler+0xdb/0x260 __run_hrtimer kernel/time/hrtimer.c:1685 [inline] __hrtimer_run_queues+0x217/0x700 kernel/time/hrtimer.c:1749 hrtimer_run_softirq+0xd6/0x120 kernel/time/hrtimer.c:1766 __do_softirq+0xc1/0x265 kernel/softirq.c:571 run_ksoftirqd+0x17/0x20 kernel/softirq.c:939 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 value changed: 0x0000000000000000 -> 0x0000000020000081 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 6.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 Fixes: `69e3c75f4d` ("net: TX_RING and packet mmap") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230803145600.2937518-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 18:03:16 -07:00
Eric Dumazet	c4a6b2da4b	tcp_metrics: hash table allocation cleanup After commit `098a697b49` ("tcp_metrics: Use a single hash table for all network namespaces.") we can avoid calling tcp_net_metrics_init() for each new netns. Instead, rename tcp_net_metrics_init() to tcp_metrics_hash_alloc(), and move it to __init section. Also move tcpmhash_entries to __initdata section. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230803135417.2716879-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 15:33:39 -07:00
Xiang Yang	17ebf8a4c3	mptcp: fix the incorrect judgment for msk->cb_flags Coccicheck reports the error below: net/mptcp/protocol.c:3330:15-28: ERROR: test of a variable/field address Since the address of msk->cb_flags is used in __test_and_clear_bit, the address should not be NULL. The judgment for if (unlikely(msk->cb_flags)) will always be true, we should check the real value of msk->cb_flags here. Fixes: `65a569b03c` ("mptcp: optimize release_cb for the common case") Signed-off-by: Xiang Yang <xiangyang3@huawei.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803072438.1847500-1-xiangyang3@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 15:22:13 -07:00
Jiri Pirko	6e067d0cab	devlink: use generated split ops and remove duplicated commands from small ops Do the switch and use generated split ops for get and info_get commands. Remove those from small ops array. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-13-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:02 -07:00
Jiri Pirko	b2551b1517	devlink: include the generated netlink header Put the newly added generated header to the include list. Remove the duplicated temporary function prototypes. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-12-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:02 -07:00
Jiri Pirko	6b7c486cae	devlink: add split ops generated according to spec Improve the existing devlink spec in order to serve as a source for generation of valid devlink split ops for the existing commands. Add the generated sources. Node that the policies are narrowed down only to the attributes that are actually parsed. The dont-validate-strict parsing policy makes sure that other possibly passed garbage attributes from userspace are ignored during validation. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-11-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:01 -07:00
Jiri Pirko	8300dce542	devlink: un-static devlink_nl_pre/post_doit() To be prepared for the follow-up generated split ops addition, make the functions devlink_nl_pre_doit() and devlink_nl_post_doit() usable outside of netlink.c. Introduce temporary prototypes which are going to be removed once the generated header will be included. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-9-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:01 -07:00
Jiri Pirko	491a24872a	devlink: introduce couple of dumpit callbacks for split ops Introduce couple of dumpit callbacks for generated split ops. Have them as a thin wrapper around iteration function and allow to pass dump_one() function pointer directly without need to store in devlink_cmd structs. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-8-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:01 -07:00
Jiri Pirko	d61aedcf62	devlink: rename couple of doit netlink callbacks to match generated names The generated names of the doit netlink callback are missing "cmd" in their names. Change names to be ready to switch to generated split ops header. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-7-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:01 -07:00
Jiri Pirko	ba0f66c95f	devlink: rename devlink_nl_ops to devlink_nl_small_ops In order to avoid name collision with the generated split ops array which is going to be introduced as a follow-up patch, rename the existing ops array to devlink_nl_small_ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230803111340.1074067-6-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-04 14:03:01 -07:00
Linus Torvalds	4593f3c2c6	Two patches to improve RBD exclusive lock interaction with osd_request_timeout option and another fix to reduce the potential for erroneous blocklisting -- this time in CephFS. All going to stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTNFFUTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi5I8B/9a8C5ed0XfTadHcHX5VQsY3b//4rgp 0VYkQbjYnSCwrYRIPsvnL8LeLHzbcPGLpFAQXg7uUlmJ5dpaOz303hKmKt5GdyOR qvWka3K4zeG177b6yc1srqs0cEsCLpQrn+krnvOl5v87QdFsCP/bsJMOrJ9mlhdM 9GjkjDRn6jvNyOLGbn3kIvwCRF9NH6/nHzjBcTUzvS8fBUye02o9C1H6ZQ7sYjKH sJnmQCNCFHEqdaVjDZ7mw/doIrAbmTV6sgusuPjiF5bHILzX4oWG4UJmRpHFV//S JPQgMp2DNjP8tW9aCVLVVVV5t5AKBr84etF59DaFNflk27U3COJWkE0a =gw7n -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "Two patches to improve RBD exclusive lock interaction with osd_request_timeout option and another fix to reduce the potential for erroneous blocklisting -- this time in CephFS. All going to stable" * tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client: libceph: fix potential hang in ceph_osdc_notify() rbd: prevent busy loop when requesting exclusive lock ceph: defer stopping mdsc delayed_work	2023-08-04 11:29:38 -07:00
Jakub Kicinski	d07b7b32da	pull-request: bpf-next 2023-08-03 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP Tngj6Yid60o39/IZXXblhV37HfSiyQ8= =/kVE -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2023-08-03 We've added 54 non-merge commits during the last 10 day(s) which contain a total of 84 files changed, 4026 insertions(+), 562 deletions(-). The main changes are: 1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer, Daniel Borkmann 2) Support new insns from cpu v4 from Yonghong Song 3) Non-atomically allocate freelist during prefill from YiFei Zhu 4) Support defragmenting IPv(4\|6) packets in BPF from Daniel Xu 5) Add tracepoint to xdp attaching failure from Leon Hwang 6) struct netdev_rx_queue and xdp.h reshuffling to reduce rebuild time from Jakub Kicinski * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits) net: invert the netdevice.h vs xdp.h dependency net: move struct netdev_rx_queue out of netdevice.h eth: add missing xdp.h includes in drivers selftests/bpf: Add testcase for xdp attaching failure tracepoint bpf, xdp: Add tracepoint to xdp attaching failure selftests/bpf: fix static assert compilation issue for test_cls_*.c bpf: fix bpf_probe_read_kernel prototype mismatch riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework libbpf: fix typos in Makefile tracing: bpf: use struct trace_entry in struct syscall_tp_t bpf, devmap: Remove unused dtab field from bpf_dtab_netdev bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry netfilter: bpf: Only define get_proto_defrag_hook() if necessary bpf: Fix an array-index-out-of-bounds issue in disasm.c net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn docs/bpf: Fix malformed documentation bpf: selftests: Add defrag selftests bpf: selftests: Support custom type and proto for client sockets bpf: selftests: Support not connecting client socket netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link ... ==================== Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 15:34:36 -07:00
Jakub Kicinski	35b1b1fd96	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: net/dsa/port.c `9945c1fb03` ("net: dsa: fix older DSA drivers using phylink") `a88dd75384` ("net: dsa: remove legacy_pre_march2020 detection") https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/ net/xdp/xsk.c `3c5b4d69c3` ("net: annotate data-races around sk->sk_mark") `b7f72a30e9` ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path") https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/ drivers/net/ethernet/broadcom/bnxt/bnxt.c `37b61cda9c` ("bnxt: don't handle XDP in netpoll") `2b56b3d992` ("eth: bnxt: handle invalid Tx completions more gracefully") https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c `62da08331f` ("net/mlx5e: Set proper IPsec source port in L4 selector") `fbd517549c` ("net/mlx5e: Add function to get IPsec offload namespace") drivers/net/ethernet/sfc/selftest.c `55c1528f9b` ("sfc: fix field-spanning memcpy in selftest") `ae9d445cd4` ("sfc: Miscellaneous comment removals") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 14:34:37 -07:00
Linus Torvalds	999f663186	Including fixes from bpf and wireless. Nothing scary here. Feels like the first wave of regressions from v6.5 is addressed - one outstanding fix still to come in TLS for the sendpage rework. Current release - regressions: - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES - dsa: fix older DSA drivers using phylink Previous releases - regressions: - gro: fix misuse of CB in udp socket lookup - mlx5: unregister devlink params in case interface is down - Revert "wifi: ath11k: Enable threaded NAPI" Previous releases - always broken: - sched: cls_u32: fix match key mis-addressing - sched: bind logic fixes for cls_fw, cls_u32 and cls_route - add bound checks to a number of places which hand-parse netlink - bpf: disable preemption in perf_event_output helpers code - qed: fix scheduling in a tasklet while getting stats - avoid using APIs which are not hardirq-safe in couple of drivers, when we may be in a hard IRQ (netconsole) - wifi: cfg80211: fix return value in scan logic, avoid page allocator warning - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D (DBDC) Misc: - drop handful of inactive maintainers, put some new in place Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTMCRwACgkQMUZtbf5S Irv1tRAArN6rfYrr2ulaTOfMqhWb1Q+kAs00nBCKqC+OdWgT0hqw2QAuqTAVjhje 8HBYlNGyhJ10yp0Q5y4Fp9CsBDHDDNjIp/YGEbr0vC/9mUDOhYD8WV07SmZmzEJu gmt4LeFPTk07yZy7VxMLY5XKuwce6MWGHArehZE7PSa9+07yY2Ov9X02ntr9hSdH ih+VdDI12aTVSj208qb0qNb2JkefFHW9dntVxce4/mtYJE9+47KMR2aXDXtCh0C6 ECgx0LQkdEJ5vNSYfypww0SXIG5aj7sE6HMTdJkjKH7ws4xrW8H+P9co77Hb/DTH TsRBS4SgB20hFNxz3OQwVmAvj+2qfQssL7SeIkRnaEWeTBuVqCwjLdoIzKXJxxq+ cvtUAAM8XUPqec5cPiHPkeAJV6aJhrdUdMjjbCI9uFYU32AWFBQEqvVGP9xdhXHK QIpTLiy26Vw8PwiJdROuGiZJCXePqQRLDuMX1L43ZO1rwIrZcWGHjCNtsR9nXKgQ apbbxb2/rq2FBMB+6obKeHzWDy3JraNCsUspmfleqdjQ2mpbRokd4Vw2564FJgaC 5OznPIX6OuoCY5sftLUcRcpH5ncNj01BvyqjWyCIfJdkCqCUL7HSAgxfm5AUnZip ZIXOzZnZ6uTUQFptXdjey/jNEQ6qpV8RmwY0CMsmJoo88DXI34Y= =HYkl -----END PGP SIGNATURE----- Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and wireless. Nothing scary here. Feels like the first wave of regressions from v6.5 is addressed - one outstanding fix still to come in TLS for the sendpage rework. Current release - regressions: - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES - dsa: fix older DSA drivers using phylink Previous releases - regressions: - gro: fix misuse of CB in udp socket lookup - mlx5: unregister devlink params in case interface is down - Revert "wifi: ath11k: Enable threaded NAPI" Previous releases - always broken: - sched: cls_u32: fix match key mis-addressing - sched: bind logic fixes for cls_fw, cls_u32 and cls_route - add bound checks to a number of places which hand-parse netlink - bpf: disable preemption in perf_event_output helpers code - qed: fix scheduling in a tasklet while getting stats - avoid using APIs which are not hardirq-safe in couple of drivers, when we may be in a hard IRQ (netconsole) - wifi: cfg80211: fix return value in scan logic, avoid page allocator warning - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D (DBDC) Misc: - drop handful of inactive maintainers, put some new in place" * tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits) MAINTAINERS: update TUN/TAP maintainers test/vsock: remove vsock_perf executable on `make clean` tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen tcp_metrics: annotate data-races around tm->tcpm_net tcp_metrics: annotate data-races around tm->tcpm_vals[] tcp_metrics: annotate data-races around tm->tcpm_lock tcp_metrics: annotate data-races around tm->tcpm_stamp tcp_metrics: fix addr_same() helper prestera: fix fallback to previous version on same major version udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES net/mlx5e: Set proper IPsec source port in L4 selector net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio net/mlx5: fs_core: Make find_closest_ft more generic wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1() vxlan: Fix nexthop hash size ip6mr: Fix skb_under_panic in ip6mr_cache_report() s390/qeth: Don't call dev_close/dev_open (DOWN/UP) net: tap_open(): set sk_uid from current_fsuid() net: tun_chr_open(): set sk_uid from current_fsuid() net: dcb: choose correct policy to parse DCB_ATTR_BCN ...	2023-08-03 14:00:02 -07:00
Sven Eckelmann	112cbcb4af	batman-adv: Check hardif MTU against runtime MTU If the MTU of the soft/mesh interface was already reduced (enough), it is not necessary to print a warning about a hard interface not having a MTU to transport ethernet payloads of 1500 bytes. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-03 21:11:42 +02:00
Sven Eckelmann	e4b8178045	batman-adv: Avoid magic value for minimum MTU The header linux/if_ether.h already defines a constant for the minimum MTU. So simply use it instead of having a magic constant in the code. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-03 21:11:42 +02:00
YueHaibing	bbfb428a0c	batman-adv: Remove unused declarations Since commit `335fbe0f5d` ("batman-adv: tvlv - convert tt query packet to use tvlv unicast packets") batadv_recv_tt_query() is not used. And commit `122edaa059` ("batman-adv: tvlv - convert roaming adv packet to use tvlv unicast packets") left behind batadv_recv_roam_adv(). Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-03 21:11:42 +02:00
Simon Wunderlich	2744cefe03	batman-adv: Start new development cycle This version will contain all the (major or even only minor) changes for Linux 6.6. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-08-03 21:11:42 +02:00
Jakub Kicinski	3932f22723	pull-request: bpf 2023-08-03 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW 48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf 61d6hoiXj/sIibgMQT/ihODgeJ4wfQE= =u7qn -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Martin KaFai Lau says: ==================== pull-request: bpf 2023-08-03 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 3 files changed, 37 insertions(+), 20 deletions(-). The main changes are: 1) Disable preemption in perf_event_output helpers code, from Jiri Olsa 2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing, from Lin Ma 3) Multiple warning splat fixes in cpumap from Hou Tao * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, cpumap: Handle skb as well when clean up ptr_ring bpf, cpumap: Make sure kthread is running before map update returns bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing bpf: Disable preemption in bpf_event_output bpf: Disable preemption in bpf_perf_event_output ==================== Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 11:22:53 -07:00
Jakub Kicinski	0d48a84b31	wireless fixes for v6.5 We did some house cleaning in MAINTAINERS file so several patches about that. Few regressions fixed and also fix some recently enabled memcpy() warnings. Only small commits and nothing special standing out. -----BEGIN PGP SIGNATURE----- iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmTLsrcRHGt2YWxvQGtl cm5lbC5vcmcACgkQbhckVSbrbZtn6gf/ZsEOZl98ZVbCoFB09t5/M2IgRdWzbv8C vXyVoacrRaq80rzFQwGZqorEsnEdDXOIJI54VIqnT5avZbIIWIia4mFzBkHwPBef TXcdL2k1KDd+ktPrw3GK8401iEMnWSHs2a/4ztx3x8CFCB47VhGT9DiaIWh6jg1J FUvDhUK7BAk0dItgVjioL+0XKJ5vo4VLENiOCAVj4QJgShKIaq72j/WhKiI/W/+Q 8TBBUjydu0nx7MOM0tOcQlI0z6HXOB89RHj4GxOMA/wvEf+7PHhOE67RAgSAMHJM R9TmeVvdub05Yppv33PUbbvK29McZEI+M+lHMZjLy5AYaXxyYJ+nhw== =4o1a -----END PGP SIGNATURE----- Merge tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Kalle Valo says: ==================== wireless fixes for v6.5 We did some house cleaning in MAINTAINERS file so several patches about that. Few regressions fixed and also fix some recently enabled memcpy() warnings. Only small commits and nothing special standing out. * tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1() wifi: ray_cs: Replace 1-element array with flexible array MAINTAINERS: add Jeff as ath10k, ath11k and ath12k maintainer MAINTAINERS: wifi: mark mlw8k as orphan MAINTAINERS: wifi: mark b43 as orphan MAINTAINERS: wifi: mark zd1211rw as orphan MAINTAINERS: wifi: mark wl3501 as orphan MAINTAINERS: wifi: mark rndis_wlan as orphan MAINTAINERS: wifi: mark ar5523 as orphan MAINTAINERS: wifi: mark cw1200 as orphan MAINTAINERS: wifi: atmel: mark as orphan MAINTAINERS: wifi: rtw88: change Ping as the maintainer Revert "wifi: ath6k: silence false positive -Wno-dangling-pointer warning on GCC 12" wifi: cfg80211: Fix return value in scan logic Revert "wifi: ath11k: Enable threaded NAPI" MAINTAINERS: Update mwifiex maintainer list wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC) ==================== Link: https://lore.kernel.org/r/20230803140058.57476C433C9@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 11:05:46 -07:00
Eric Dumazet	ddf251fa2b	tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen Whenever tcpm_new() reclaims an old entry, tcpm_suck_dst() would overwrite data that could be read from tcp_fastopen_cache_get() or tcp_metrics_fill_info(). We need to acquire fastopen_seqlock to maintain consistency. For newly allocated objects, tcpm_new() can switch to kzalloc() to avoid an extra fastopen_seqlock acquisition. Fixes: `1fe4c481ba` ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 10:58:24 -07:00
Eric Dumazet	d5d986ce42	tcp_metrics: annotate data-races around tm->tcpm_net tm->tcpm_net can be read or written locklessly. Instead of changing write_pnet() and read_pnet() and potentially hurt performance, add the needed READ_ONCE()/WRITE_ONCE() in tm_net() and tcpm_new(). Fixes: `849e8a0ca8` ("tcp_metrics: Add a field tcpm_net and verify it matches on lookup") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 10:58:24 -07:00
Eric Dumazet	8c4d04f6b4	tcp_metrics: annotate data-races around tm->tcpm_vals[] tm->tcpm_vals[] values can be read or written locklessly. Add needed READ_ONCE()/WRITE_ONCE() to document this, and force use of tcp_metric_get() and tcp_metric_set() Fixes: `51c5d0c4b1` ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 10:58:24 -07:00
Eric Dumazet	285ce119a3	tcp_metrics: annotate data-races around tm->tcpm_lock tm->tcpm_lock can be read or written locklessly. Add needed READ_ONCE()/WRITE_ONCE() to document this. Fixes: `51c5d0c4b1` ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 10:58:24 -07:00
Eric Dumazet	949ad62a5d	tcp_metrics: annotate data-races around tm->tcpm_stamp tm->tcpm_stamp can be read or written locklessly. Add needed READ_ONCE()/WRITE_ONCE() to document this. Also constify tcpm_check_stamp() dst argument. Fixes: `51c5d0c4b1` ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 10:58:24 -07:00
Eric Dumazet	e6638094d7	tcp_metrics: fix addr_same() helper Because v4 and v6 families use separate inetpeer trees (respectively net->ipv4.peers and net->ipv6.peers), inetpeer_addr_cmp(a, b) assumes a & b share the same family. tcp_metrics use a common hash table, where entries can have different families. We must therefore make sure to not call inetpeer_addr_cmp() if the families do not match. Fixes: `d39d14ffa2` ("net: Add helper function to compare inetpeer addresses") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 10:58:24 -07:00
Jakub Kicinski	82e896d992	docs: net: page_pool: use kdoc to avoid duplicating the information All struct members of the driver-facing APIs are documented twice, in the code and under Documentation. This is a bit tedious. I also get the feeling that a lot of developers will read the header when coding, rather than the doc. Bring the two a little closer together by using kdoc for structs and functions. Using kdoc also gives us links (mentioning a function or struct in the text gets replaced by a link to its doc). Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230802161821.3621985-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 09:54:24 -07:00
Jakub Kicinski	680ee0456a	net: invert the netdevice.h vs xdp.h dependency xdp.h is far more specific and is included in only 67 other files vs netdevice.h's 1538 include sites. Make xdp.h include netdevice.h, instead of the other way around. This decreases the incremental allmodconfig builds size when xdp.h is touched from 5947 to 662 objects. Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h is a mega-header in its own right so it's nice to avoid xdp.h getting included there as well. The only unfortunate part is that the typedef for xdp_features_t has to move to netdevice.h, since its embedded in struct netdevice. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-03 08:38:07 -07:00
Jakub Kicinski	49e47a5b61	net: move struct netdev_rx_queue out of netdevice.h struct netdev_rx_queue is touched in only a few places and having it defined in netdevice.h brings in the dependency on xdp.h, because struct xdp_rxq_info gets embedded in struct netdev_rx_queue. In prep for removal of xdp.h from netdevice.h move all the netdev_rx_queue stuff to a new header. We could technically break the new header up to avoid the sysfs.h include but it's so rarely included it doesn't seem to be worth it at this point. Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-03 08:38:07 -07:00
David Howells	ce650a1663	udp6: Fix __ip6_append_data()'s handling of MSG_SPLICE_PAGES __ip6_append_data() can has a similar problem to __ip_append_data()[1] when asked to splice into a partially-built UDP message that has more than the frag-limit data and up to the MTU limit, but in the ipv6 case, it errors out with EINVAL. This can be triggered with something like: pipe(pfd); sfd = socket(AF_INET6, SOCK_DGRAM, 0); connect(sfd, ...); send(sfd, buffer, 8137, MSG_CONFIRM\|MSG_MORE); write(pfd[1], buffer, 8); splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0); where the amount of data given to send() is dependent on the MTU size (in this instance an interface with an MTU of 8192). The problem is that the calculation of the amount to copy in __ip6_append_data() goes negative in two places, but a check has been put in to give an error in this case. This happens because when pagedlen > 0 (which happens for MSG_ZEROCOPY and MSG_SPLICE_PAGES), the terms in: copy = datalen - transhdrlen - fraggap - pagedlen; then mostly cancel when pagedlen is substituted for, leaving just -fraggap. Fix this by: (1) Insert a note about the dodgy calculation of 'copy'. (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above equation, so that 'offset' isn't regressed and 'length' isn't increased, which will mean that length and thus copy should match the amount left in the iterator. (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if we're asked to splice more than is in the iterator. It might be better to not give the warning or even just give a 'short' write. (4) If MSG_SPLICE_PAGES, override the copy<0 check. [!] Note that this should also affect MSG_ZEROCOPY, but that will return -EINVAL for the range of send sizes that requires the skbuff to be split. Signed-off-by: David Howells <dhowells@redhat.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: netdev@vger.kernel.org Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ [1] Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1580952.1690961810@warthog.procyon.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-03 14:56:19 +02:00
Yue Haibing	c956910d5a	tipc: Remove unused function declarations Commit `d50ccc2d39` ("tipc: add 128-bit node identifier") declared but never implemented tipc_node_id2hash(). Also commit `5c216e1d28` ("tipc: Allow run-time alteration of default link settings") never implemented tipc_media_set_priority() and tipc_media_set_window(), commit `cad2929dc4` ("tipc: update a binding service via broadcast") only declared tipc_named_bcast(). Since commit `be07f05639` ("tipc: simplify the finalize work queue") tipc_sched_net_finalize() is removed and declaration is unused. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230802034659.39840-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-03 12:51:45 +02:00
Jiri Slaby	8a76d8b075	net: nfc: remove casts from tty->disc_data tty->disc_data is 'void *', so there is no need to cast from that. Therefore remove the casts and assign the pointer directly. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Cc: Max Staudt <max@enpas.org> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: linux-can@vger.kernel.org Cc: netdev@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Max Staudt <max@enpas.org> Link: https://lore.kernel.org/r/20230801062237.2687-3-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-03 09:51:21 +02:00
David Howells	0f71c9caf2	udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES __ip_append_data() can get into an infinite loop when asked to splice into a partially-built UDP message that has more than the frag-limit data and up to the MTU limit. Something like: pipe(pfd); sfd = socket(AF_INET, SOCK_DGRAM, 0); connect(sfd, ...); send(sfd, buffer, 8161, MSG_CONFIRM\|MSG_MORE); write(pfd[1], buffer, 8); splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0); where the amount of data given to send() is dependent on the MTU size (in this instance an interface with an MTU of 8192). The problem is that the calculation of the amount to copy in __ip_append_data() goes negative in two places, and, in the second place, this gets subtracted from the length remaining, thereby increasing it. This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and MSG_SPLICE_PAGES), because the terms in: copy = datalen - transhdrlen - fraggap - pagedlen; then mostly cancel when pagedlen is substituted for, leaving just -fraggap. This causes: length -= copy + transhdrlen; to increase the length to more than the amount of data in msg->msg_iter, which causes skb_splice_from_iter() to be unable to fill the request and it returns less than 'copied' - which means that length never gets to 0 and we never exit the loop. Fix this by: (1) Insert a note about the dodgy calculation of 'copy'. (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above equation, so that 'offset' isn't regressed and 'length' isn't increased, which will mean that length and thus copy should match the amount left in the iterator. (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if we're asked to splice more than is in the iterator. It might be better to not give the warning or even just give a 'short' write. [!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY avoids the problem by simply assuming that everything asked for got copied, not just the amount that was in the iterator. This is a potential bug for the future. Fixes: `7ac7c98785` ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES") Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1420063.1690904933@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 19:19:32 -07:00
Vladimir Oltean	fd770e856e	net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers It is desirable that the new .ndo_hwtstamp_set() API gives more uniformity, less overhead and future flexibility w.r.t. the PHY timestamping behavior. Currently there are some drivers which allow PHY timestamping through the procedure mentioned in Documentation/networking/timestamping.rst. They don't do anything locally if phy_has_hwtstamp() is set, except for lan966x which installs PTP packet traps. Centralize that behavior in a new dev_set_hwtstamp_phylib() code function, which calls either phy_mii_ioctl() for the phylib PHY, or .ndo_hwtstamp_set() of the netdev, based on a single policy (currently simplistic: phy_has_hwtstamp()). Any driver converted to .ndo_hwtstamp_set() will automatically opt into the centralized phylib timestamping policy. Unconverted drivers still get to choose whether they let the PHY handle timestamping or not. Netdev drivers with integrated PHY drivers that don't use phylib presumably don't set dev->phydev, and those will always see HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping policy will remain 100% up to them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 19:11:06 -07:00
Vladimir Oltean	70ef7d87f6	net: transfer rtnl_lock() requirement from ethtool_set_ethtool_phy_ops() to caller phy_init() and phy_exit() will have to do more stuff under rtnl_lock() in a future change. Since rtnl_unlock() -> netdev_run_todo() does a lot of stuff under the hood, it's a pity to lock and unlock the rtnetlink mutex twice in a row. Change the calling convention such that the only caller of ethtool_set_ethtool_phy_ops(), phy_device.c, provides a context where the rtnl_mutex is already acquired. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230801142824.1772134-11-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 19:11:06 -07:00
Maxim Georgiev	65c9fde15a	net: vlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set() 8021q is one of the stackable net devices which pass the hardware timestamping ops to the real device through ndo_eth_ioctl(). This prevents converting any device driver to the new hwtimestamping API without regressions. Remove that limitation in the vlan driver by using the newly introduced helpers for timestamping through lower devices, that handle both the new and the old driver API. Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230801142824.1772134-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 19:11:05 -07:00
Maxim Georgiev	e47d01fea6	net: add hwtstamping helpers for stackable net devices The stackable net devices with hwtstamping support (vlan, macvlan, bonding) only pass the hwtstamping ops to the lower (real) device. These drivers are the first that need to be converted to the new timestamping API, because if they aren't prepared to handle that, then no real device driver cannot be converted to the new API either. After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and bond_eth_ioctl() have in common, here we propose two generic implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which can be called by those 3 drivers, with "dev" being their lower device. These helpers cover both cases, when the lower driver is converted to the new API or unconverted. We need some hacks in case of an unconverted driver, namely to stuff some pointers in struct kernel_hwtstamp_config which shouldn't have been there (since the new API isn't supposed to need it). These will be removed when all drivers will have been converted to the new API. Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 19:11:05 -07:00
Maxim Georgiev	66f7223039	net: add NDOs for configuring hardware timestamping Current hardware timestamping API for NICs requires implementing .ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP. That API has some boilerplate such as request parameter translation between user and kernel address spaces, handling possible translation failures correctly, etc. Since it is the same all across the board, it would be desirable to handle it through generic code. Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which implement that boilerplate and allow drivers to just act upon requests. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 19:11:05 -07:00
Eric Dumazet	ae6db08f8b	net/packet: change packet_alloc_skb() to allow bigger paged allocations packet_alloc_skb() is currently calling sock_alloc_send_pskb() forcing order-0 page allocations. Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max size by 8x. Also add logic to increase the linear part if needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tahsin Erdogan <trdgn@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230801205254.400094-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 18:44:55 -07:00
Eric Dumazet	09c2c90705	net: allow alloc_skb_with_frags() to allocate bigger packets Refactor alloc_skb_with_frags() to allow bigger packets allocations. Instead of assuming that only order-0 allocations will be attempted, use the caller supplied max order. v2: try harder to use high-order pages, per Willem feedback. Link: https://lore.kernel.org/netdev/CANn89iJQfmc_KeUr3TeXvsLQwo3ZymyoCr7Y6AnHrkWSuz0yAg@mail.gmail.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tahsin Erdogan <trdgn@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230801205254.400094-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 18:44:55 -07:00
Leon Hwang	bf4ea1d0b2	bpf, xdp: Add tracepoint to xdp attaching failure When error happens in dev_xdp_attach(), it should have a way to tell users the error message like the netlink approach. To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is an appropriate way to notify users the error message. Hence, bpf libraries are able to retrieve the error message by this tracepoint, and then report the error message to users. Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-02 14:21:12 -07:00
Yue Haibing	e12f2a6d1b	netlabel: Remove unused declaration netlbl_cipsov4_doi_free() Since commit `b1edeb1023` ("netlabel: Replace protocol/NetLabel linking with refrerence counts") this declaration is unused and can be removed. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20230801143453.24452-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 12:28:22 -07:00
Yue Haibing	2fca1b5ef8	ila: Remove unnecessary file net/ila.h Commit `642c2c9558` ("ila: xlat changes") removed ila_xlat_outgoing() and ila_xlat_incoming() functions, then this file became unnecessary. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230801143129.40652-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-02 12:28:16 -07:00
Yue Haibing	30e0191b16	ip6mr: Fix skb_under_panic in ip6mr_cache_report() skbuff: skb_under_panic: text:ffffffff88771f69 len:56 put:-4 head:ffff88805f86a800 data:ffff887f5f86a850 tail:0x88 end:0x2c0 dev:pim6reg ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:192! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 2 PID: 22968 Comm: kworker/2:11 Not tainted 6.5.0-rc3-00044-g0a8db05b571a #236 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:skb_panic+0x152/0x1d0 Call Trace: <TASK> skb_push+0xc4/0xe0 ip6mr_cache_report+0xd69/0x19b0 reg_vif_xmit+0x406/0x690 dev_hard_start_xmit+0x17e/0x6e0 __dev_queue_xmit+0x2d6a/0x3d20 vlan_dev_hard_start_xmit+0x3ab/0x5c0 dev_hard_start_xmit+0x17e/0x6e0 __dev_queue_xmit+0x2d6a/0x3d20 neigh_connected_output+0x3ed/0x570 ip6_finish_output2+0x5b5/0x1950 ip6_finish_output+0x693/0x11c0 ip6_output+0x24b/0x880 NF_HOOK.constprop.0+0xfd/0x530 ndisc_send_skb+0x9db/0x1400 ndisc_send_rs+0x12a/0x6c0 addrconf_dad_completed+0x3c9/0xea0 addrconf_dad_work+0x849/0x1420 process_one_work+0xa22/0x16e0 worker_thread+0x679/0x10c0 ret_from_fork+0x28/0x60 ret_from_fork_asm+0x11/0x20 When setup a vlan device on dev pim6reg, DAD ns packet may sent on reg_vif_xmit(). reg_vif_xmit() ip6mr_cache_report() skb_push(skb, -skb_network_offset(pkt));//skb_network_offset(pkt) is 4 And skb_push declared as: void skb_push(struct sk_buff skb, unsigned int len); skb->data -= len; //0xffff88805f86a84c - 0xfffffffc = 0xffff887f5f86a850 skb->data is set to 0xffff887f5f86a850, which is invalid mem addr, lead to skb_push() fails. Fixes: `14fb64e1f4` ("[IPV6] MROUTE: Support PIM-SM (SSM).") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-02 10:35:21 +01:00
Ratheesh Kannoth	c8915d7329	tc: flower: Enable offload support IPSEC SPI field. This patch enables offload for TC classifier flower rules which matches against SPI field. Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-02 10:09:32 +01:00
Ratheesh Kannoth	4c13eda757	tc: flower: support for SPI tc flower rules support to classify ESP/AH packets matching SPI field. Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-02 10:09:31 +01:00
Ratheesh Kannoth	a57c34a80c	net: flow_dissector: Add IPSEC dissector Support for dissecting IPSEC field SPI (which is 32bits in size) for ESP and AH packets. Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-02 10:09:31 +01:00
Ilya Dryomov	e6e2843230	libceph: fix potential hang in ceph_osdc_notify() If the cluster becomes unavailable, ceph_osdc_notify() may hang even with osd_request_timeout option set because linger_notify_finish_wait() waits for MWatchNotify NOTIFY_COMPLETE message with no associated OSD request in flight -- it's completely asynchronous. Introduce an additional timeout, derived from the specified notify timeout. While at it, switch both waits to killable which is more correct. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com>	2023-08-02 09:07:34 +02:00
Lin Ma	31d49ba033	net: dcb: choose correct policy to parse DCB_ATTR_BCN The dcbnl_bcn_setcfg uses erroneous policy to parse tb[DCB_ATTR_BCN], which is introduced in commit `859ee3c438` ("DCB: Add support for DCB BCN"). Please see the comment in below code static int dcbnl_bcn_setcfg(...) { ... ret = nla_parse_nested_deprecated(..., dcbnl_pfc_up_nest, .. ) // !!! dcbnl_pfc_up_nest for attributes // DCB_PFC_UP_ATTR_0 to DCB_PFC_UP_ATTR_ALL in enum dcbnl_pfc_up_attrs ... for (i = DCB_BCN_ATTR_RP_0; i <= DCB_BCN_ATTR_RP_7; i++) { // !!! DCB_BCN_ATTR_RP_0 to DCB_BCN_ATTR_RP_7 in enum dcbnl_bcn_attrs ... value_byte = nla_get_u8(data[i]); ... } ... for (i = DCB_BCN_ATTR_BCNA_0; i <= DCB_BCN_ATTR_RI; i++) { // !!! DCB_BCN_ATTR_BCNA_0 to DCB_BCN_ATTR_RI in enum dcbnl_bcn_attrs ... value_int = nla_get_u32(data[i]); ... } ... } That is, the nla_parse_nested_deprecated uses dcbnl_pfc_up_nest attributes to parse nlattr defined in dcbnl_pfc_up_attrs. But the following access code fetch each nlattr as dcbnl_bcn_attrs attributes. By looking up the associated nla_policy for dcbnl_bcn_attrs. We can find the beginning part of these two policies are "same". static const struct nla_policy dcbnl_pfc_up_nest[...] = { [DCB_PFC_UP_ATTR_0] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_1] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_2] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_3] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_4] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_5] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_6] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_7] = {.type = NLA_U8}, [DCB_PFC_UP_ATTR_ALL] = {.type = NLA_FLAG}, }; static const struct nla_policy dcbnl_bcn_nest[...] = { [DCB_BCN_ATTR_RP_0] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_1] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_2] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_3] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_4] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_5] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_6] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_7] = {.type = NLA_U8}, [DCB_BCN_ATTR_RP_ALL] = {.type = NLA_FLAG}, // from here is somewhat different [DCB_BCN_ATTR_BCNA_0] = {.type = NLA_U32}, ... [DCB_BCN_ATTR_ALL] = {.type = NLA_FLAG}, }; Therefore, the current code is buggy and this nla_parse_nested_deprecated could overflow the dcbnl_pfc_up_nest and use the adjacent nla_policy to parse attributes from DCB_BCN_ATTR_BCNA_0. Hence use the correct policy dcbnl_bcn_nest to parse the nested tb[DCB_ATTR_BCN] TLV. Fixes: `859ee3c438` ("DCB: Add support for DCB BCN") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230801013248.87240-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-01 21:07:46 -07:00
Jakub Kicinski	ceaac91dcd	net: make sure we never create ifindex = 0 Instead of allocating from 1 use proper xa_init flag, to protect ourselves from IDs wrapping back to 0. Fixes: `759ab1edb5` ("net: store netdevs in an xarray") Reported-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/all/20230728162350.2a6d4979@hermes.local/ Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230731171159.988962-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-01 15:01:29 -07:00
Leon Romanovsky	f3ec2b5d87	xfrm: don't skip free of empty state in acquire policy In destruction flow, the assignment of NULL to xso->dev caused to skip of xfrm_dev_state_free() call, which was called in xfrm_state_put(to_put) routine. Instead of open-coded variant of xfrm_dev_state_delete() and xfrm_dev_state_free(), let's use them directly. Fixes: `f8a70afafc` ("xfrm: add TX datapath support for IPsec packet offload mode") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2023-08-01 12:04:43 +02:00
Leon Romanovsky	982c3aca8b	xfrm: delete offloaded policy The policy memory was released but not HW driver data. Add call to xfrm_dev_policy_delete(), so drivers will have a chance to release their resources. Fixes: `919e43fad5` ("xfrm: add an interface to offload policy") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2023-08-01 12:04:43 +02:00
Christian Marangi	de9db136dc	net: dsa: tag_qca: return early if dev is not found Currently checksum is recalculated and dsa tag stripped even if we later don't find the dev. To improve code, exit early if we don't find the dev and skip additional operation on the skb since it will be freed anyway. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20230730074113.21889-1-ansuelsmth@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 12:02:42 +02:00
Pedro Tammela	e20e75017c	net/sched: sch_qfq: warn about class in use while deleting Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 10:47:25 +02:00
Pedro Tammela	7118f56e04	net/sched: sch_htb: warn about class in use while deleting Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 10:47:24 +02:00
Pedro Tammela	8e4553ef3e	net/sched: sch_hfsc: warn about class in use while deleting Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 10:47:24 +02:00
Pedro Tammela	daf8d9181b	net/sched: sch_drr: warn about class in use while deleting Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 10:47:24 +02:00
Pedro Tammela	8798481b66	net/sched: wrap open coded Qdics class filter counter The 'filter_cnt' counter is used to control a Qdisc class lifetime. Each filter referecing this class by its id will eventually increment/decrement this counter in their respective 'add/update/delete' routines. As these operations are always serialized under rtnl lock, we don't need an atomic type like 'refcount_t'. It also means that we lose the overflow/underflow checks already present in refcount_t, which are valuable to hunt down bugs where the unsigned counter wraps around as it aids automated tools like syzkaller to scream in such situations. Wrap the open coded increment/decrement into helper functions and add overflow checks to the operations. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 10:47:24 +02:00
Tomas Glozar	13d2618b48	bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC allocation later in sk_psock_init_link on PREEMPT_RT kernels, since GFP_ATOMIC might sleep on RT (see bpf: Make BPF and PREEMPT_RT co-exist patchset notes for details). This causes calling bpf_map_update_elem on BPF_MAP_TYPE_SOCKMAP maps to BUG (sleeping function called from invalid context) on RT kernels. preempt_disable was introduced together with lock_sk and rcu_read_lock in commit `99ba2b5aba` ("bpf: sockhash, disallow bpf_tcp_close and update in parallel"), probably to match disabled migration of BPF programs, and is no longer necessary. Remove preempt_disable to fix BUG in sock_map_update_common on RT. Signed-off-by: Tomas Glozar <tglozar@redhat.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/all/20200224140131.461979697@linutronix.de/ Fixes: `99ba2b5aba` ("bpf: sockhash, disallow bpf_tcp_close and update in parallel") Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230728064411.305576-1-tglozar@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-08-01 09:24:34 +02:00
Yue Haibing	2f48401dd0	net/hsr: Remove unused function declarations commit `f421436a59` ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") introducted these but never implemented. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230729123456.36340-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-31 20:11:47 -07:00
valis	b80b829e9e	net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free When route4_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: `1109c00547` ("net: sched: RCU cls_route") Reported-by: valis <sec@valis.email> Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-4-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-31 20:10:37 -07:00
valis	76e42ae831	net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free When fw_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: `e35a8ee599` ("net: sched: fw use RCU") Reported-by: valis <sec@valis.email> Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-31 20:10:36 -07:00
valis	3044b16e7c	net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free When u32_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: `de5df63228` ("net: sched: cls_u32 changes to knode must appear atomic to readers") Reported-by: valis <sec@valis.email> Reported-by: M A Ramdhan <ramdhan@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-31 20:10:36 -07:00
Daniel Xu	81584c23f2	netfilter: bpf: Only define get_proto_defrag_hook() if necessary Before, we were getting this warning: net/netfilter/nf_bpf_link.c:32:1: warning: 'get_proto_defrag_hook' defined but not used [-Wunused-function] Guard the definition with CONFIG_NF_DEFRAG_IPV[4\|6]. Fixes: `91721c2d02` ("netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202307291213.fZ0zDmoG-lkp@intel.com/ Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/b128b6489f0066db32c4772ae4aaee1480495929.1690840454.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-07-31 17:51:13 -07:00
Yue Haibing	634e449719	vsock: Remove unused function declarations These are never implemented since introduction in commit `d021c34405` ("VSOCK: Introduce VM Sockets") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20230729122036.32988-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-31 14:41:08 -07:00
Yue Haibing	4cbc32a8a2	net/smc: Remove unused function declarations commit `f9aab6f2ce` ("net/smc: immediate freeing in smc_lgr_cleanup_early()") left behind smc_lgr_schedule_free_work_fast() declaration. And since commit `349d43127d` ("net/smc: fix kernel panic caused by race of smc_sock") smc_ib_modify_qp_reset() is not used anymore. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://lore.kernel.org/r/20230729121929.17180-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-31 14:40:26 -07:00
Lorenz Bauer	74bdfab4fd	net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn There are already INDIRECT_CALLABLE_DECLARE in the hashtable headers, no need to declare them again. Fixes: `0f495f7617` ("net: remove duplicate reuseport_lookup functions") Suggested-by: Martin Lau <martin.lau@linux.dev> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230731-indir-call-v1-1-4cd0aeaee64f@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-07-31 13:53:10 -07:00
Jiri Slaby	659705d0a6	Bluetooth: rfcomm: remove casts from tty->driver_data tty->driver_data is 'void *', so there is no need to cast from that. Therefore remove the casts and assign the pointer directly. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: linux-bluetooth@vger.kernel.org Link: https://lore.kernel.org/r/20230731080244.2698-3-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-31 17:16:05 +02:00
Kuniyuki Iwashima	8936bf53a0	net: Use sockaddr_storage for getsockopt(SO_PEERNAME). Commit `df8fc4e934` ("kbuild: Enable -fstrict-flex-arrays=3") started applying strict rules to standard string functions. It does not work well with conventional socket code around each protocol- specific sockaddr_XXX struct, which is cast from sockaddr_storage and has a bigger size than fortified functions expect. See these commits: commit `06d4c8a808` ("af_unix: Fix fortify_panic() in unix_bind_bsd().") commit `ecb4534b6a` ("af_unix: Terminate sun_path when bind()ing pathname socket.") commit `a0ade8404c` ("af_packet: Fix warning of fortified memcpy() in packet_getname().") We must cast the protocol-specific address back to sockaddr_storage to call such functions. However, in the case of getsockaddr(SO_PEERNAME), the rationale is a bit unclear as the buffer is defined by char[128] which is the same size as sockaddr_storage. Let's use sockaddr_storage explicitly. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-31 09:14:16 +01:00
Kuniyuki Iwashima	e739718444	net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX. syzkaller found zero division error [0] in div_s64_rem() called from get_cycle_time_elapsed(), where sched->cycle_time is the divisor. We have tests in parse_taprio_schedule() so that cycle_time will never be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed(). The problem is that the types of divisor are different; cycle_time is s64, but the argument of div_s64_rem() is s32. syzkaller fed this input and 0x100000000 is cast to s32 to be 0. @TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000} We use s64 for cycle_time to cast it to ktime_t, so let's keep it and set max for cycle_time. While at it, we prevent overflow in setup_txtime() and add another test in parse_taprio_schedule() to check if cycle_time overflows. Also, we add a new tdc test case for this issue. [0]: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline] RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline] RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344 Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10 RSP: 0018:ffffc90000acf260 EFLAGS: 00010206 RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000 RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934 R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800 R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0edf84f0e8 CR3: 000000000d73c002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> get_packet_txtime net/sched/sch_taprio.c:508 [inline] taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577 taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658 dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732 __dev_xmit_skb net/core/dev.c:3821 [inline] __dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169 dev_queue_xmit include/linux/netdevice.h:3088 [inline] neigh_resolve_output net/core/neighbour.c:1552 [inline] neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:544 [inline] ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135 __ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196 ip6_finish_output net/ipv6/ip6_output.c:207 [inline] NF_HOOK_COND include/linux/netfilter.h:292 [inline] ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228 dst_output include/net/dst.h:458 [inline] NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303 ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508 ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666 addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175 process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597 worker_thread+0x60f/0x1240 kernel/workqueue.c:2748 kthread+0x2fe/0x3f0 kernel/kthread.c:389 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308 </TASK> Modules linked in: Fixes: `4cfd5779bd` ("taprio: Add support for txtime-assist mode") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Co-developed-by: Eric Dumazet <edumazet@google.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-31 09:12:27 +01:00
Ratheesh Kannoth	2b3082c6ef	net: flow_dissector: Use 64bits for used_keys As 32bits of dissector->used_keys are exhausted, increase the size to 64bits. This is base change for ESP/AH flow dissector patch. Please find patch and discussions at https://lore.kernel.org/netdev/ZMDNjD46BvZ5zp5I@corigine.com/T/#t Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw Tested-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-31 09:11:24 +01:00
Lin Ma	5e2424708d	xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH The previous commit `4e484b3e96` ("xfrm: rate limit SA mapping change message to user space") added one additional attribute named XFRMA_MTIMER_THRESH and described its type at compat_policy (net/xfrm/xfrm_compat.c). However, the author forgot to also describe the nla_policy at xfrma_policy (net/xfrm/xfrm_user.c). Hence, this suppose NLA_U32 (4 bytes) value can be faked as empty (0 bytes) by a malicious user, which leads to 4 bytes overflow read and heap information leak when parsing nlattrs. To exploit this, one malicious user can spray the SLUB objects and then leverage this 4 bytes OOB read to leak the heap data into x->mapping_maxage (see xfrm_update_ae_params(...)), and leak it to userspace via copy_to_user_state_extra(...). The above bug is assigned CVE-2023-3773. To fix it, this commit just completes the nla_policy description for XFRMA_MTIMER_THRESH, which enforces the length check and avoids such OOB read. Fixes: `4e484b3e96` ("xfrm: rate limit SA mapping change message to user space") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2023-07-31 08:20:08 +02:00
Lin Ma	00374d9b6d	xfrm: add NULL check in xfrm_update_ae_params Normally, x->replay_esn and x->preplay_esn should be allocated at xfrm_alloc_replay_state_esn(...) in xfrm_state_construct(...), hence the xfrm_update_ae_params(...) is okay to update them. However, the current implementation of xfrm_new_ae(...) allows a malicious user to directly dereference a NULL pointer and crash the kernel like below. BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 8253067 P4D 8253067 PUD 8e0e067 PMD 0 Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 PID: 98 Comm: poc.npd Not tainted 6.4.0-rc7-00072-gdad9774deaf1 #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.o4 RIP: 0010:memcpy_orig+0xad/0x140 Code: e8 4c 89 5f e0 48 8d 7f e0 73 d2 83 c2 20 48 29 d6 48 29 d7 83 fa 10 72 34 4c 8b 06 4c 8b 4e 08 c RSP: 0018:ffff888008f57658 EFLAGS: 00000202 RAX: 0000000000000000 RBX: ffff888008bd0000 RCX: ffffffff8238e571 RDX: 0000000000000018 RSI: ffff888007f64844 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888008f57818 R13: ffff888007f64aa4 R14: 0000000000000000 R15: 0000000000000000 FS: 00000000014013c0(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000054d8000 CR4: 00000000000006f0 Call Trace: <TASK> ? __die+0x1f/0x70 ? page_fault_oops+0x1e8/0x500 ? __pfx_is_prefetch.constprop.0+0x10/0x10 ? __pfx_page_fault_oops+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x11/0x40 ? fixup_exception+0x36/0x460 ? _raw_spin_unlock_irqrestore+0x11/0x40 ? exc_page_fault+0x5e/0xc0 ? asm_exc_page_fault+0x26/0x30 ? xfrm_update_ae_params+0xd1/0x260 ? memcpy_orig+0xad/0x140 ? __pfx__raw_spin_lock_bh+0x10/0x10 xfrm_update_ae_params+0xe7/0x260 xfrm_new_ae+0x298/0x4e0 ? __pfx_xfrm_new_ae+0x10/0x10 ? __pfx_xfrm_new_ae+0x10/0x10 xfrm_user_rcv_msg+0x25a/0x410 ? __pfx_xfrm_user_rcv_msg+0x10/0x10 ? __alloc_skb+0xcf/0x210 ? stack_trace_save+0x90/0xd0 ? filter_irq_stacks+0x1c/0x70 ? __stack_depot_save+0x39/0x4e0 ? __kasan_slab_free+0x10a/0x190 ? kmem_cache_free+0x9c/0x340 ? netlink_recvmsg+0x23c/0x660 ? sock_recvmsg+0xeb/0xf0 ? __sys_recvfrom+0x13c/0x1f0 ? __x64_sys_recvfrom+0x71/0x90 ? do_syscall_64+0x3f/0x90 ? entry_SYSCALL_64_after_hwframe+0x72/0xdc ? copyout+0x3e/0x50 netlink_rcv_skb+0xd6/0x210 ? __pfx_xfrm_user_rcv_msg+0x10/0x10 ? __pfx_netlink_rcv_skb+0x10/0x10 ? __pfx_sock_has_perm+0x10/0x10 ? mutex_lock+0x8d/0xe0 ? __pfx_mutex_lock+0x10/0x10 xfrm_netlink_rcv+0x44/0x50 netlink_unicast+0x36f/0x4c0 ? __pfx_netlink_unicast+0x10/0x10 ? netlink_recvmsg+0x500/0x660 netlink_sendmsg+0x3b7/0x700 This Null-ptr-deref bug is assigned CVE-2023-3772. And this commit adds additional NULL check in xfrm_update_ae_params to fix the NPD. Fixes: `d8647b79c3` ("xfrm: Add user interface for esn and big anti-replay windows") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2023-07-31 08:06:34 +02:00
Eric Dumazet	8bf43be799	net: annotate data-races around sk->sk_priority sk_getsockopt() runs locklessly. This means sk->sk_priority can be read while other threads are changing its value. Other reads also happen without socket lock being held. Add missing annotations where needed. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	e5f0d2dd3c	net: add missing data-race annotation for sk_ll_usec In a prior commit I forgot that sk_getsockopt() reads sk->sk_ll_usec without holding a lock. Fixes: `0dbffbb533` ("net: annotate data race around sk_ll_usec") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	11695c6e96	net: add missing data-race annotations around sk->sk_peek_off sk_getsockopt() runs locklessly, thus we need to annotate the read of sk->sk_peek_off. While we are at it, add corresponding annotations to sk_set_peek_off() and unix_set_peek_off(). Fixes: `b9bb53f383` ("sock: convert sk_peek_offset functions to WRITE_ONCE") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	3c5b4d69c3	net: annotate data-races around sk->sk_mark sk->sk_mark is often read while another thread could change the value. Fixes: `4a19ec5800` ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	b4b5532530	net: add missing READ_ONCE(sk->sk_rcvbuf) annotation In a prior commit, I forgot to change sk_getsockopt() when reading sk->sk_rcvbuf locklessly. Fixes: `ebb3b78db7` ("tcp: annotate sk->sk_rcvbuf lockless reads") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	74bc084327	net: add missing READ_ONCE(sk->sk_sndbuf) annotation In a prior commit, I forgot to change sk_getsockopt() when reading sk->sk_sndbuf locklessly. Fixes: `e292f05e0d` ("tcp: annotate sk->sk_sndbuf lockless reads") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	285975dd67	net: annotate data-races around sk->sk_{rcv\|snd}timeo sk_getsockopt() runs without locks, we must add annotations to sk->sk_rcvtimeo and sk->sk_sndtimeo. In the future we might allow fetching these fields before we lock the socket in TCP fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	e6d12bdb43	net: add missing READ_ONCE(sk->sk_rcvlowat) annotation In a prior commit, I forgot to change sk_getsockopt() when reading sk->sk_rcvlowat locklessly. Fixes: `eac66402d1` ("net: annotate sk->sk_rcvlowat lockless reads") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	ea7f45ef77	net: annotate data-races around sk->sk_max_pacing_rate sk_getsockopt() runs locklessly. This means sk->sk_max_pacing_rate can be read while other threads are changing its value. Fixes: `62748f32d5` ("net: introduce SO_MAX_PACING_RATE") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	c76a032889	net: annotate data-race around sk->sk_txrehash sk_getsockopt() runs locklessly. This means sk->sk_txrehash can be read while other threads are changing its value. Other locations were handled in commit `cb6cd2cec7` ("tcp: Change SYN ACK retransmit behaviour to account for rehash") Fixes: `26859240e4` ("txhash: Add socket option to control TX hash rethink behavior") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:41 +01:00
Eric Dumazet	fe11fdcb42	net: annotate data-races around sk->sk_reserved_mem sk_getsockopt() runs locklessly. This means sk->sk_reserved_mem can be read while other threads are changing its value. Add missing annotations where they are needed. Fixes: `2bb2f5fb21` ("net: add new socket option SO_RESERVE_MEM") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 18:13:40 +01:00
Richard Gobert	7938cd1543	net: gro: fix misuse of CB in udp socket lookup This patch fixes a misuse of IP{6}CB(skb) in GRO, while calling to `udp6_lib_lookup2` when handling udp tunnels. `udp6_lib_lookup2` fetch the device from CB. The fix changes it to fetch the device from `skb->dev`. l3mdev case requires special attention since it has a master and a slave device. Fixes: `a6024562ff` ("udp: Add GRO functions to UDP socket") Reported-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-29 17:10:27 +01:00
Jamal Hadi Salim	e68409db99	net: sched: cls_u32: Fix match key mis-addressing A match entry is uniquely identified with an "address" or "path" in the form of: hashtable ID(12b):bucketid(8b):nodeid(12b). When creating table match entries all of hash table id, bucket id and node (match entry id) are needed to be either specified by the user or reasonable in-kernel defaults are used. The in-kernel default for a table id is 0x800(omnipresent root table); for bucketid it is 0x0. Prior to this fix there was none for a nodeid i.e. the code assumed that the user passed the correct nodeid and if the user passes a nodeid of 0 (as Mingi Cho did) then that is what was used. But nodeid of 0 is reserved for identifying the table. This is not a problem until we dump. The dump code notices that the nodeid is zero and assumes it is referencing a table and therefore references table struct tc_u_hnode instead of what was created i.e match entry struct tc_u_knode. Ming does an equivalent of: tc filter add dev dummy0 parent 10: prio 1 handle 0x1000 \ protocol ip u32 match ip src 10.0.0.1/32 classid 10:1 action ok Essentially specifying a table id 0, bucketid 1 and nodeid of zero Tableid 0 is remapped to the default of 0x800. Bucketid 1 is ignored and defaults to 0x00. Nodeid was assumed to be what Ming passed - 0x000 dumping before fix shows: ~$ tc filter ls dev dummy0 parent 10: filter protocol ip pref 1 u32 chain 0 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor -30591 Note that the last line reports a table instead of a match entry (you can tell this because it says "ht divisor..."). As a result of reporting the wrong data type (misinterpretting of struct tc_u_knode as being struct tc_u_hnode) the divisor is reported with value of -30591. Ming identified this as part of the heap address (physmap_base is 0xffff8880 (-30591 - 1)). The fix is to ensure that when table entry matches are added and no nodeid is specified (i.e nodeid == 0) then we get the next available nodeid from the table's pool. After the fix, this is what the dump shows: $ tc filter ls dev dummy0 parent 10: filter protocol ip pref 1 u32 chain 0 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1 filter protocol ip pref 1 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 10:1 not_in_hw match 0a000001/ffffffff at 12 action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 Reported-by: Mingi Cho <mgcho.minic@gmail.com> Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20230726135151.416917-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 18:05:04 -07:00
Daniel Xu	91721c2d02	netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link This commit adds support for enabling IP defrag using pre-existing netfilter defrag support. Basically all the flag does is bump a refcnt while the link the active. Checks are also added to ensure the prog requesting defrag support is run _after_ netfilter defrag hooks. We also take care to avoid any issues w.r.t. module unloading -- while defrag is active on a link, the module is prevented from unloading. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-07-28 16:52:08 -07:00
Daniel Xu	9abddac583	netfilter: defrag: Add glue hooks for enabling/disabling defrag We want to be able to enable/disable IP packet defrag from core bpf/netfilter code. In other words, execute code from core that could possibly be built as a module. To help avoid symbol resolution errors, use glue hooks that the modules will register callbacks with during module init. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/f6a8824052441b72afe5285acedbd634bd3384c1.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-07-28 16:52:08 -07:00
Jakub Kicinski	05191d8896	Merge branch 'in-kernel-support-for-the-tls-alert-protocol' Chuck Lever says: ==================== In-kernel support for the TLS Alert protocol IMO the kernel doesn't need user space (ie, tlshd) to handle the TLS Alert protocol. Instead, a set of small helper functions can be used to handle sending and receiving TLS Alerts for in-kernel TLS consumers. ==================== Merged on top of a tag in case it's needed in the NFS tree. Link: https://lore.kernel.org/r/169047923706.5241.1181144206068116926.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:08:02 -07:00
Chuck Lever	b470985c76	net/handshake: Trace events for TLS Alert helpers Add observability for the new TLS Alert infrastructure. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047947409.5241.14548832149596892717.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:07:59 -07:00
Chuck Lever	39067dda1d	SUNRPC: Use new helpers to handle TLS Alerts Use the helpers to parse the level and description fields in incoming alerts. "Warning" alerts are discarded, and "fatal" alerts mean the session is no longer valid. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047944747.5241.1974889594004407123.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:07:59 -07:00
Chuck Lever	39d0e38dcc	net/handshake: Add helpers for parsing incoming TLS Alerts Kernel TLS consumers can replace common TLS Alert parsing code with these helpers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047942074.5241.13791647439480672048.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:07:59 -07:00
Chuck Lever	5dd5ad682c	SUNRPC: Send TLS Closure alerts before closing a TCP socket Before closing a TCP connection, the TLS protocol wants peers to send session close Alert notifications. Add those in both the RPC client and server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047939404.5241.14392506226409865832.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:07:59 -07:00
Chuck Lever	35b1b538d4	net/handshake: Add API for sending TLS Closure alerts This helper sends an alert only if a TLS session was established. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047936730.5241.618595693821012638.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:07:59 -07:00
Chuck Lever	6a7eccef47	net/tls: Move TLS protocol elements to a separate header Kernel TLS consumers will need definitions of various parts of the TLS protocol, but often do not need the function declarations and other infrastructure provided in <net/tls.h>. Break out existing standardized protocol elements into a separate header, and make room for a few more elements in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047931374.5241.7713175865185969309.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 14:07:59 -07:00
Patrick Rohr	5027d54a9c	net: change accept_ra_min_rtr_lft to affect all RA lifetimes accept_ra_min_rtr_lft only considered the lifetime of the default route and discarded entire RAs accordingly. This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and applies the value to individual RA sections; in particular, router lifetime, PIO preferred lifetime, and RIO lifetime. If any of those lifetimes are lower than the configured value, the specific RA section is ignored. In order for the sysctl to be useful to Android, it should really apply to all lifetimes in the RA, since that is what determines the minimum frequency at which RAs must be processed by the kernel. Android uses hardware offloads to drop RAs for a fraction of the minimum of all lifetimes present in the RA (some networks have very frequent RAs (5s) with high lifetimes (2h)). Despite this, we have encountered networks that set the router lifetime to 30s which results in very frequent CPU wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the WiFi firmware) entirely on such networks, it seems better to ignore the misconfigured routers while still processing RAs from other IPv6 routers on the same network (i.e. to support IoT applications). The previous implementation dropped the entire RA based on router lifetime. This turned out to be hard to expand to the other lifetimes present in the RA in a consistent manner; dropping the entire RA based on RIO/PIO lifetimes would essentially require parsing the whole thing twice. Fixes: `1671bcfd76` ("net: add sysctl accept_ra_min_rtr_lft") Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Patrick Rohr <prohr@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230726230701.919212-1-prohr@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 13:30:51 -07:00
Jakub Kicinski	84e00d9bd4	net: convert some netlink netdev iterators to depend on the xarray Reap the benefits of easier iteration thanks to the xarray. Convert just the genetlink ones, those are easier to test. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 11:35:58 -07:00
Jakub Kicinski	759ab1edb5	net: store netdevs in an xarray Iterating over the netdev hash table for netlink dumps is hard. Dumps are done in "chunks" so we need to save the position after each chunk, so we know where to restart from. Because netdevs are stored in a hash table we remember which bucket we were in and how many devices we dumped. Since we don't hold any locks across the "chunks" - devices may come and go while we're dumping. If that happens we may miss a device (if device is deleted from the bucket we were in). We indicate to user space that this may have happened by setting NLM_F_DUMP_INTR. User space is supposed to dump again (I think) if it sees that. Somehow I doubt most user space gets this right.. To illustrate let's look at an example: System state: start: # [A, B, C] del: B # [A, C] with the hash table we may dump [A, B], missing C completely even tho it existed both before and after the "del B". Add an xarray and use it to allocate ifindexes. This way we can iterate ifindexes in order, without the worry that we'll skip one. We may still generate a dump of a state which "never existed", for example for a set of values and sequence of ops: System state: start: # [A, B] add: C # [A, C, B] del: B # [A, C] we may generate a dump of [A], if C got an index between A and B. System has never been in such state. But I'm 90% sure that's perfectly fine, important part is that we can't _miss_ devices which exist before and after. User space which wants to mirror kernel's state subscribes to notifications and does periodic dumps so it will know that C exists from the notification about its creation or from the next dump (next dump is _guaranteed_ to include C, if it doesn't get removed). To avoid any perf regressions keep the hash table for now. Most net namespaces have very few devices and microbenchmarking 1M lookups on Skylake I get the following results (not counting loopback to number of devs): #devs \| hash \| xa \| delta 2 \| 18.3 \| 20.1 \| + 9.8% 16 \| 18.3 \| 20.1 \| + 9.5% 64 \| 18.3 \| 26.3 \| +43.8% 128 \| 20.4 \| 26.3 \| +28.6% 256 \| 20.0 \| 26.4 \| +32.1% 1024 \| 26.6 \| 26.7 \| + 0.2% 8192 \|541.3 \| 33.5 \| -93.8% No surprises since the hash table has 256 entries. The microbenchmark scans indexes in order, if the pattern is more random xa starts to win at 512 devices already. But that's a lot of devices, in practice. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-28 11:35:58 -07:00
Linus Torvalds	e62e26d3e9	A patch to reduce the potential for erroneous RBD exclusive lock blocklisting (fencing) with a couple of prerequisites and a fixup to prevent metrics from being sent to the MDS even just once after that has been disabled by the user. All marked for stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTD7MgTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi3SBB/4nHdfSQwy0z2+PM766sUKxSlmaRw8X 4AJyGAIGj5BnHHhtluwLpEYfrh3wfyCRaNYgS64jdsudbUBxPKIIWn2lEFxtyWbC w0R2uEc+NGJLJOYfJ+lBP06Q2r6qk7N6OGNy6qLaN+v6xJ8WPw7H3fJBLVhnPgMq 7lkACRN+0P5Xt6ZJ57kbWWiFQ+vjv7bbDa0P9zMl6uCgoYsIpvrskqygx+gHbdsq IcnpsHu3F0ycYAJT5eJ5GcCcThvwbNjWdbJy1fERah7U/LNcX/S3To9V5LPmydOQ tYAWMlC/1a99fr+jTYF0Pu5GLUdK0UMKeX04ZN3SKON4pORurpypw3n8 =rw72 -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "A patch to reduce the potential for erroneous RBD exclusive lock blocklisting (fencing) with a couple of prerequisites and a fixup to prevent metrics from being sent to the MDS even just once after that has been disabled by the user. All marked for stable" * tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client: rbd: retrieve and check lock owner twice before blocklisting rbd: harden get_lock_owner_info() a bit rbd: make get_lock_owner_info() return a single locker or NULL ceph: never send metrics if disable_send_metrics is set	2023-07-28 10:47:24 -07:00
Linus Torvalds	28d79b746c	Misc set of fixes for 9p in 6.5 Most of these clean up warnings we've gotten out of compilation tools, but several of them were from inspection while hunting down a couple of regressions. The most important one to pull is `75b396821c` (fs/9p: remove unnecessary and overrestrictive check) which caused a regression for some folks by restricting mmap in any case where writeback caches weren't enabled. Most of the other bugs caught via inspection were type mismatches. Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmTDH8cACgkQiP/V+0pf /5hSJQ//b59cDliC7Knf9B2Of1UsLJ2wYIbxWVYKLwYarKFn3tmtO5dPtWZrQzjB Kz6fif5z1c0WdjNFLifs/XNqUq5znX/TY8bV/NmOg8VlaoJqmUQSSYnNQOWZCFKT zwxC6BO6gPNNIkJN2xQ8oOq11Qon/nbZbuN9P2VDcT5Yr2KmFx6FHRcrBNRYAm3E UzFdjkLrLef3VrvegJNGM3Wv2HqyNBA6QhifZBjDkydtDPMd9fRNns7Q60AARR9K aqXV6SihE/Ox7sSmVNjTzYF67eq5Xjt+sSzo2SdfOaZxVIa6wf0UXQuFqmHts6Zs QUCdXS5YbQAwQfdkm22rnTIxAwsbEpFOGGweUvMBXzZbl/sq/PK4Nt6DpCS8ZFi3 81Z5Ey+Q4yaxwdirP521M4ao2Ae2Fzg12bqDTNssZdOYGcXBqBfWiR5IfMbbkgWq WzCVI3V/LshQ75pXQyS4BtW/29C2nN7g3jLrF3Q5OTe7XmHMCZFvtP4lKvY0piQy ++XoDs1LCJWSZebfkNa05L5nhQ1mYhwiZutHTtF3ejTTiJvcJXQ4xHYLzjOON+4i blLTpgWLO0rIRAmX0I8GwPi6q0xL4rFP4XGGz/LDppRQkRa13vtzFtMvuldPyEq7 g4pcLkI3SPbL982qYbg8UO+GjO/Q9M/DafXQVlUyDw04TQbT8Jg= =TAC1 -----END PGP SIGNATURE----- Merge tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9p fixes from Eric Van Hensbergen: "Misc set of fixes for 9p. Most of these clean up warnings we've gotten out of compilation tools, but several of them were from inspection while hunting down a couple of regressions. The most important one is `75b396821c` ("fs/9p: remove unnecessary and overrestrictive check") which caused a regression for some folks by restricting mmap in any case where writeback caches weren't enabled. Most of the other bugs caught via inspection were type mismatches" * tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: Remove unused extern declaration 9p: remove dead stores (variable set again without being read) 9p: virtio: skip incrementing unused variable 9p: virtio: make sure 'offs' is initialized in zc_request 9p: virtio: fix unlikely null pointer deref in handle_rerror 9p: fix ignored return value in v9fs_dir_release fs/9p: remove unnecessary invalidate_inode_pages2 fs/9p: fix type mismatch in file cache mode helper fs/9p: fix typo in comparison logic for cache mode fs/9p: remove unnecessary and overrestrictive check fs/9p: Fix a datatype used with V9FS_DIRECT_IO	2023-07-28 10:43:16 -07:00
Remi Pommarel	eac27a41ab	batman-adv: Do not get eth header before batadv_check_management_packet If received skb in batadv_v_elp_packet_recv or batadv_v_ogm_packet_recv is either cloned or non linearized then its data buffer will be reallocated by batadv_check_management_packet when skb_cow or skb_linearize get called. Thus geting ethernet header address inside skb data buffer before batadv_check_management_packet had any chance to reallocate it could lead to the following kernel panic: Unable to handle kernel paging request at virtual address ffffff8020ab069a Mem abort info: ESR = 0x96000007 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x07: level 3 translation fault Data abort info: ISV = 0, ISS = 0x00000007 CM = 0, WnR = 0 swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000040f45000 [ffffff8020ab069a] pgd=180000007fffa003, p4d=180000007fffa003, pud=180000007fffa003, pmd=180000007fefe003, pte=0068000020ab0706 Internal error: Oops: 96000007 [#1] SMP Modules linked in: ahci_mvebu libahci_platform libahci dvb_usb_af9035 dvb_usb_dib0700 dib0070 dib7000m dibx000_common ath11k_pci ath10k_pci ath10k_core mwl8k_new nf_nat_sip nf_conntrack_sip xhci_plat_hcd xhci_hcd nf_nat_pptp nf_conntrack_pptp at24 sbsa_gwdt CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.15.42-00066-g3242268d425c-dirty #550 Hardware name: A8k (DT) pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : batadv_is_my_mac+0x60/0xc0 lr : batadv_v_ogm_packet_recv+0x98/0x5d0 sp : ffffff8000183820 x29: ffffff8000183820 x28: 0000000000000001 x27: ffffff8014f9af00 x26: 0000000000000000 x25: 0000000000000543 x24: 0000000000000003 x23: ffffff8020ab0580 x22: 0000000000000110 x21: ffffff80168ae880 x20: 0000000000000000 x19: ffffff800b561000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 00dc098924ae0032 x14: 0f0405433e0054b0 x13: ffffffff00000080 x12: 0000004000000001 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : 0000000000000000 x7 : ffffffc076dae000 x6 : ffffff8000183700 x5 : ffffffc00955e698 x4 : ffffff80168ae000 x3 : ffffff80059cf000 x2 : ffffff800b561000 x1 : ffffff8020ab0696 x0 : ffffff80168ae880 Call trace: batadv_is_my_mac+0x60/0xc0 batadv_v_ogm_packet_recv+0x98/0x5d0 batadv_batman_skb_recv+0x1b8/0x244 __netif_receive_skb_core.isra.0+0x440/0xc74 __netif_receive_skb_one_core+0x14/0x20 netif_receive_skb+0x68/0x140 br_pass_frame_up+0x70/0x80 br_handle_frame_finish+0x108/0x284 br_handle_frame+0x190/0x250 __netif_receive_skb_core.isra.0+0x240/0xc74 __netif_receive_skb_list_core+0x6c/0x90 netif_receive_skb_list_internal+0x1f4/0x310 napi_complete_done+0x64/0x1d0 gro_cell_poll+0x7c/0xa0 __napi_poll+0x34/0x174 net_rx_action+0xf8/0x2a0 _stext+0x12c/0x2ac run_ksoftirqd+0x4c/0x7c smpboot_thread_fn+0x120/0x210 kthread+0x140/0x150 ret_from_fork+0x10/0x20 Code: f9403844 eb03009f 54fffee1 f94 Thus ethernet header address should only be fetched after batadv_check_management_packet has been called. Fixes: `0da0035942` ("batman-adv: OGMv2 - add basic infrastructure") Cc: stable@vger.kernel.org Signed-off-by: Remi Pommarel <repk@triplefau.lt> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2023-07-28 15:39:38 +02:00
Hangbin Liu	7f6c40391a	IPv6: add extack info for IPv6 address add/delete Add extack info for IPv6 address add/delete, which would be useful for users to understand the problem without having to read kernel code. Suggested-by: Beniamino Galvani <bgalvani@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-28 11:01:56 +01:00
Joe Damato	801b27e880	net: ethtool: Unify ETHTOOL_{G,S}RXFH rxnfc copy ETHTOOL_GRXFH correctly copies in the full struct ethtool_rxnfc when FLOW_RSS is set; ETHTOOL_SRXFH needs a similar code path to handle the FLOW_RSS case so that ethtool can set the flow hash for custom RSS contexts (if supported by the driver). The copy code from ETHTOOL_GRXFH has been pulled out in to a helper so that it can be called in both ETHTOOL_{G,S}RXFH code paths. Acked-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-28 09:35:53 +01:00
Rob Herring	3d40aed862	net: Explicitly include correct DT includes The DT of_device.h and of_platform.h date back to the separate of_platform_bus_type before it as merged into the regular platform bus. As part of that merge prepping Arm DT support 13 years ago, they "temporarily" include each other. They also include platform_device.h and of.h. As a result, there's a pretty much random mix of those include files used throughout the tree. In order to detangle these headers and replace the implicit includes with struct declarations, users need to explicitly include the correct includes. Acked-by: Alex Elder <elder@linaro.org> Reviewed-by: Bhupesh Sharma <bhupesh.sharma@linaro.org> Reviewed-by: Wei Fang <wei.fang@nxp.com> Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230727014944.3972546-1-robh@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 20:33:16 -07:00
Jakub Kicinski	5908a4c47c	netfilter net-next pull request 2023-07-27 -----BEGIN PGP SIGNATURE----- iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTCcgkNHGZ3QHN0cmxl bi5kZQAKCRBwkajZrV/2AJmMD/9IPWnzSNLUgoAhSo0h2OkCKl2iIdRnkrPrruhE Su8bD8ohmU100iN1DMXT2a7C9o0BTog4EB7WtF21z+06dUhROiZizrSt8bTk/rRi 0+Sm9xlDAdl3CZcU8fnVjwf6PLYgUv5zVjcQc4Ggf15MwEIdpviKCps2bbBtrozF PJEK6+UwTU6+z4GSTc957nhFHstEcwktyxoaAote98CD78G2YCQT5yVbfctHgRm0 9qovT8S/zZmqHvqvUfrqJd+N5V/+40O7ZuFls93kYxK9Bttx9wRwEqALPldxXudU o0kG4QZ8NAwiIVsGqPwKu/cKi9PF0z/PUXYgVdnkKK+XofBDHbHyfR+BJO1ejOdX +ea9AoQ6lD6NVmvX01+lF9OI4D1zgc6pLGyjSsyVgv3x0iKJeZ8QOgb0DTGFiG1U MnFIeckedrh/dt3NXLG/blZvuAzhofHqEhH/DlvbI/QBtN2zEgIMJKxRfBAMs3OO WAIlaHASQFVbyrHOr/X3FoNDTsvZyrTppo9WwJVTj9F41lYXzWoiBY+nVj2brGDR SMW1M13sufRBQlk0aTpPYPvcS5FhsMf6ggxygi2rNxX5/AdFE02nnEU9ybpHAqcy NiZ8kCxJ2J9+aCj7yvJ7QQcAD7l2tAIeAZCKSlKteigqTI0PWoTUc0IYPT85URLm cy/l4A== =fgLz -----END PGP SIGNATURE----- Merge tag 'nf-next-23-07-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter updates for net-next 1. silence a harmless warning for CONFIG_NF_CONNTRACK_PROCFS=n builds, from Zhu Wang. 2, 3: Allow NLA_POLICY_MASK to be used with BE16/BE32 types, and replace a few manual checks with nla_policy based one in nf_tables, from myself. 4: cleanup in ctnetlink to validate while parsing rather than using two steps, from Lin Ma. 5: refactor boyer-moore textsearch by moving a small chunk to a helper function, rom Jeremy Sowden. * tag 'nf-next-23-07-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: lib/ts_bm: add helper to reduce indentation and improve readability netfilter: conntrack: validate cta_ip via parsing netfilter: nf_tables: use NLA_POLICY_MASK to test for valid flag options netlink: allow be16 and be32 types in all uint policy checks nf_conntrack: fix -Wunused-const-variable= ==================== Link: https://lore.kernel.org/r/20230727133604.8275-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 20:25:43 -07:00
Jakub Kicinski	bb85e12f8f	Merge branch 'net-tls-fixes-for-nvme-over-tls' Hannes Reinecke says: ==================== net/tls: fixes for NVMe-over-TLS here are some small fixes to get NVMe-over-TLS up and running. The first set are just minor modifications to have MSG_EOR handled for TLS, but the second set implements the ->read_sock() callback for tls_sw. The ->read_sock() callbacks return -EIO when encountering any TLS Alert message, but as that's the default behaviour anyway I guess we can get away with it. ==================== Applied on top of the tag in case Sagi gets convinced to pull it. Link: https://lore.kernel.org/r/20230726191556.41714-1-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 20:11:48 -07:00
Eric Dumazet	4d50e50045	net: flower: fix stack-out-of-bounds in fl_set_key_cfm() Typical misuse of nla_parse_nested(array, XXX_MAX, ...); array must be declared as struct nlattr *array[XXX_MAX + 1]; v2: Based on feedbacks from Ido Schimmel and Zahari Doychev, I also changed TCA_FLOWER_KEY_CFM_OPT_MAX and cfm_opt_policy definitions. syzbot reported: BUG: KASAN: stack-out-of-bounds in __nla_validate_parse+0x136/0x2bd0 lib/nlattr.c:588 Write of size 32 at addr ffffc90003a0ee20 by task syz-executor296/5014 CPU: 0 PID: 5014 Comm: syz-executor296 Not tainted 6.5.0-rc2-syzkaller-00307-gd192f5382581 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x163/0x540 mm/kasan/report.c:475 kasan_report+0x175/0x1b0 mm/kasan/report.c:588 kasan_check_range+0x27e/0x290 mm/kasan/generic.c:187 __asan_memset+0x23/0x40 mm/kasan/shadow.c:84 __nla_validate_parse+0x136/0x2bd0 lib/nlattr.c:588 __nla_parse+0x40/0x50 lib/nlattr.c:700 nla_parse_nested include/net/netlink.h:1262 [inline] fl_set_key_cfm+0x1e3/0x440 net/sched/cls_flower.c:1718 fl_set_key+0x2168/0x6620 net/sched/cls_flower.c:1884 fl_tmplt_create+0x1fe/0x510 net/sched/cls_flower.c:2666 tc_chain_tmplt_add net/sched/cls_api.c:2959 [inline] tc_ctl_chain+0x131d/0x1ac0 net/sched/cls_api.c:3068 rtnetlink_rcv_msg+0x82b/0xf50 net/core/rtnetlink.c:6424 netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2549 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x7c3/0x990 net/netlink/af_netlink.c:1365 netlink_sendmsg+0xa2a/0xd60 net/netlink/af_netlink.c:1914 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x592/0x890 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2577 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f54c6150759 Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffe06c30578 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f54c619902d RCX: 00007f54c6150759 RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003 RBP: 00007ffe06c30590 R08: 0000000000000000 R09: 00007ffe06c305f0 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f54c61c35f0 R13: 00007ffe06c30778 R14: 0000000000000001 R15: 0000000000000001 </TASK> The buggy address belongs to stack of task syz-executor296/5014 and is located at offset 32 in frame: fl_set_key_cfm+0x0/0x440 net/sched/cls_flower.c:374 This frame has 1 object: [32, 56) 'nla_cfm_opt' The buggy address belongs to the virtual mapping at [ffffc90003a08000, ffffc90003a11000) created by: copy_process+0x5c8/0x4290 kernel/fork.c:2330 Fixes: `7cfffd5fed` ("net: flower: add support for matching cfm fields") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Simon Horman <simon.horman@corigine.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Zahari Doychev <zdoychev@maxlinear.com> Link: https://lore.kernel.org/r/20230726145815.943910-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 20:01:29 -07:00
Hannes Reinecke	662fbcec32	net/tls: implement ->read_sock() Implement ->read_sock() function for use with nvme-tcp. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Cc: Boris Pismenny <boris.pismenny@gmail.com> Link: https://lore.kernel.org/r/20230726191556.41714-7-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 19:49:35 -07:00
Hannes Reinecke	f9ae3204fb	net/tls: split tls_rx_reader_lock Split tls_rx_reader_{lock,unlock} into an 'acquire/release' and the actual locking part. With that we can use the tls_rx_reader_lock in situations where the socket is already locked. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230726191556.41714-6-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 19:49:35 -07:00
Hannes Reinecke	11863c6d44	net/tls: Use tcp_read_sock() instead of ops->read_sock() TLS resets the protocol operations, so the read_sock() callback might be changed, too. In this case using sock->ops->readsock() in tls_strp_read_copyin() will enter an infinite recursion if the read_sock() callback is calling tls_rx_rec_wait() which will call into sock->ops->readsock() via tls_strp_read_copyin(). But as tls_strp_read_copyin() is supposed to produce data from the consumed socket and that socket is always a TCP socket we can call tcp_read_sock() directly without having to deal with callbacks. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230726191556.41714-5-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 19:49:35 -07:00
Hannes Reinecke	c004b0e00c	net/tls: handle MSG_EOR for tls_device TX flow tls_push_data() MSG_MORE, but bails out on MSG_EOR. Seeing that MSG_EOR is basically the opposite of MSG_MORE this patch adds handling MSG_EOR by treating it as the absence of MSG_MORE. Consequently we should return an error when both are set. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230726191556.41714-3-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 19:49:34 -07:00
Hannes Reinecke	e22e358bbe	net/tls: handle MSG_EOR for tls_sw TX flow tls_sw_sendmsg() already handles MSG_MORE, but bails out on MSG_EOR. Seeing that MSG_EOR is basically the opposite of MSG_MORE this patch adds handling MSG_EOR by treating it as the negation of MSG_MORE. And erroring out if MSG_EOR is specified with MSG_MORE. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230726191556.41714-2-hare@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 19:49:34 -07:00
Russell King (Oracle)	9945c1fb03	net: dsa: fix older DSA drivers using phylink Older DSA drivers that do not provide an dsa_ops adjust_link method end up using phylink. Unfortunately, a recent phylink change that requires its supported_interfaces bitmap to be filled breaks these drivers because the bitmap remains empty. Rather than fixing each driver individually, fix it in the core code so we have a sensible set of defaults. Reported-by: Sergei Antonov <saproj@gmail.com> Fixes: `de5c9bf40c` ("net: phylink: require supported_interfaces to be filled") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> # dsa_loop Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/E1qOflM-001AEz-D3@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 17:19:46 -07:00
YueHaibing	d4a80cc69a	dccp: Remove unused declaration dccp_feat_initialise_sysctls() This is never used, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230726143239.9904-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 17:16:26 -07:00
Lin Ma	d73ef2d69c	rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length There are totally 9 ndo_bridge_setlink handlers in the current kernel, which are 1) bnxt_bridge_setlink, 2) be_ndo_bridge_setlink 3) i40e_ndo_bridge_setlink 4) ice_bridge_setlink 5) ixgbe_ndo_bridge_setlink 6) mlx5e_bridge_setlink 7) nfp_net_bridge_setlink 8) qeth_l2_bridge_setlink 9) br_setlink. By investigating the code, we find that 1-7 parse and use nlattr IFLA_BRIDGE_MODE but 3 and 4 forget to do the nla_len check. This can lead to an out-of-attribute read and allow a malformed nlattr (e.g., length 0) to be viewed as a 2 byte integer. To avoid such issues, also for other ndo_bridge_setlink handlers in the future. This patch adds the nla_len check in rtnl_bridge_setlink and does an early error return if length mismatches. To make it works, the break is removed from the parsing for IFLA_BRIDGE_FLAGS to make sure this nla_for_each_nested iterates every attribute. Fixes: `b1edc14a3f` ("ice: Implement ice_bridge_getlink and ice_bridge_setlink") Fixes: `51616018dd` ("i40e: Add support for getlink, setlink ndo ops") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Lin Ma <linma@zju.edu.cn> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20230726075314.1059224-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 17:14:01 -07:00
YueHaibing	4d66f235c7	bridge: Remove unused declaration br_multicast_set_hash_max() Since commit `19e3a9c90c` ("net: bridge: convert multicast to generic rhashtable") this is not used, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230726143141.11704-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 17:11:29 -07:00
Patrick Rohr	ef27ba5c84	net: remove comment in ndisc_router_discovery Removes superfluous (and misplaced) comment from ndisc_router_discovery. Signed-off-by: Patrick Rohr <prohr@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230726184742.342825-1-prohr@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 16:56:23 -07:00
Jakub Kicinski	014acf2668	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 15:22:46 -07:00
Lin Ma	bcc29b7f5a	bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing The nla_for_each_nested parsing in function bpf_sk_storage_diag_alloc does not check the length of the nested attribute. This can lead to an out-of-attribute read and allow a malformed nlattr (e.g., length 0) to be viewed as a 4 byte integer. This patch adds an additional check when the nlattr is getting counted. This makes sure the latter nla_get_u32 can access the attributes with the correct length. Fixes: `1ed4d92458` ("bpf: INET_DIAG support in bpf_sk_storage") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230725023330.422856-1-linma@zju.edu.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-07-27 10:07:56 -07:00
Arseniy Krasnov	a75f501de8	virtio/vsock: support MSG_PEEK for SOCK_SEQPACKET This adds support of MSG_PEEK flag for SOCK_SEQPACKET type of socket. Difference with SOCK_STREAM is that this callback returns either length of the message or error. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-07-27 15:51:48 +02:00

... 4 5 6 7 8 ...

74468 Commits