OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Wen Gong	e6ed929b41	wireless: add check of field VHT Extended NSS BW Capable for 160/80+80 MHz setting Table 9-251—Supported VHT-MCS and NSS Set subfields, it has subfield VHT Extended NSS BW Capable, its definition is: Indicates whether the STA is capable of interpreting the Extended NSS BW Support subfield of the VHT Capabilities Information field. This patch is to add check for the subfield. Signed-off-by: Wen Gong <wgong@codeaurora.org> Link: https://lore.kernel.org/r/20210524033624.16993-1-wgong@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:13 +02:00
Ryder Lee	4f2e3eb6c9	mac80211: check per vif offload_flags in Tx path offload_flags has been introduced to indicate encap status of each interface. An interface can encap offload at runtime, or if it has some extra limitations it can simply override the flags, so it's more flexible to check offload_flags in Tx path. Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Link: https://lore.kernel.org/r/177785418cf407808bf3a44760302d0647076990.1623961575.git.ryder.lee@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:13 +02:00
Ryder Lee	3187ba0cea	mac80211: add rate control support for encap offload The software rate control cannot deal with encap offload, so fix it. Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Link: https://lore.kernel.org/r/20210617163113.75815-3-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:13 +02:00
Ryder Lee	03c3911d2d	mac80211: call ieee80211_tx_h_rate_ctrl() when dequeue Make ieee80211_tx_h_rate_ctrl() get called on dequeue to improve performance since it reduces the turnaround time for rate control. Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Link: https://lore.kernel.org/r/20210617163113.75815-2-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:13 +02:00
Felix Fietkau	08a46c6420	mac80211: move A-MPDU session check from minstrel_ht to mac80211 This avoids calling back into tx handlers from within the rate control module. Preparation for deferring rate control until tx dequeue Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210617163113.75815-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:13 +02:00
Emmanuel Grumbach	358ae88881	cfg80211: expose the rfkill device to the low level driver This will allow the low level driver to query the rfkill state. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://lore.kernel.org/r/20210616202826.9833-1-emmanuel.grumbach@intel.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:13 +02:00
Johannes Berg	3f9d9725cb	mac80211: don't open-code LED manipulations We shouldn't open-code led_trigger_blink() or led_trigger_event(), use them instead of badly open-coding them. This also fixes the locking, led_trigger_blink() and led_trigger_event() now use read_lock_irqsave(). Link: https://lore.kernel.org/r/20210616212804.b19ba1c60353.I8ea1b4defd5e12fc20ef281291e602feeec336a6@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:12 +02:00
Johannes Berg	d656a4c6ea	mac80211: consider per-CPU statistics if present If we have been keeping per-CPU statistics, consider them regardless of USES_RSS, because we may not actually fill those, for example in non-fast-RX cases when the connection is not compatible with fast-RX. If we didn't fill them, the additional data will be zero and not affect anything, and if we did fill them then it's more correct to consider them. This fixes an issue in mesh mode where some statistics are not updated due to USES_RSS being set, but fast-RX isn't used. Reported-by: Thiraviyam Mariyappan <tmariyap@codeaurora.org> Link: https://lore.kernel.org/r/20210610220814.13b35f5797c5.I511e9b33c5694e0d6cef4b6ae755c873d7c22124@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:29:10 +02:00
Ping-Ke Shih	9df66d5b9f	cfg80211: fix default HE tx bitrate mask in 2G band In 2G band, a HE sta can only supports HT and HE, but not supports VHT. In this case, default HE tx bitrate mask isn't filled, when we use iw to set bitrates without any parameter. Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20210609075944.51130-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:02:59 +02:00
Shaokun Zhang	057e377af2	mac80211: remove the repeated declaration Function 'ieee80211_sta_set_rx_nss' is declared twice, so remove the repeated declaration. Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Link: https://lore.kernel.org/r/1622196424-62403-1-git-send-email-zhangshaokun@hisilicon.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:02:38 +02:00
Philipp Borgers	4e6c78bdcf	mac80211: refactor rc_no_data_or_no_ack_use_min function Use newly introduced helper function ieee80211_is_tx_data to check if frame is a data frame. Takes into account that hardware encapsulation can be enabled for a frame and therefore no ieee80211 header is present. Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de> Link: https://lore.kernel.org/r/20210519122019.92359-4-borgers@mi.fu-berlin.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:01:49 +02:00
Philipp Borgers	d333322361	mac80211: do not use low data rates for data frames with no ack flag Data Frames with no ack flag set should be handled by the rate controler. Make sure we reach the rate controler by returning early from rate_control_send_low if the frame is a data frame with no ack flag. Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de> Link: https://lore.kernel.org/r/20210519122019.92359-3-borgers@mi.fu-berlin.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:01:41 +02:00
Johannes Berg	4ebdce1dcb	mac80211: simplify ieee80211_add_station() There's no need to do some kind of weird err and RCU dance just use sta_info_insert() directly. Link: https://lore.kernel.org/r/20210517230754.55abd10056c0.I6f5a3b7b23347b2cdaf64e6d5ce1d9e904059654@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:00:26 +02:00
Johannes Berg	f057d14036	mac80211: use sdata->skb_queue for TDLS We need to differentiate these frames since the ones we currently put on the skb_queue_tdls_chsw have already been converted to ethernet format, but now that we've got a single place to enqueue to the sdata->skb_queue this isn't hard. Just differentiate based on protocol and adjust the code to queue the SKBs appropriately. Link: https://lore.kernel.org/r/20210517230754.17034990abef.I5342f2183c0d246b18d36c511eb3b6be298a6572@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:00:17 +02:00
Johannes Berg	07bd1c79c9	mac80211: refactor SKB queue processing a bit This is a very long loop body, move it into its own function instead, keeping only the kcov and free outside in the loop body. Link: https://lore.kernel.org/r/20210517230754.6bc6cdd68570.I28a86ebdb19601ca1965c4dc654cc49fc1064efa@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 11:00:08 +02:00
Johannes Berg	0044cc177f	mac80211: unify queueing SKB to iface We have a bunch of places that open-code the same to queue an SKB to an interface, unify that. Link: https://lore.kernel.org/r/20210517230754.113b65febd5a.Ie0e1d58a2885e75f242cb6e06f3b9660117fef93@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:59:59 +02:00
Philipp Borgers	0edab4ff84	mac80211: minstrel_ht: ignore frame that was sent with noAck flag QoS Data Frames that were sent with a No Ack policy should be ignored by the minstrel statistics. There will never be an Ack for these frames so there is no way to draw conclusions about the success of the transmission. Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de> Link: https://lore.kernel.org/r/20210517120145.132814-1-borgers@mi.fu-berlin.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:59:44 +02:00
Dan Carpenter	5b5c9f3bd5	cfg80211: clean up variable use in cfg80211_parse_colocated_ap() The "ap_info->tbtt_info_len" and "length" variables are the same value but it is confusing how the names are mixed up. Let's use "length" everywhere for consistency. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/YJaMNzZENkYFAYQX@mwanda Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:59:08 +02:00
Johannes Berg	21b7805434	cfg80211: remove CFG80211_MAX_NUM_DIFFERENT_CHANNELS We no longer need to put any limits here, hardware will and mac80211-hwsim can do whatever it likes. The reason we had this was some accounting code (still mentioned in the comment) but that code was deleted in commit `c781944b71` ("cfg80211: Remove unused cfg80211_can_use_iftype_chan()"). Link: https://lore.kernel.org/r/20210506221159.d1d61db1d31c.Iac4da68d54b9f1fdc18a03586bbe06aeb9515425@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:58:56 +02:00
Yang Li	5eae270500	mac80211: Remove redundant assignment to ret Variable 'ret' is set to -ENODEV but this value is never read as it is overwritten with a new value later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: net/mac80211/debugfs_netdev.c:60:2: warning: Value stored to 'ret' is never read [clang-analyzer-deadcode.DeadStores] Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/1619774483-116805-1-git-send-email-yang.lee@linux.alibaba.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:57:55 +02:00
Yang Li	c2a8637c05	net: wireless: wext_compat.c: Remove redundant assignment to ps Variable 'ps' is set to wdev->ps but this value is never read as it is overwritten with a new value later on, hence it is a redundant assignment and can be removed. Cleans up the following clang-analyzer warning: net/wireless/wext-compat.c:1170:7: warning: Value stored to 'ps' during its initialization is never read [clang-analyzer-deadcode.DeadStores] Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/1619603945-116891-1-git-send-email-yang.lee@linux.alibaba.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:57:29 +02:00
Gustavo A. R. Silva	e93bdd7840	wireless: wext-spy: Fix out-of-bounds warning Fix the following out-of-bounds warning: net/wireless/wext-spy.c:178:2: warning: 'memcpy' offset [25, 28] from the object at 'threshold' is out of the bounds of referenced subobject 'low' with type 'struct iw_quality' at offset 20 [-Warray-bounds] The problem is that the original code is trying to copy data into a couple of struct members adjacent to each other in a single call to memcpy(). This causes a legitimate compiler warning because memcpy() overruns the length of &threshold.low and &spydata->spy_thr_low. As these are just a couple of struct members, fix this by using direct assignments, instead of memcpy(). This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). Link: https://github.com/KSPP/linux/issues/109 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210422200032.GA168995@embeddedor Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-23 10:57:17 +02:00
David S. Miller	f4b29d2ee9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Nicolas Dichtel updates MAINTAINERS file to add Netfilter IRC channel. 2) Skip non-IPv6 packets in nft_exthdr. 3) Skip non-TCP packets in nft_osf. 4) Skip non-TCP/UDP packets in nft_tproxy. 5) Memleak in hardware offload infrastructure when counters are used for first time in a rule. 6) The VLAN transfer routine must use FLOW_DISSECTOR_KEY_BASIC instead of FLOW_DISSECTOR_KEY_CONTROL. Moreover, make a more robust check for 802.1q and 802.1ad to restore simple matching on transport protocols. 7) Fix bogus EPERM when listing a ruleset when table ownership flag is set on. 8) Honor table ownership flag when table is referenced by handle. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 16:27:54 -07:00
Huy Nguyen	fa4535238f	net/xfrm: Add inner_ipproto into sec_path The inner_ipproto saves the inner IP protocol of the plain text packet. This allows vendor's IPsec feature making offload decision at skb's features_check and configuring hardware at ndo_start_xmit. For example, ConnectX6-DX IPsec device needs the plaintext's IP protocol to support partial checksum offload on VXLAN/GENEVE packet over IPsec transport mode tunnel. Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Huy Nguyen <huyn@nvidia.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-22 15:24:32 -07:00
Paolo Abeni	fde56eea01	mptcp: refine mptcp_cleanup_rbuf The current cleanup rbuf tries a bit too hard to avoid acquiring the subflow socket lock. We may end-up delaying the needed ack, or skip acking a blocked subflow. Address the above extending the conditions used to trigger the cleanup to reflect more closely what TCP does and invoking tcp_cleanup_rbuf() on all the active subflows. Note that we can't replicate the exact tests implemented in tcp_cleanup_rbuf(), as MPTCP lacks some of the required info - e.g. ping-pong mode. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 14:36:01 -07:00
Geliang Tang	df377be387	mptcp: add deny_join_id0 in mptcp_options_received This patch added a new flag named deny_join_id0 in struct mptcp_options_received. Set it when MP_CAPABLE with the flag MPTCP_CAP_DENYJOIN_ID0 is received. Also add a new flag remote_deny_join_id0 in struct mptcp_pm_data. When the flag deny_join_id0 is set, set this remote_deny_join_id0 flag. In mptcp_pm_create_subflow_or_signal_addr, if the remote_deny_join_id0 flag is set, and the remote address id is zero, stop this connection. Suggested-by: Florian Westphal <fw@strlen.de> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 14:36:01 -07:00
Geliang Tang	bab6b88e05	mptcp: add allow_join_id0 in mptcp_out_options This patch defined a new flag MPTCP_CAP_DENY_JOIN_ID0 for the third bit, labeled "C" of the MP_CAPABLE option. Add a new flag allow_join_id0 in struct mptcp_out_options. If this flag is set, send out the MP_CAPABLE option with the flag MPTCP_CAP_DENY_JOIN_ID0. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 14:36:01 -07:00
Geliang Tang	d2f77960e5	mptcp: add sysctl allow_join_initial_addr_port This patch added a new sysctl, named allow_join_initial_addr_port, to control whether allow peers to send join requests to the IP address and port number used by the initial subflow. Suggested-by: Florian Westphal <fw@strlen.de> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 14:36:01 -07:00
Xin Long	9e47df005c	sctp: process sctp over udp icmp err on sctp side Previously, sctp over udp was using udp tunnel's icmp err process, which only does sk lookup on sctp side. However for sctp's icmp error process, there are more things to do, like syncing assoc pmtu/retransmit packets for toobig type err, and starting proto_unreach_timer for unreach type err etc. Now after adding PLPMTUD, which also requires to process toobig type err on sctp side. This patch is to process icmp err on sctp side by parsing the type/code/info in .encap_err_lookup and call sctp's icmp processing functions. Note as the 'redirect' err process needs to know the outer ip(v6) header's, we have to leave it to udp(v6)_err to handle it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	d83060759a	sctp: extract sctp_v4_err_handle function from sctp_v4_err This patch is to extract sctp_v4_err_handle() from sctp_v4_err() to only handle the icmp err after the sock lookup, and it also makes the code clearer. sctp_v4_err_handle() will be used in sctp over udp's err handling in the following patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	f6549bd37b	sctp: extract sctp_v6_err_handle function from sctp_v6_err This patch is to extract sctp_v6_err_handle() from sctp_v6_err() to only handle the icmp err after the sock lookup, and it also makes the code clearer. sctp_v6_err_handle() will be used in sctp over udp's err handling in the following patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	237a6a2e31	sctp: remove the unessessary hold for idev in sctp_v6_err Same as in tcp_v6_err() and __udp6_lib_err(), there's no need to hold idev in sctp_v6_err(), so just call __in6_dev_get() instead. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	7307e4fa4d	sctp: enable PLPMTUD when the transport is ready sctp_transport_pl_reset() is called whenever any of these 3 members in transport is changed: - probe_interval - param_flags & SPP_PMTUD_ENABLE - state == ACTIVE If all are true, start the PLPMTUD when it's not yet started. If any of these is false, stop the PLPMTUD when it's already running. sctp_transport_pl_update() is called when the transport dst has changed. It will restart the PLPMTUD probe. Again, the pathmtu won't change but use the dst's mtu until the Search phase is done. Note that after using PLPMTUD, the pathmtu is only initialized with the dst mtu when the transport dst changes. At other time it is updated by pl.pmtu. So sctp_transport_pmtu_check() will be called only when PLPMTUD is disabled in sctp_packet_config(). After this patch, the PLPMTUD feature from RFC8899 will be activated and can be used by users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	8369640831	sctp: do state transition when receiving an icmp TOOBIG packet PLPMTUD will short-circuit the old process for icmp TOOBIG packets. This part is described in rfc8899#section-4.6.2 (PL_PTB_SIZE = PTB_SIZE - other_headers_len). Note that from rfc8899#section-5.2 State Machine, each case below is for some specific states only: a) PL_PTB_SIZE < MIN_PLPMTU \|\| PL_PTB_SIZE >= PROBED_SIZE, discard it, for any state b) MIN_PLPMTU < PL_PTB_SIZE < BASE_PLPMTU, Base -> Error, for Base state c) BASE_PLPMTU <= PL_PTB_SIZE < PLPMTU, Search -> Base or Complete -> Base, for Search and Complete states. d) PLPMTU < PL_PTB_SIZE < PROBED_SIZE, set pl.probe_size to PL_PTB_SIZE then verify it, for Search state. The most important one is case d), which will help find the optimal fast during searching. Like when pathmtu = 1392 for SCTP over IPv4, the search will be (20 is iphdr_len): 1. probe with 1200 - 20 2. probe with 1232 - 20 3. probe with 1264 - 20 ... 7. probe with 1388 - 20 8. probe with 1420 - 20 When sending the probe with 1420 - 20, TOOBIG may come with PL_PTB_SIZE = 1392 - 20. Then it matches case d), and saves some rounds to try with the 1392 - 20 probe. But of course, PLPMTUD doesn't trust TOOBIG packets, and it will go back to the common searching once the probe with the new size can't be verified. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	b87641aff9	sctp: do state transition when a probe succeeds on HB ACK recv path As described in rfc8899#section-5.2, when a probe succeeds, there might be the following state transitions: - Base -> Search, occurs when probe succeeds with BASE_PLPMTU, pl.pmtu is not changing, pl.probe_size increases by SCTP_PL_BIG_STEP, - Error -> Search, occurs when probe succeeds with BASE_PLPMTU, pl.pmtu is changed from SCTP_MIN_PLPMTU to SCTP_BASE_PLPMTU, pl.probe_size increases by SCTP_PL_BIG_STEP. - Search -> Search Complete, occurs when probe succeeds with the probe size SCTP_MAX_PLPMTU less than pl.probe_high, pl.pmtu is not changing, but update pathmtu with it, pl.probe_size is set back to pl.pmtu to double check it. - Search Complete -> Search, occurs when probe succeeds with the probe size equal to pl.pmtu, pl.pmtu is not changing, pl.probe_size increases by SCTP_PL_MIN_STEP. So search process can be described as: 1. When it just enters 'Search' state, pathmtu is not updated with pl.pmtu, and probe_size increases by a big step (SCTP_PL_BIG_STEP) each round. 2. Until pl.probe_high is set when a probe fails, and probe_size decreases back to pl.pmtu, as described in the last patch. 3. When the probe with the new size succeeds, probe_size changes to increase by a small step (SCTP_PL_MIN_STEP) due to pl.probe_high is set. 4. Until probe_size is next to pl.probe_high, the searching finishes and it goes to 'Complete' state and updates pathmtu with pl.pmtu, and then probe_size is set to pl.pmtu to confirm by once more probe. 5. This probe occurs after "30 * probe_inteval", a much longer time than that in Search state. Once it is done it goes to 'Search' state again with probe_size increased by SCTP_PL_MIN_STEP. As we can see above, during the searching, pl.pmtu changes while pathmtu doesn't. pathmtu is only updated when the search finishes by which it gets an optimal value for it. A big step is used at the beginning until it gets close to the optimal value, then it changes to a small step until it has this optimal value. The small step is also used in 'Complete' until it goes to 'Search' state again and the probe with 'pmtu + the small step' succeeds, which means a higher size could be used. Then probe_size changes to increase by a big step again until it gets close to the next optimal value. Note that anytime when black hole is detected, it goes directly to 'Base' state with pl.pmtu set to SCTP_BASE_PLPMTU, as described in the last patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	1dc68c1945	sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path The state transition is described in rfc8899#section-5.2, PROBE_COUNT == MAX_PROBES means the probe fails for MAX times, and the state transition includes: - Base -> Error, occurs when BASE_PLPMTU Confirmation Fails, pl.pmtu is set to SCTP_MIN_PLPMTU, probe_size is still SCTP_BASE_PLPMTU; - Search -> Base, occurs when Black Hole Detected, pl.pmtu is set to SCTP_BASE_PLPMTU, probe_size is set back to SCTP_BASE_PLPMTU; - Search Complete -> Base, occurs when Black Hole Detected pl.pmtu is set to SCTP_BASE_PLPMTU, probe_size is set back to SCTP_BASE_PLPMTU; Note a black hole is encountered when a sender is unaware that packets are not being delivered to the destination endpoint. So it includes the probe failures with equal probe_size to pl.pmtu, and definitely not include that with greater probe_size than pl.pmtu. The later one is the normal probe failure where probe_size should decrease back to pl.pmtu and pl.probe_high is set. pl.probe_high would be used on HB ACK recv path in the next patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	fe59379b9a	sctp: do the basic send and recv for PLPMTUD probe This patch does exactly what rfc8899#section-6.2.1.2 says: The SCTP sender needs to be able to determine the total size of a probe packet. The HEARTBEAT chunk could carry a Heartbeat Information parameter that includes, besides the information suggested in [RFC4960], the probe size to help an implementation associate a HEARTBEAT ACK with the size of probe that was sent. The sender could also use other methods, such as sending a nonce and verifying the information returned also contains the corresponding nonce. The length of the PAD chunk is computed by reducing the probing size by the size of the SCTP common header and the HEARTBEAT chunk. Note that HB ACK chunk will carry back whatever HB chunk carried, including the probe_size we put it in; We also check hbinfo->probe_size in the HB ACK against link->pl.probe_size to validate this HB ACK chunk. v1->v2: - Remove the unused 'sp' and add static for sctp_packet_bundle_pad(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	92548ec2f1	sctp: add the probe timer in transport for PLPMTUD There are 3 timers described in rfc8899#section-5.1.1: PROBE_TIMER, PMTU_RAISE_TIMER, CONFIRMATION_TIMER This patches adds a 'probe_timer' in transport, and it works as either PROBE_TIMER or PMTU_RAISE_TIMER. At most time, it works as PROBE_TIMER and expires every a 'probe_interval' time to send the HB probe packet. When transport pl enters COMPLETE state, it works as PMTU_RAISE_TIMER and expires in 'probe_interval * 30' time to go back to SEARCH state and do searching again. SCTP HB is an acknowledged packet, CONFIRMATION_TIMER is not needed. The timer will start when transport pl enters BASE state and stop when it enters DISABLED state. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:52 -07:00
Xin Long	3190b649b4	sctp: add SCTP_PLPMTUD_PROBE_INTERVAL sockopt for sock/asoc/transport With this socket option, users can change probe_interval for a transport, asoc or sock after it's created. Note that if the change is for an asoc, also apply the change to each transport in this asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:51 -07:00
Xin Long	d1e462a7a5	sctp: add probe_interval in sysctl and sock/asoc/transport PLPMTUD can be enabled by doing 'sysctl -w net.sctp.probe_interval=n'. 'n' is the interval for PLPMTUD probe timer in milliseconds, and it can't be less than 5000 if it's not 0. All asoc/transport's PLPMTUD in a new socket will be enabled by default. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:51 -07:00
Xin Long	745a32117b	sctp: add pad chunk and its make function and event table This chunk is defined in rfc4820#section-3, and used to pad an SCTP packet. The receiver must discard this chunk and continue processing the rest of the chunks in the packet. Add it now, as it will be bundled with a heartbeat chunk to probe pmtu in the following patches. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 11:28:51 -07:00
Aaron Conole	c4ab7b56be	openvswitch: add trace points This makes openvswitch module use the event tracing framework to log the upcall interface and action execution pipeline. When using openvswitch as the packet forwarding engine, some types of debugging are made possible simply by using the ovs-vswitchd's ofproto/trace command. However, such a command has some limitations: 1. When trying to trace packets that go through the CT action, the state of the packet can't be determined, and probably would be potentially wrong. 2. Deducing problem packets can sometimes be difficult as well even if many of the flows are known 3. It's possible to use the openvswitch module even without the ovs-vswitchd (although, not common use). Introduce the event tracing points here to make it possible for working through these problems in kernel space. The style is copied from the mac80211 driver-trace / trace code for consistency - this creates some checkpatch splats, but the official 'guide' for adding tracepoints, as well as the existing examples all add the same splats so it seems acceptable. Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:47:32 -07:00
Ido Schimmel	88f9a87afe	ethtool: Validate module EEPROM offset as part of policy Validate the offset to read from module EEPROM as part of the netlink policy and remove the corresponding check from the code. This also makes it possible to query the offset range from user space: $ genl ctrl policy name ethtool ... ID: 0x14 policy[32]:attr[2]: type=U32 range:[0,255] ... Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:40:54 -07:00
Ido Schimmel	0dc7dd02ba	ethtool: Validate module EEPROM length as part of policy Validate the number of bytes to read from the module EEPROM as part of the netlink policy and remove the corresponding check from the code. This also makes it possible to query the length range from user space: $ genl ctrl policy name ethtool ... ID: 0x14 policy[32]:attr[3]: type=U32 range:[1,128] ... Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:40:54 -07:00
Ido Schimmel	f5fe211d13	ethtool: Decrease size of module EEPROM get policy array The 'ETHTOOL_A_MODULE_EEPROM_DATA' attribute is not part of the get request. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:40:54 -07:00
gushengxian	98534fce52	bridge: cfm: remove redundant return Return statements are not needed in Void function. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:35:15 -07:00
Miao Wang	c69f114d09	net/ipv4: swap flow ports when validating source When doing source address validation, the flowi4 struct used for fib_lookup should be in the reverse direction to the given skb. fl4_dport and fl4_sport returned by fib4_rules_early_flow_dissect should thus be swapped. Fixes: `5a847a6e14` ("net/ipv4: Initialize proto and ports in flow struct") Signed-off-by: Miao Wang <shankerwangmiao@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:33:04 -07:00
Jakub Kicinski	a6e3f2985a	ip6_tunnel: fix GRE6 segmentation Commit `6c11fbf97e` ("ip6_tunnel: add MPLS transmit support") moved assiging inner_ipproto down from ipxip6_tnl_xmit() to its callee ip6_tnl_xmit(). The latter is also used by GRE. Since commit `3872035241` ("gre: Use inner_proto to obtain inner header protocol") GRE had been depending on skb->inner_protocol during segmentation. It sets it in gre_build_header() and reads it in gre_gso_segment(). Changes to ip6_tnl_xmit() overwrite the protocol, resulting in GSO skbs getting dropped. Note that inner_protocol is a union with inner_ipproto, GRE uses the former while the change switched it to the latter (always setting it to just IPPROTO_GRE). Restore the original location of skb_set_inner_ipproto(), it is unclear why it was moved in the first place. Fixes: `6c11fbf97e` ("ip6_tunnel: add MPLS transmit support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:30:05 -07:00
Paolo Abeni	597dbae77e	mptcp: drop duplicate mptcp_setsockopt() declaration commit `7896248983` ("mptcp: add skeleton to sync msk socket options to subflows") introduced a duplicate declaration of mptcp_setsockopt(), just drop it. Reported-by: Florian Westphal <fw@strlen.de> Fixes: `7896248983` ("mptcp: add skeleton to sync msk socket options to subflows") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:22:42 -07:00
Paolo Abeni	490274b474	mptcp: avoid race on msk state changes The msk socket state is currently updated in a few spots without owning the msk socket lock itself. Some of such operations are safe, as they happens before exposing the msk socket to user-space and can't race with other changes. A couple of them, at connect time, can actually race with close() or shutdown(), leaving breaking the socket state machine. This change addresses the issue moving such update under the msk socket lock with the usual: <acquire spinlock> <check sk lock onwers> <ev defer to release_cb> scheme. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/56 Fixes: `8fd738049a` ("mptcp: fallback in case of simultaneous connect") Fixes: `c3c123d16c` ("net: mptcp: don't hang in mptcp_sendmsg() after TCP fallback") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:22:42 -07:00
Bui Quang Minh	7dd5d437c2	bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc In 32-bit architecture, the result of sizeof() is a 32-bit integer so the expression becomes the multiplication between 2 32-bit integer which can potentially leads to integer overflow. As a result, bpf_map_area_alloc() allocates less memory than needed. Fix this by casting 1 operand to u64. Fixes: `0d2c4f9640` ("bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps") Fixes: `99c51064fb` ("devmap: Use bpf_map_area_alloc() for allocating hash buckets") Fixes: `546ac1ffb7` ("bpf: add devmap, a map for storing net device references") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210613143440.71975-1-minhquangbui99@gmail.com	2021-06-22 10:14:29 -07:00
Paolo Abeni	06285da96a	mptcp: add MIB counter for invalid mapping Account this exceptional events for better introspection. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 09:57:45 -07:00
Paolo Abeni	8cfc47fc2e	mptcp: drop redundant test in move_skbs_to_msk() Currently we check the msk state to avoid enqueuing new skbs at msk shutdown time. Such test is racy - as we can't acquire the msk socket lock - and useless, as the caller already checked the subflow field 'disposable', covering the same scenario in a race free manner - read and updated under the ssk socket lock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 09:57:45 -07:00
Paolo Abeni	3c90e377a1	mptcp: don't clear MPTCP_DATA_READY in sk_wait_event() If we don't flush entirely the receive queue, we need set again such bit later. We can simply avoid clearing it. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 09:57:45 -07:00
Paolo Abeni	75e908c336	mptcp: use fast lock for subflows when possible There are a bunch of callsite where the ssk socket lock is acquired using the full-blown version eligible for the fast variant. Let's move to the latter. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 09:57:45 -07:00
Paolo Abeni	8ce568ed06	mptcp: drop tx skb cache The mentioned cache was introduced to reduce the number of skb allocation in atomic context, but the required complexity is excessive. This change remove the mentioned cache. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 09:57:45 -07:00
Eric Dumazet	64295f0d01	virtio/vsock: avoid NULL deref in virtio_transport_seqpacket_allow() Make sure the_virtio_vsock is not NULL before dereferencing it. general protection fault, probably for non-canonical address 0xdffffc0000000071: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000388-0x000000000000038f] CPU: 0 PID: 8452 Comm: syz-executor406 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:virtio_transport_seqpacket_allow+0xbf/0x210 net/vmw_vsock/virtio_transport.c:503 Code: e8 c6 d9 ab f8 84 db 0f 84 0f 01 00 00 e8 09 d3 ab f8 48 8d bd 88 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 2a 01 00 00 44 0f b6 a5 88 03 00 00 RSP: 0018:ffffc90003757c18 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000071 RSI: ffffffff88c908e7 RDI: 0000000000000388 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff88c90a06 R11: 0000000000000000 R12: 0000000000000000 R13: ffffffff88c90840 R14: 0000000000000000 R15: 0000000000000001 FS: 0000000001bee300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000082 CR3: 000000002847e000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: vsock_assign_transport+0x575/0x700 net/vmw_vsock/af_vsock.c:490 vsock_connect+0x200/0xc00 net/vmw_vsock/af_vsock.c:1337 __sys_connect_file+0x155/0x1a0 net/socket.c:1824 __sys_connect+0x161/0x190 net/socket.c:1841 __do_sys_connect net/socket.c:1851 [inline] __se_sys_connect net/socket.c:1848 [inline] __x64_sys_connect+0x6f/0xb0 net/socket.c:1848 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x43ee69 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd49e7c788 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000400488 RCX: 000000000043ee69 RDX: 0000000000000010 RSI: 0000000020000080 RDI: 0000000000000003 RBP: 0000000000402e50 R08: 0000000000000000 R09: 0000000000400488 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402ee0 R13: 0000000000000000 R14: 00000000004ac018 R15: 0000000000400488 Fixes: `53efbba12c` ("virtio/vsock: enable SEQPACKET for transport") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 09:49:37 -07:00
Pablo Neira Ayuso	e31f072ffa	netfilter: nf_tables: do not allow to delete table with owner by handle nft_table_lookup_byhandle() also needs to validate the netlink PortID owner when deleting a table by handle. Fixes: `6001a930ce` ("netfilter: nftables: introduce table ownership") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-22 12:15:05 +02:00
Pablo Neira Ayuso	534799097a	netfilter: nf_tables: skip netlink portID validation if zero nft_table_lookup() allows us to obtain the table object by the name and the family. The netlink portID validation needs to be skipped for the dump path, since the ownership only applies to commands to update the given table. Skip validation if the specified netlink PortID is zero when calling nft_table_lookup(). Fixes: `6001a930ce` ("netfilter: nftables: introduce table ownership") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-22 12:08:04 +02:00
Ayush Sawal	dd72fadf21	xfrm: Fix xfrm offload fallback fail case In case of xfrm offload, if xdo_dev_state_add() of driver returns -EOPNOTSUPP, xfrm offload fallback is failed. In xfrm state_add() both xso->dev and xso->real_dev are initialized to dev and when err(-EOPNOTSUPP) is returned only xso->dev is set to null. So in this scenario the condition in func validate_xmit_xfrm(), if ((x->xso.dev != dev) && (x->xso.real_dev == dev)) return skb; returns true, due to which skb is returned without calling esp_xmit() below which has fallback code. Hence the CRYPTO_FALLBACK is failing. So fixing this with by keeping x->xso.real_dev as NULL when err is returned in func xfrm_dev_state_add(). Fixes: `bdfd2d1fa7` ("bonding/xfrm: use real_dev instead of slave_dev") Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-22 09:08:15 +02:00
Eric Dumazet	0cd58e5c53	pkt_sched: sch_qfq: fix qfq_change_class() error path If qfq_change_class() is unable to allocate memory for qfq_aggregate, it frees the class that has been inserted in the class hash table, but does not unhash it. Defer the insertion after the problematic allocation. BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:884 [inline] BUG: KASAN: use-after-free in qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731 Write of size 8 at addr ffff88814a534f10 by task syz-executor.4/31478 CPU: 0 PID: 31478 Comm: syz-executor.4 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:233 __kasan_report mm/kasan/report.c:419 [inline] kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:436 hlist_add_head include/linux/list.h:884 [inline] qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731 qfq_change_class+0x96c/0x1990 net/sched/sch_qfq.c:489 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x4665d9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fdc7b5f0188 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9 RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003 RBP: 00007fdc7b5f01d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 00007ffcf7310b3f R14: 00007fdc7b5f0300 R15: 0000000000022000 Allocated by task 31445: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:428 [inline] ____kasan_kmalloc mm/kasan/common.c:507 [inline] ____kasan_kmalloc mm/kasan/common.c:466 [inline] __kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:516 kmalloc include/linux/slab.h:556 [inline] kzalloc include/linux/slab.h:686 [inline] qfq_change_class+0x705/0x1990 net/sched/sch_qfq.c:464 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 31445: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:357 ____kasan_slab_free mm/kasan/common.c:360 [inline] ____kasan_slab_free mm/kasan/common.c:325 [inline] __kasan_slab_free+0xfb/0x130 mm/kasan/common.c:368 kasan_slab_free include/linux/kasan.h:212 [inline] slab_free_hook mm/slub.c:1583 [inline] slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1608 slab_free mm/slub.c:3168 [inline] kfree+0xe5/0x7f0 mm/slub.c:4212 qfq_change_class+0x10fb/0x1990 net/sched/sch_qfq.c:518 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88814a534f00 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 16 bytes inside of 128-byte region [ffff88814a534f00, ffff88814a534f80) The buggy address belongs to the page: page:ffffea0005294d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a534 flags: 0x57ff00000000200(slab\|node=1\|zone=2\|lastcpupid=0x7ff) raw: 057ff00000000200 ffffea00004fee00 0000000600000006 ffff8880110418c0 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL\|__GFP_NOWARN\|__GFP_NORETRY), pid 29797, ts 604817765317, free_ts 604810151744 prep_new_page mm/page_alloc.c:2358 [inline] get_page_from_freelist+0x1033/0x2b60 mm/page_alloc.c:3994 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5200 alloc_pages+0x18c/0x2a0 mm/mempolicy.c:2272 alloc_slab_page mm/slub.c:1646 [inline] allocate_slab+0x2c5/0x4c0 mm/slub.c:1786 new_slab mm/slub.c:1849 [inline] new_slab_objects mm/slub.c:2595 [inline] ___slab_alloc+0x4a1/0x810 mm/slub.c:2758 __slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2798 slab_alloc_node mm/slub.c:2880 [inline] slab_alloc mm/slub.c:2922 [inline] __kmalloc+0x315/0x330 mm/slub.c:4050 kmalloc include/linux/slab.h:561 [inline] kzalloc include/linux/slab.h:686 [inline] __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318 mpls_dev_sysctl_register+0x1b7/0x2d0 net/mpls/af_mpls.c:1421 mpls_add_dev net/mpls/af_mpls.c:1472 [inline] mpls_dev_notify+0x214/0x8b0 net/mpls/af_mpls.c:1588 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83 call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:2121 call_netdevice_notifiers_extack net/core/dev.c:2133 [inline] call_netdevice_notifiers net/core/dev.c:2147 [inline] register_netdevice+0x106b/0x1500 net/core/dev.c:10312 veth_newlink+0x585/0xac0 drivers/net/veth.c:1547 __rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3452 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1298 [inline] free_pcp_prepare+0x223/0x300 mm/page_alloc.c:1342 free_unref_page_prepare mm/page_alloc.c:3250 [inline] free_unref_page+0x12/0x1d0 mm/page_alloc.c:3298 __vunmap+0x783/0xb60 mm/vmalloc.c:2566 free_work+0x58/0x70 mm/vmalloc.c:80 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422 kthread+0x3b1/0x4a0 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Memory state around the buggy address: ffff88814a534e00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88814a534e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88814a534f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88814a534f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88814a535000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: `462dbc9101` ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 14:50:19 -07:00
Boris Sukholitko	6d5516177d	Revert "net/sched: cls_flower: Remove match on n_proto" This reverts commit `0dca2c7404`. The commit in question breaks hardware offload of flower filters. Quoting Vladimir Oltean <olteanv@gmail.com>: fl_hw_replace_filter() and fl_reoffload() create a struct flow_cls_offload with a rule->match.mask member derived from the mask of the software classifier: &f->mask->key - that same mask that is used for initializing the flow dissector keys, and the one from which Boris removed the basic.n_proto member because it was bothering him. Reported-by: Vadym Kochan <vadym.kochan@plvision.eu> Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 14:46:36 -07:00
Longpeng(Mike)	c7ff9cff70	vsock: notify server to shutdown when client has pending signal The client's sk_state will be set to TCP_ESTABLISHED if the server replay the client's connect request. However, if the client has pending signal, its sk_state will be set to TCP_CLOSE without notify the server, so the server will hold the corrupt connection. client server 1. sk_state=TCP_SYN_SENT \| 2. call ->connect() \| 3. wait reply \| \| 4. sk_state=TCP_ESTABLISHED \| 5. insert to connected list \| 6. reply to the client 7. sk_state=TCP_ESTABLISHED \| 8. insert to connected list \| 9. signal pending <--------------------- the user kill client 10. sk_state=TCP_CLOSE \| client is exiting... \| 11. call ->release() \| virtio_transport_close if (!(sk->sk_state == TCP_ESTABLISHED \|\| sk->sk_state == TCP_CLOSING)) return true; return at here, the server cannot notice the connection is corrupt So the client should notify the peer in this case. Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jorgen Hansen <jhansen@vmware.com> Cc: Norbert Slusarek <nslusarek@gmx.net> Cc: Andra Paraschiv <andraprs@amazon.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Brazdil <dbrazdil@google.com> Cc: Alexander Popov <alex.popov@linux.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lkml.org/lkml/2021/5/17/418 Signed-off-by: lixianming <lixianming5@huawei.com> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 14:43:28 -07:00
Yejune Deng	fe0bdbde07	net: add pf_family_names[] for protocol family Modify the pr_info content from int to char * in sock_register() and sock_unregister(), this looks more readable. Fixed build error in ARCH=sparc64. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 14:41:54 -07:00
Paolo Abeni	5957a8901d	mptcp: fix 32 bit DSN expansion The current implementation of 32 bit DSN expansion is buggy. After the previous patch, we can simply reuse the newly introduced helper to do the expansion safely. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/120 Fixes: `648ef4b886` ("mptcp: Implement MPTCP receive path") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 14:21:28 -07:00
Paolo Abeni	1502328f17	mptcp: fix bad handling of 32 bit ack wrap-around When receiving 32 bits DSS ack from the peer, the MPTCP need to expand them to 64 bits value. The current code is buggy WRT detecting 32 bits ack wrap-around: when the wrap-around happens the current unsigned 32 bit ack value is lower than the previous one. Additionally check for possible reverse wrap and make the helper visible, so that we could re-use it for the next patch. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/204 Fixes: `cc9d256698` ("mptcp: update per unacked sequence on pkt reception") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 14:21:27 -07:00
Pablo Neira Ayuso	ea45fdf82c	netfilter: nf_tables_offload: check FLOW_DISSECTOR_KEY_BASIC in VLAN transfer logic The VLAN transfer logic should actually check for FLOW_DISSECTOR_KEY_BASIC, not FLOW_DISSECTOR_KEY_CONTROL. Moreover, do not fallback to case 2) .n_proto is set to 802.1q or 802.1ad, if FLOW_DISSECTOR_KEY_BASIC is unset. Fixes: `783003f3bb` ("netfilter: nftables_offload: special ethertype handling for VLAN") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-21 22:26:19 +02:00
Pablo Neira Ayuso	3c5e446220	netfilter: nf_tables: memleak in hw offload abort path Release flow from the abort path, this is easy to reproduce since `b72920f6e4` ("netfilter: nftables: counter hardware offload support"). If the preparation phase fails, then the abort path is exercised without releasing the flow rule object. unreferenced object 0xffff8881f0fa7700 (size 128): comm "nft", pid 1335, jiffies 4294931120 (age 4163.740s) hex dump (first 32 bytes): 08 e4 de 13 82 88 ff ff 98 e4 de 13 82 88 ff ff ................ 48 e4 de 13 82 88 ff ff 01 00 00 00 00 00 00 00 H............... backtrace: [<00000000634547e7>] flow_rule_alloc+0x26/0x80 [<00000000c8426156>] nft_flow_rule_create+0xc9/0x3f0 [nf_tables] [<0000000075ff8e46>] nf_tables_newrule+0xc79/0x10a0 [nf_tables] [<00000000ba65e40e>] nfnetlink_rcv_batch+0xaac/0xf90 [nfnetlink] [<00000000505c614a>] nfnetlink_rcv+0x1bb/0x1f0 [nfnetlink] [<00000000eb78e1fe>] netlink_unicast+0x34b/0x480 [<00000000a8f72c94>] netlink_sendmsg+0x3af/0x690 [<000000009cb1ddf4>] sock_sendmsg+0x96/0xa0 [<0000000039d06e44>] ____sys_sendmsg+0x3fe/0x440 [<00000000137e82ca>] ___sys_sendmsg+0xd8/0x140 [<000000000c6bf6a6>] __sys_sendmsg+0xb3/0x130 [<0000000043bd6268>] do_syscall_64+0x40/0xb0 [<00000000afdebc2d>] entry_SYSCALL_64_after_hwframe+0x44/0xae Remove flow rule release from the offload commit path, otherwise error from the offload commit phase might trigger a double-free due to the execution of the abort_offload -> abort. After this patch, the abort path takes care of releasing the flow rule. This fix also needs to move the nft_flow_rule_create() call before the transaction object is added otherwise the abort path might find a NULL pointer to the flow rule object for the NFT_CHAIN_HW_OFFLOAD case. While at it, rename BASIC-like goto tags to slightly more meaningful names rather than adding a new "err3" tag. Fixes: `63b48c73ff` ("netfilter: nf_tables_offload: undo updates if transaction fails") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-21 22:23:38 +02:00
Dan Carpenter	24610ed80d	netfilter: nfnetlink_hook: fix check for snprintf() overflow The kernel version of snprintf() can't return negatives. The "ret > (int)sizeof(sym)" check is off by one because and it should be >=. Finally, we need to set a negative error code. Fixes: `e2cf17d377` ("netfilter: add new hook nfnl subsystem") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-21 22:05:29 +02:00
Vladimir Oltean	f9bcdc362c	net: dsa: remove cross-chip support from the MRP notifiers With MRP hardware assist being supported only by the ocelot switch family, which by design does not support cross-chip bridging, the current match functions are at best a guess and have not been confirmed in any way to do anything relevant in a multi-switch topology. Drop the code and make the notifiers match only on the targeted switch port. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:50:20 -07:00
Vladimir Oltean	88faba20e2	net: dsa: targeted MTU notifiers should only match on one port dsa_slave_change_mtu() calls dsa_port_mtu_change() twice: - it sends a cross-chip notifier with the MTU of the CPU port which is used to update the DSA links. - it sends one targeted MTU notifier which is supposed to only match the user port on which we are changing the MTU. The "propagate_upstream" variable is used here to bypass the cross-chip notifier system from switch.c But due to a mistake, the second, targeted notifier matches not only on the user port, but also on the DSA link which is a member of the same switch, if that exists. And because the DSA links of the entire dst were programmed in a previous round to the largest_mtu via a "propagate_upstream == true" notification, then the dsa_port_mtu_change(propagate_upstream == false) call that is immediately upcoming will break the MTU on the one DSA link which is chip-wise local to the dp whose MTU is changing right now. Example given this daisy chain topology: sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ cpu ] [ user ] [ user ] [ dsa ] [ user ] [ x ] [ ] [ ] [ x ] [ ] \| +---------+ \| sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] ip link set sw0p1 mtu 9000 ip link set sw1p1 mtu 9000 # at this stage, sw0p1 and sw1p1 can talk # to one another using jumbo frames ip link set sw0p2 mtu 1500 # this programs the sw0p3 DSA link first to # the largest_mtu of 9000, then reprograms it to # 1500 with the "propagate_upstream == false" # notifier, breaking communication between # sw0p1 and sw1p1 To escape from this situation, make the targeted match really match on a single port - the user port, and rename the "propagate_upstream" variable to "targeted_match" to clarify the intention and avoid future issues. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:50:20 -07:00
Vladimir Oltean	4e4ab79500	net: dsa: calculate the largest_mtu across all ports in the tree If we have a cross-chip topology like this: sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ cpu ] [ user ] [ user ] [ dsa ] [ user ] \| +---------+ \| sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] and we issue the following commands: 1. ip link set sw0p1 mtu 1700 2. ip link set sw1p1 mtu 1600 we notice the following happening: Command 1. emits a non-targeted MTU notifier for the CPU port (sw0p0) with the largest_mtu calculated across switch 0, of 1700. This matches sw0p0, sw0p3 and sw1p4 (all CPU ports and DSA links). Then, it emits a targeted MTU notifier for the user port (sw0p1), again with MTU 1700 (this doesn't matter). Command 2. emits a non-targeted MTU notifier for the CPU port (sw0p0) with the largest_mtu calculated across switch 1, of 1600. This matches the same group of ports as above, and decreases the MTU for the CPU port and the DSA links from 1700 to 1600. As a result, the sw0p1 user port can no longer communicate with its CPU port at MTU 1700. To address this, we should calculate the largest_mtu across all switches that may share a CPU port, and only emit MTU notifiers with that value. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:50:20 -07:00
Vladimir Oltean	abd49535c3	net: dsa: execute dsa_switch_mdb_add only for routing port in cross-chip topologies Currently, the notifier for adding a multicast MAC address matches on the targeted port and on all DSA links in the system, be they upstream or downstream links. This leads to a considerable amount of useless traffic. Consider this daisy chain topology, and a MDB add notifier emitted on sw0p0. It matches on sw0p0, sw0p3, sw1p3 and sw2p4. sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ x ] [ ] \| +---------+ \| sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ x ] [ x ] \| +---------+ \| sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] But switch 0 has no reason to send the multicast traffic for that MAC address on sw0p3, which is how it reaches switches 1 and 2. Those switches don't expect, according to the user configuration, to receive this multicast address from switch 1, and they will drop it anyway, because the only valid destination is the port they received it on. They only need to configure themselves to deliver that multicast address _towards_ switch 1, where the MDB entry is installed. Similarly, switch 1 should not send this multicast traffic towards sw1p3, because that is how it reaches switch 2. With this change, the heat map for this MDB notifier changes as follows: sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] \| +---------+ \| sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] \| +---------+ \| sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] Now the mdb notifier behaves the same as the fdb notifier. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:50:20 -07:00
Vladimir Oltean	a8986681cc	net: dsa: export the dsa_port_is_{user,cpu,dsa} helpers The difference between dsa_is_user_port and dsa_port_is_user is that the former needs to look up the list of ports of the DSA switch tree in order to find the struct dsa_port, while the latter directly receives it as an argument. dsa_is_user_port is already in widespread use and has its place, so there isn't any chance of converting all callers to a single form. But being able to do: dsa_port_is_user(dp) instead of dsa_is_user_port(dp->ds, dp->index) is much more efficient too, especially when the "dp" comes from an iterator over the DSA switch tree - this reduces the complexity from quadratic to linear. Move these helpers from dsa2.c to include/net/dsa.h so that others can use them too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:50:20 -07:00
Vladimir Oltean	8674f8d310	net: dsa: assert uniqueness of dsa,member properties The cross-chip notifiers work by comparing each ds->index against the info->sw_index value from the notifier. The ds->index is retrieved from the device tree dsa,member property. If a single tree cross-chip topology does not declare unique switch IDs, this will result in hard-to-debug issues/voodoo effects such as the cross-chip notifier for one switch port also matching the port with the same number from another switch. Check in dsa_switch_parse_member_of() whether the DSA switch tree contains a DSA switch with the index we're preparing to add, before actually adding it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:50:20 -07:00
Jakub Kicinski	d452d48b9f	tls: prevent oversized sendfile() hangs by ignoring MSG_MORE We got multiple reports that multi_chunk_sendfile test case from tls selftest fails. This was sort of expected, as the original fix was never applied (see it in the first Link:). The test in question uses sendfile() with count larger than the size of the underlying file. This will make splice set MSG_MORE on all sendpage calls, meaning TLS will never close and flush the last partial record. Eric seem to have addressed a similar problem in commit `35f9c09fe9` ("tcp: tcp_sendpages() should call tcp_push() once") by introducing MSG_SENDPAGE_NOTLAST. Unlike MSG_MORE MSG_SENDPAGE_NOTLAST is not set on the last call of a "pipefull" of data (PIPE_DEF_BUFFERS == 16, so every 16 pages or whenever we run out of data). Having a break every 16 pages should be fine, TLS can pack exactly 4 pages into a record, so for aligned reads there should be no difference, unaligned may see one extra record per sendpage(). Sticking to TCP semantics seems preferable to modifying splice, but we can revisit it if real life scenarios show a regression. Reported-by: Vadim Fedorenko <vfedorenko@novek.ru> Reported-by: Seth Forshee <seth.forshee@canonical.com> Link: https://lore.kernel.org/netdev/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ Fixes: `3c4d755915` ("tls: kernel TLS support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:39:08 -07:00
Al Viro	be752283a2	__unix_find_socket_byname(): don't pass hash and type separately We only care about exclusive or of those, so pass that directly. Makes life simpler for callers as well... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	c0c3b8d380	unix_bind_bsd(): unlink if we fail after successful mknod We can do that more or less safely, since the parent is held locked all along. Yes, somebody might observe the object via dcache, only to have it disappear afterwards, but there's really no good way to prevent that. It won't race with other bind(2) or attempts to move the sucker elsewhere, or put something else in its place - locked parent prevents that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	56c1731b28	unix_bind_bsd(): move done_path_create() call after dealing with ->bindlock Final preparations for doing unlink on failure past the successful mknod. We can't hold ->bindlock over ->mknod() or ->unlink(), since either might do sb_start_write() (e.g. on overlayfs). However, we can do it while holding filesystem and VFS locks - doing kern_path_create() vfs_mknod() grab ->bindlock if u->addr had been set drop ->bindlock done_path_create return -EINVAL else assign the address to socket drop ->bindlock done_path_create return 0 would be deadlock-free. Here we massage unix_bind_bsd() to that form. We are still doing equivalent transformations. Next commit will not be an equivalent transformation - it will add a call of vfs_unlink() before done_path_create() in "alread bound" case. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	71e6be6f7d	fold unix_mknod() into unix_bind_bsd() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	fa42d910a3	unix_bind(): take BSD and abstract address cases into new helpers unix_bind_bsd() and unix_bind_abstract() respectively. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	aee5151705	unix_bind(): separate BSD and abstract cases We do get some duplication that way, but it's minor compared to parts that are different. What we get is an ability to change locking in BSD case without making failure exits very hard to follow. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	c34d458251	unix_bind(): allocate addr earlier makes it easier to massage; we do pay for that by extra work (kmalloc+memcpy+kfree) in some error cases, but those are not on the hot paths anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
Al Viro	185ab886d3	af_unix: take address assignment/hash insertion into a new helper Duplicated logics in all bind variants (autobind, bind-to-path, bind-to-abstract) gets taken into a common helper. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:28:49 -07:00
David S. Miller	d52f9b22d5	linux-can-fixes-for-5.13-20210619 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmDOZ38THG1rbEBwZW5n dXRyb25peC5kZQAKCRCpyVqK+u3vqf9CCACcVoZsa+47buGmvQmRpyhbA+/a3LFF h5kBpts+igtZ2HnB5ODSu+SeqphYtE+eJeLLxbaw8riig0Vz+ogNJUMoalodYIwx B1jUTKYHg6wxDq6cAqZrG2KpOdIucXEFcugccPF0tjRthet0vZyWxbx66XWzFrp8 +UGK5H/diGihaRqguJwN3P9Mw3SYw4VWo2J2iYQ8WkGT1sy1UO4XuO9U6KqNBHmA 3a48VrgtwC4yZI7+Ar36SNMnL9P3qArE6UlqtpYDudmqpSCX08A4itM5rJ4UGSKk PwetFCvjhjsxFs081ILaBe2Ktu3fl3i+FsZF/hwv99p45l4OCaYBEwfw =xYM1 -----END PGP SIGNATURE----- Merge tag 'linux-can-fixes-for-5.13-20210619' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2021-06-19 this is a pull request of 5 patches for net/master. The first patch is by Thadeu Lima de Souza Cascardo and fixes a potential use-after-free in the CAN broadcast manager socket, by delaying the release of struct bcm_op after synchronize_rcu(). Oliver Hartkopp's patch fixes a similar potential user-after-free in the CAN gateway socket by synchronizing RCU operations before removing gw job entry. Another patch by Oliver Hartkopp fixes a potential use-after-free in the ISOTP socket by omitting unintended hrtimer restarts on socket release. Oleksij Rempel's patch for the j1939 socket fixes a potential use-after-free by setting the SOCK_RCU_FREE flag on the socket. The last patch is by Pavel Skripkin and fixes a use-after-free in the ems_usb CAN driver. All patches are intended for stable and have stable@v.k.o on Cc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:23:49 -07:00
Dan Carpenter	1a1100d53f	net/smc: Fix ENODATA tests in smc_nl_get_fback_stats() These functions return negative ENODATA but the minus sign was left out in the tests. Fixes: `f0dd7bf5e3` ("net/smc: Add netlink support for SMC fallback statistics") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:16:58 -07:00
Sebastian Andrzej Siewior	2b4cd14fd9	net/netif_receive_skb_core: Use migrate_disable() The preempt disable around do_xdp_generic() has been introduced in commit `bbbe211c29` ("net: rcu lock and preempt disable missing around generic xdp") For BPF it is enough to use migrate_disable() and the code was updated as it can be seen in commit `3c58482a38` ("bpf: Provide bpf_prog_run_pin_on_cpu() helper") This is a leftover which was not converted. Use migrate_disable() before invoking do_xdp_generic(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:08:02 -07:00
Cong Wang	781dd0431e	skmsg: Increase sk->sk_drops when dropping packets It is hard to observe packet drops without increasing relevant drop counters, here we should increase sk->sk_drops which is a protocol-independent counter. Fortunately psock is always associated with a struct sock, we can just use psock->sk. Suggested-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-9-xiyou.wangcong@gmail.com	2021-06-21 16:48:44 +02:00
Cong Wang	42830571f1	skmsg: Pass source psock to sk_psock_skb_redirect() sk_psock_skb_redirect() only takes skb as a parameter, we will need to know where this skb is from, so just pass the source psock to this function as a new parameter. This patch prepares for the next one. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-8-xiyou.wangcong@gmail.com	2021-06-21 16:48:41 +02:00
Cong Wang	1581a6c1c3	skmsg: Teach sk_psock_verdict_apply() to return errors Currently sk_psock_verdict_apply() is void, but it handles some error conditions too. Its caller is impossible to learn whether it succeeds or fails, especially sk_psock_verdict_recv(). Make it return int to indicate error cases and propagate errors to callers properly. Fixes: `ef5659280e` ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-7-xiyou.wangcong@gmail.com	2021-06-21 16:48:37 +02:00
Cong Wang	0cf6672b23	skmsg: Fix a memory leak in sk_psock_verdict_apply() If the dest psock does not set SK_PSOCK_TX_ENABLED, the skb can't be queued anywhere so must be dropped. This one is found during code review. Fixes: `799aa7f98d` ("skmsg: Avoid lock_sock() in sk_psock_backlog()") Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-6-xiyou.wangcong@gmail.com	2021-06-21 16:48:33 +02:00
Cong Wang	30b9c54a70	skmsg: Clear skb redirect pointer before dropping it When we drop skb inside sk_psock_skb_redirect(), we have to clear its skb->_sk_redir pointer too, otherwise kfree_skb() would misinterpret it as a valid skb->_skb_refdst and dst_release() would eventually complain. Fixes: `e3526bb92a` ("skmsg: Move sk_redir from TCP_SKB_CB to skb") Reported-by: Jiang Wang <jiang.wang@bytedance.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-5-xiyou.wangcong@gmail.com	2021-06-21 16:48:29 +02:00
Cong Wang	e00a5c331b	udp: Fix a memory leak in udp_read_sock() sk_psock_verdict_recv() clones the skb and uses the clone afterward, so udp_read_sock() should free the skb after using it, regardless of error or not. This fixes a real kmemleak. Fixes: `d7f571188e` ("udp: Implement ->read_sock() for sockmap") Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-4-xiyou.wangcong@gmail.com	2021-06-21 16:48:24 +02:00
Cong Wang	9f2470fbc4	skmsg: Improve udp_bpf_recvmsg() accuracy I tried to reuse sk_msg_wait_data() for different protocols, but it turns out it can not be simply reused. For example, UDP actually uses two queues to receive skb: udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have to check both of them to know whether we have received any packet. Also, UDP does not lock the sock during BH Rx path, it makes no sense for its ->recvmsg() to lock the sock. It is always possible for ->recvmsg() to be called before packets actually arrive in the receive queue, we just use best effort to make it accurate here. Fixes: `1f5be6b3b0` ("udp: Implement udp_bpf_recvmsg() for sockmap") Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com	2021-06-21 16:48:11 +02:00
Florian Westphal	b5a1d1fe0c	xfrm: replay: remove last replay indirection This replaces the overflow indirection with the new xfrm_replay_overflow helper. After this, the 'repl' pointer in xfrm_state is no longer needed and can be removed as well. xfrm_replay_overflow() is added in two incarnations, one is used when the kernel is compiled with xfrm hardware offload support enabled, the other when its disabled. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-21 09:55:06 +02:00
Florian Westphal	adfc2fdbae	xfrm: replay: avoid replay indirection Add and use xfrm_replay_check helper instead of indirection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-21 09:55:06 +02:00
Florian Westphal	25cfb8bc97	xfrm: replay: remove recheck indirection Adds new xfrm_replay_recheck() helper and calls it from xfrm input path instead of the indirection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-21 09:55:06 +02:00
Florian Westphal	c7f877833c	xfrm: replay: remove advance indirection Similar to other patches: add a new helper to avoid an indirection. v2: fix 'net/xfrm/xfrm_replay.c:519:13: warning: 'seq' may be used uninitialized in this function' warning. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-21 09:55:06 +02:00
Florian Westphal	cfc61c598e	xfrm: replay: avoid xfrm replay notify indirection replay protection is implemented using a callback structure and then called via x->repl->notify(), x->repl->recheck(), and so on. all the differect functions are always built-in, so this could be direct calls instead. This first patch prepares for removal of the x->repl structure. Add an enum with the three available replay modes to the xfrm_state structure and then replace all x->repl->notify() calls by the new xfrm_replay_notify() helper. The helper checks the enum internally to adapt behaviour as needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-21 09:55:06 +02:00
Oleksij Rempel	22c696fed2	can: j1939: j1939_sk_init(): set SOCK_RCU_FREE to call sk_destruct() after RCU is done Set SOCK_RCU_FREE to let RCU to call sk_destruct() on completion. Without this patch, we will run in to j1939_can_recv() after priv was freed by j1939_sk_release()->j1939_sk_sock_destruct() Fixes: `25fe97cb76` ("can: j1939: move j1939_priv_put() into sk_destruct callback") Link: https://lore.kernel.org/r/20210617130623.12705-1-o.rempel@pengutronix.de Cc: linux-stable <stable@vger.kernel.org> Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reported-by: syzbot+bdf710cfc41c186fdff3@syzkaller.appspotmail.com Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-19 23:54:00 +02:00
Oliver Hartkopp	14a4696bc3	can: isotp: isotp_release(): omit unintended hrtimer restart on socket release When closing the isotp socket, the potentially running hrtimers are canceled before removing the subscription for CAN identifiers via can_rx_unregister(). This may lead to an unintended (re)start of a hrtimer in isotp_rcv_cf() and isotp_rcv_fc() in the case that a CAN frame is received by isotp_rcv() while the subscription removal is processed. However, isotp_rcv() is called under RCU protection, so after calling can_rx_unregister, we may call synchronize_rcu in order to wait for any RCU read-side critical sections to finish. This prevents the reception of CAN frames after hrtimer_cancel() and therefore the unintended (re)start of the hrtimers. Link: https://lore.kernel.org/r/20210618173713.2296-1-socketcan@hartkopp.net Fixes: `e057dd3fc2` ("can: add ISO 15765-2:2016 transport protocol") Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-19 23:54:00 +02:00
Oliver Hartkopp	fb8696ab14	can: gw: synchronize rcu operations before removing gw job entry can_can_gw_rcv() is called under RCU protection, so after calling can_rx_unregister(), we have to call synchronize_rcu in order to wait for any RCU read-side critical sections to finish before removing the kmem_cache entry with the referenced gw job entry. Link: https://lore.kernel.org/r/20210618173645.2238-1-socketcan@hartkopp.net Fixes: `c1aabdf379` ("can-gw: add netlink based CAN routing") Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-19 23:53:43 +02:00
Thadeu Lima de Souza Cascardo	d5f9023fa6	can: bcm: delay release of struct bcm_op after synchronize_rcu() can_rx_register() callbacks may be called concurrently to the call to can_rx_unregister(). The callbacks and callback data, though, are protected by RCU and the struct sock reference count. So the callback data is really attached to the life of sk, meaning that it should be released on sk_destruct. However, bcm_remove_op() calls tasklet_kill(), and RCU callbacks may be called under RCU softirq, so that cannot be used on kernels before the introduction of HRTIMER_MODE_SOFT. However, bcm_rx_handler() is called under RCU protection, so after calling can_rx_unregister(), we may call synchronize_rcu() in order to wait for any RCU read-side critical sections to finish. That is, bcm_rx_handler() won't be called anymore for those ops. So, we only free them, after we do that synchronize_rcu(). Fixes: `ffd980f976` ("[CAN]: Add broadcast manager (bcm) protocol") Link: https://lore.kernel.org/r/20210619161813.2098382-1-cascardo@canonical.com Cc: linux-stable <stable@vger.kernel.org> Reported-by: syzbot+0f7e7e5e2f4f40fa89c0@syzkaller.appspotmail.com Reported-by: Norbert Slusarek <nslusarek@gmx.net> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-19 23:47:32 +02:00
Jakub Kicinski	adc2e56ebe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-06-18 19:47:02 -07:00
Linus Torvalds	9ed13a17e3	Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of 0.0.0.0 - cdc_ncm: switch to eth%d interface naming Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDNP7EACgkQMUZtbf5S IrvTmxAAgOAM9MdRl9wnYtqXKPXJ1JJtenozwt1yX6b6OG+Ns7cm6YYafU3KoZWR KlzpvP90vRrER3RqksbMngHzvGjZKDS4LWRur7sRlJ1TBQoLrQCIbriAh07d7wlU 0nnS4J8mczTCKx78QCUYy1QBIX5TQrUbx0JQZDPoIPBjFeILW+Gx/Ghg5tUR4mhf 6icYqwIPocTXO37ZmWOzezZNVOXJF4kaQUZeuOHNe5hOtm6EeIpZbW1Xx3DIr5bd 80a/uNU7nVyos0n7jxnfVE/oelTnYbT5scZeV/PPVqZ4U113f7uex2QP23/XhGSX lK1EhwPqPOyaNhQoihLM6Xzd4o7aZOcmF8NY96xqjC+DqdN+juvfJU+ClCZojGIj H4bwCSaj3y2PiimfQdBiIKvYMc5d4zBdw/Dpk/gLDp4d5N638TAtuunK4Mj+TEuT QF1qkBLIB4HFtLS0M35/twk93md/5GUdSTij2GB3fOkAWRu2m266P5m+4DigW/TB Xm8FgKdetvxVP0Qv/p49nPEn24Ny8wCafH1x1wVTmoda2qi6j1EXMuSa0PlCdz70 Sl5FrlxdEkOpC4p+Aoc8APSoBXnOriAlpU+z/EVb8Co4JR/+Ge5zBWpsiZDVD0/K Ay0FW3I87iyn9tw1H1Fzr9GBlVl5vWRauZFHjzl90fWakCrCzJE= =xxUe -----END PGP SIGNATURE----- Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of 0.0.0.0 - cdc_ncm: switch to eth%d interface naming" * tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits) net: ethernet: fix potential use-after-free in ec_bhf_remove selftests/net: Add icmp.sh for testing ICMP dummy address responses icmp: don't send out ICMP messages with a source address of 0.0.0.0 net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY net: ll_temac: Fix TX BD buffer overwrite net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Make sure to free skb when it is completely used MAINTAINERS: add Guvenc as SMC maintainer bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path bnxt_en: Fix TQM fastpath ring backing store computation bnxt_en: Rediscover PHY capabilities after firmware reset cxgb4: fix wrong shift. mac80211: handle various extensible elements correctly mac80211: reset profile_periodicity/ema_ap cfg80211: avoid double free of PMSR request cfg80211: make certificate generation more robust mac80211: minstrel_ht: fix sample time check net: qed: Fix memcpy() overflow of qed_dcbx_params() net: cdc_eem: fix tx fixup skb leak net: hamradio: fix memory leak in mkiss_close ...	2021-06-18 18:55:29 -07:00
David S. Miller	103ebe658a	Revert "net: add pf_family_names[] for protocol family" This reverts commit `1f3c98eadd`. Does not build... Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 13:02:45 -07:00
Yejune Deng	1f3c98eadd	net: add pf_family_names[] for protocol family Modify the pr_info content from int to char *, this looks more readable. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 13:02:21 -07:00
Stefano Garzarella	91aa49a8fa	vsock/virtio: remove redundant `copy_failed` variable When memcpy_to_msg() fails in virtio_transport_seqpacket_do_dequeue(), we already set `dequeued_len` with the negative error value returned by memcpy_to_msg(). So we can directly check `dequeued_len` value instead of using a dedicated flag variable to skip the copy path for the rest of fragments. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:59:53 -07:00
Stefano Garzarella	0de5b2e672	vsock: rename vsock_wait_data() vsock_wait_data() is used only by STREAM and SEQPACKET sockets, so let's rename it to vsock_connectible_wait_data(), using the same nomenclature (connectible) used in other functions after the introduction of SEQPACKET. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:59:53 -07:00
Stefano Garzarella	cc97141afd	vsock: rename vsock_has_data() vsock_has_data() is used only by STREAM and SEQPACKET sockets, so let's rename it to vsock_connectible_has_data(), using the same nomenclature (connectible) used in other functions after the introduction of SEQPACKET. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:59:53 -07:00
David S. Miller	0d1dc9e1f4	A couple of straggler fixes: * a minstrel HT sample check fix * peer measurement could double-free on races * certificate file generation at build time could sometimes hang * some parameters weren't reset between connections in mac80211 * some extensible elements were treated as non- extensible, possibly causuing bad connections (or failures) if the AP adds data -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmDMgxsACgkQB8qZga/f l8QsDBAAhY5zN2LFdxkqyfPd8DJs2KpnE1osSi1qjmPOItn7K7H6hD6jN/UaaysQ uC1ngAyuiRMtO5JgAtj58NlnDNM3IYvYxt909PnG/NAuNGW9RDebEf2H8JGKzCTR sFW6QKOj4CkVyLwjRwu3VziI0WOaF0kNoNW2ZSr4DEHSS9siMe5svv5fLqoNxNCP 9fhS1T5xgDZfcGVdedXzilH1waqsEzPeRYY7TKGr/TZwDPksYmNsFU7mETqzKV14 OuGan7eolZ6Q869FydkKs+J9NDiHXEBVM4vt6K/2I+qHXAUUsui01l+l1oV4+XzW Jh3eS7t72uov1UV5jVvLjrFvKOWBu1RpsO+8XfUqnTa7AvDdC5jrBTWzFYaATmqm OtfVy3JSkd8d9eMX6Yg3/K/f9WoNPIyrR1BbbOCpWN3tHvE2xc8fWsRmS3o6VnpP DZ/+Za4csLKl5/D1x3cqYnIaLwQdD75WNGJU10UvvyPyNsKLsw4UxfSm49gWXXBm /fqXGS2SJX39GiHysZAnQlpRy9x03E/qkWaPZWx+xYP4zkr5MNecM5kmiINZINBA eJPjO8Ex2ODkNf/BAmzHhIyPilRw0ypDa8K5NS/KCp2WBA01lEgyglRD0Rnz5vjD MSP+cV38SjFoOxxiN1qtB1bSyN0EN5MdFwyrerJjmDRp/sqA5xE= =Nh7Q -----END PGP SIGNATURE----- Merge tag 'mac80211-for-net-2021-06-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A couple of straggler fixes: * a minstrel HT sample check fix * peer measurement could double-free on races * certificate file generation at build time could sometimes hang * some parameters weren't reset between connections in mac80211 * some extensible elements were treated as non- extensible, possibly causuing bad connections (or failures) if the AP adds data ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:22:55 -07:00
Toke Høiland-Jørgensen	3218274773	icmp: don't send out ICMP messages with a source address of 0.0.0.0 When constructing ICMP response messages, the kernel will try to pick a suitable source address for the outgoing packet. However, if no IPv4 addresses are configured on the system at all, this will fail and we end up producing an ICMP message with a source address of 0.0.0.0. This can happen on a box routing IPv4 traffic via v6 nexthops, for instance. Since 0.0.0.0 is not generally routable on the internet, there's a good chance that such ICMP messages will never make it back to the sender of the original packet that the ICMP message was sent in response to. This, in turn, can create connectivity and PMTUd problems for senders. Fortunately, RFC7600 reserves a dummy address to be used as a source for ICMP messages (192.0.0.8/32), so let's teach the kernel to substitute that address as a last resort if the regular source address selection procedure fails. Below is a quick example reproducing this issue with network namespaces: ip netns add ns0 ip l add type veth peer netns ns0 ip l set dev veth0 up ip a add 10.0.0.1/24 dev veth0 ip a add fc00:dead:cafe:42::1/64 dev veth0 ip r add 10.1.0.0/24 via inet6 fc00:dead:cafe:42::2 ip -n ns0 l set dev veth0 up ip -n ns0 a add fc00:dead:cafe:42::2/64 dev veth0 ip -n ns0 r add 10.0.0.0/24 via inet6 fc00:dead:cafe:42::1 ip netns exec ns0 sysctl -w net.ipv4.icmp_ratelimit=0 ip netns exec ns0 sysctl -w net.ipv4.ip_forward=1 tcpdump -tpni veth0 -c 2 icmp & ping -w 1 10.1.0.1 > /dev/null tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on veth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 29, seq 1, length 64 IP 0.0.0.0 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92 2 packets captured 2 packets received by filter 0 packets dropped by kernel With this patch the above capture changes to: IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 31127, seq 1, length 64 IP 192.0.0.8 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Juliusz Chroboczek <jch@irif.fr> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:13:24 -07:00
Colin Ian King	040c12570e	net: bridge: remove redundant continue statement The continue statement at the end of a for-loop has no effect, invert the if expression and remove the continue. Addresses-Coverity: ("Continue has no effect") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:08:40 -07:00
Dongliang Mu	9fd2bc3206	net: caif: modify the label out_err to out Modify the label out_err to out to avoid the meanless kfree. Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:07:09 -07:00
Jakub Kicinski	9d72b8da9f	net: vlan: pass thru all GSO_SOFTWARE in hw_enc_features Currently UDP tunnel devices on top of VLANs lose the ability to offload UDP GSO. Widen the pass thru features from TSO to all GSO_SOFTWARE. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:58:03 -07:00
Geliang Tang	fc3c82eebf	mptcp: add a new sysctl checksum_enabled This patch added a new sysctl, named checksum_enabled, to control whether DSS checksum can be enabled. Acked-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	fe3ab1cbd3	mptcp: add the mib for data checksum This patch added the mib for the data checksum, MPTCP_MIB_DATACSUMERR. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Paolo Abeni	4e14867d5e	mptcp: tune re-injections for csum enabled mode If the MPTCP-level checksum is enabled, on re-injections we must spool a complete DSS, or the receive side will not be able to compute the csum and process any data. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Paolo Abeni	dd8bcd1768	mptcp: validate the data checksum This patch added three new members named data_csum, csum_len and map_csum in struct mptcp_subflow_context, implemented a new function named mptcp_validate_data_checksum(). If the current mapping is valid and csum is enabled traverse the later pending skbs and compute csum incrementally till the whole mapping has been covered. If not enough data is available in the rx queue, return MAPPING_EMPTY - that is, no data. Next subflow_data_ready invocation will trigger again csum computation. When the full DSS is available, validate the csum and return to the caller an appropriate error code, to trigger subflow reset of fallback as required by the RFC. Additionally: - if the csum prevence in the DSS don't match the negotiated value e.g. csum present, but not requested, return invalid mapping to trigger subflow reset. - keep some csum state, to avoid re-compute the csum on the same data when multiple rx queue traversal are required. - clean-up the uncompleted mapping from the receive queue on close, to allow proper subflow disposal Co-developed-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	390b95a5fb	mptcp: receive checksum for DSS In mptcp_parse_option, adjust the expected_opsize, and always parse the data checksum value from the receiving DSS regardless of csum presence. Then save it in mp_opt->csum. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	208e8f6692	mptcp: receive checksum for MP_CAPABLE with data This patch added a new member named csum in struct mptcp_options_received. When parsing the MP_CAPABLE with data, if the checksum is enabled, adjust the expected_opsize. If the receiving option length matches the length with the data checksum, get the checksum value and save it in mp_opt->csum. And in mptcp_incoming_options, pass it to mpext->csum. We always parse any csum/nocsum combination and delay the presence check to later code, to allow reset if missing. Additionally, in the TX path, use the newly introduce ext field to avoid MPTCP csum recomputation on TCP retransmission and unneeded csum update on when setting the data fin_flag. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	0625118115	mptcp: add csum_reqd in mptcp_options_received This patch added a new flag csum_reqd in struct mptcp_options_received, if the flag MPTCP_CAP_CHECKSUM_REQD is set in the receiving MP_CAPABLE suboption, set this flag. In mptcp_sk_clone and subflow_finish_connect, if the csum_reqd flag is set, enable the msk->csum_enabled flag. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	c863225b79	mptcp: add sk parameter for mptcp_get_options This patch added a new parameter name sk in mptcp_get_options(). Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	c5b39e26d0	mptcp: send out checksum for DSS In mptcp_write_options, if the checksum is enabled, adjust the option length and send out the data checksum with DSS suboption. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	c94b1f96dc	mptcp: send out checksum for MP_CAPABLE with data If the checksum is enabled, send out the data checksum with the MP_CAPABLE suboption with data. In mptcp_established_options_mp, save the data checksum in opts->ext_copy.csum. In mptcp_write_options, adjust the option length and send it out with the MP_CAPABLE suboption. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	06fe1719aa	mptcp: add csum_reqd in mptcp_out_options This patch added a new member csum_reqd in struct mptcp_out_options and struct mptcp_subflow_request_sock. Initialized it with the helper function mptcp_is_checksum_enabled(). In mptcp_write_options, if this field is enabled, send out the MP_CAPABLE suboption with the MPTCP_CAP_CHECKSUM_REQD flag. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	d0cc298745	mptcp: generate the data checksum This patch added a new member named csum in struct mptcp_ext, implemented a new function named mptcp_generate_data_checksum(). Generate the data checksum in mptcp_sendmsg_frag, save it in mpext->csum. Note that we must generate the csum for zero window probe, too. Do the csum update incrementally, to avoid multiple csum computation when the data is appended to existing skb. Note that in a later patch we will skip unneeded csum related operation. Changes not included here to keep the delta small. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Geliang Tang	752e906732	mptcp: add csum_enabled in mptcp_sock This patch added a new member named csum_enabled in struct mptcp_sock, used a dummy mptcp_is_checksum_enabled() helper to initialize it. Also added a new member named mptcpi_csum_enabled in struct mptcp_info to expose the csum_enabled flag. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:40:11 -07:00
Andrea Mayer	8b532109bf	seg6: add support for SRv6 End.DT46 Behavior IETF RFC 8986 [1] includes the definition of SRv6 End.DT4, End.DT6, and End.DT46 Behaviors. The current SRv6 code in the Linux kernel only implements End.DT4 and End.DT6 which can be used respectively to support IPv4-in-IPv6 and IPv6-in-IPv6 VPNs. With End.DT4 and End.DT6 it is not possible to create a single SRv6 VPN tunnel to carry both IPv4 and IPv6 traffic. The proposed End.DT46 implementation is meant to support the decapsulation of IPv4 and IPv6 traffic coming from a single SRv6 tunnel. The implementation of the SRv6 End.DT46 Behavior in the Linux kernel greatly simplifies the setup and operations of SRv6 VPNs. The SRv6 End.DT46 Behavior leverages the infrastructure of SRv6 End.DT{4,6} Behaviors implemented so far, because it makes use of a VRF device in order to force the routing lookup into the associated routing table. To make the End.DT46 work properly, it must be guaranteed that the routing table used for routing lookup operations is bound to one and only one VRF during the tunnel creation. Such constraint has to be enforced by enabling the VRF strict_mode sysctl parameter, i.e.: $ sysctl -wq net.vrf.strict_mode=1 Note that the same approach is used for the SRv6 End.DT4 Behavior and for the End.DT6 Behavior in VRF mode. The command used to instantiate an SRv6 End.DT46 Behavior is straightforward, i.e.: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100. [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-enddt46-decapsulation-and-s ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Performance and impact of SRv6 End.DT46 Behavior on the SRv6 Networking ======================================================================= This patch aims to add the SRv6 End.DT46 Behavior with minimal impact on the performance of SRv6 End.DT4 and End.DT6 Behaviors. In order to verify this, we tested the performance of the newly introduced SRv6 End.DT46 Behavior and compared it with the performance of SRv6 End.DT{4,6} Behaviors, considering both the patched kernel and the kernel before applying the End.DT46 patch (referred to as vanilla kernel). In details, the following decapsulation scenarios were considered: 1.a) IPv6 traffic in SRv6 End.DT46 Behavior on patched kernel; 1.b) IPv4 traffic in SRv6 End.DT46 Behavior on patched kernel; 2.a) SRv6 End.DT6 Behavior (VRF mode) on patched kernel; 2.b) SRv6 End.DT4 Behavior on patched kernel; 3.a) SRv6 End.DT6 Behavior (VRF mode) on vanilla kernel (without the End.DT46 patch); 3.b) SRv6 End.DT4 Behavior on vanilla kernel (without the End.DT46 patch). All tests were performed on a testbed deployed on the CloudLab [2] facilities. We considered IPv{4,6} traffic handled by a single core (at 2.4 GHz on a Xeon(R) CPU E5-2630 v3) on kernel 5.13-rc1 using packets of size ~ 100 bytes. Scenario (1.a): average 684.70 kpps; std. dev. 0.7 kpps; Scenario (1.b): average 711.69 kpps; std. dev. 1.2 kpps; Scenario (2.a): average 690.70 kpps; std. dev. 1.2 kpps; Scenario (2.b): average 722.22 kpps; std. dev. 1.7 kpps; Scenario (3.a): average 690.02 kpps; std. dev. 2.6 kpps; Scenario (3.b): average 721.91 kpps; std. dev. 1.2 kpps; Considering the results for the patched kernel (1.a, 1.b, 2.a, 2.b) we observe that the performance degradation incurred in using End.DT46 rather than End.DT6 and End.DT4 respectively for IPv6 and IPv4 traffic is minimal, around 0.9% and 1.5%. Such very minimal performance degradation is the price to be paid if one prefers to use a single tunnel capable of handling both types of traffic (IPv4 and IPv6). Comparing the results for End.DT4 and End.DT6 under the patched and the vanilla kernel (2.a, 2.b, 3.a, 3.b) we observe that the introduction of the End.DT46 patch has no impact on the performance of End.DT4 and End.DT6. [2] https://www.cloudlab.us Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 11:35:47 -07:00
Magnus Karlsson	f654fae47e	xsk: Fix broken Tx ring validation Fix broken Tx ring validation for AF_XDP. The commit under the Fixes tag, fixed an off-by-one error in the validation but introduced another error. Descriptors are now let through even if they straddle a chunk boundary which they are not allowed to do in aligned mode. Worse is that they are let through even if they straddle the end of the umem itself, tricking the kernel to read data outside the allowed umem region which might or might not be mapped at all. Fix this by reintroducing the old code, but subtract the length by one to fix the off-by-one error that the original patch was addressing. The test chunk != chunk_end makes sure packets do not straddle chunk boundraries. Note that packets of zero length are allowed in the interface, therefore the test if the length is non-zero. Fixes: `ac31565c21` ("xsk: Fix for xp_aligned_validate_desc() when len == chunk_size") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20210618075805.14412-1-magnus.karlsson@gmail.com	2021-06-18 16:59:20 +02:00
Florian Westphal	62eec0d733	netfilter: conntrack: pass hook state to log functions The packet logger backend is unable to provide the incoming (or outgoing) interface name because that information isn't available. Pass the hook state, it contains the network namespace, the protocol family, the network interfaces and other things. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-18 14:47:43 +02:00
Johannes Berg	652e8363bb	mac80211: handle various extensible elements correctly Various elements are parsed with a requirement to have an exact size, when really we should only check that they have the minimum size that we need. Check only that and therefore ignore any additional data that they might carry. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.cd101f8040a4.Iadf0e9b37b100c6c6e79c7b298cc657c2be9151a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-18 13:25:49 +02:00
Johannes Berg	bbc6f03ff2	mac80211: reset profile_periodicity/ema_ap Apparently we never clear these values, so they'll remain set since the setting of them is conditional. Clear the values in the relevant other cases. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.316e32d136a9.I2a12e51814258e1e1b526103894f4b9f19a91c8d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-18 13:25:49 +02:00
Avraham Stern	0288e5e16a	cfg80211: avoid double free of PMSR request If cfg80211_pmsr_process_abort() moves all the PMSR requests that need to be freed into a local list before aborting and freeing them. As a result, it is possible that cfg80211_pmsr_complete() will run in parallel and free the same PMSR request. Fix it by freeing the request in cfg80211_pmsr_complete() only if it is still in the original pmsr list. Cc: stable@vger.kernel.org Fixes: `9bb7e0f24e` ("cfg80211: add peer measurement with FTM initiator API") Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.1fbef57e269a.I00294bebdb0680b892f8d1d5c871fd9dbe785a5e@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-18 13:25:24 +02:00
Johannes Berg	b5642479b0	cfg80211: make certificate generation more robust If all net/wireless/certs/*.hex files are deleted, the build will hang at this point since the 'cat' command will have no arguments. Do "echo \| cat - ..." so that even if the "..." part is empty, the whole thing won't hang. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.c989056c3664.Ic3b77531d00b30b26dcd69c64e55ae2f60c3f31e@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-18 13:25:15 +02:00
Peter Zijlstra	2f064a59a1	sched: Change task_struct::state Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org	2021-06-18 11:43:09 +02:00
Felix Fietkau	1236af327a	mac80211: minstrel_ht: fix sample time check We need to skip sampling if the next sample time is after jiffies, not before. This patch fixes an issue where in some cases only very little sampling (or none at all) is performed, leading to really bad data rates Fixes: `80d55154b2` ("mac80211: minstrel_ht: significantly redesign the rate probing strategy") Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210617103854.61875-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-06-18 11:35:29 +02:00
David S. Miller	a52171ae7b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-17 The following pull-request contains BPF updates for your net-next tree. We've added 50 non-merge commits during the last 25 day(s) which contain a total of 148 files changed, 4779 insertions(+), 1248 deletions(-). The main changes are: 1) BPF infrastructure to migrate TCP child sockets from a listener to another in the same reuseport group/map, from Kuniyuki Iwashima. 2) Add a provably sound, faster and more precise algorithm for tnum_mul() as noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan. 3) Streamline error reporting changes in libbpf as planned out in the 'libbpf: the road to v1.0' effort, from Andrii Nakryiko. 4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu. 5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek. 6) Support new LLVM relocations in libbpf to make them more linker friendly, also add a doc to describe the BPF backend relocations, from Yonghong Song. 7) Silence long standing KUBSAN complaints on register-based shifts in interpreter, from Daniel Borkmann and Eric Biggers. 8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when target arch cannot be determined, from Lorenz Bauer. 9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson. 10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi. 11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF programs to bpf_helpers.h header, from Florent Revest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-17 11:54:56 -07:00
Yang Yingliang	55d96f72e8	net: sched: fix error return code in tcf_del_walker() When nla_put_u32() fails, 'ret' could be 0, it should return error code in tcf_del_walker(). Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-17 11:36:18 -07:00
Pablo Neira Ayuso	836382dc24	netfilter: nf_tables: add last expression Add a new optional expression that tells you when last matching on a given rule / set element element has happened. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-17 03:23:00 +02:00
Phil Sutter	06e95f0a2a	netfilter: nft_extdhr: Drop pointless check of tprot_set Pablo says, tprot_set is only there to detect if tprot was set to IPPROTO_IP as that evaluates to zero. Therefore, code asserting a different value in tprot does not need to check tprot_set. Fixes: `935b7f6430` ("netfilter: nft_exthdr: add TCP option matching") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-16 22:25:01 +02:00
Phil Sutter	5acc44f394	netfilter: nft_exthdr: Search chunks in SCTP packets only Since user space does not generate a payload dependency, plain sctp chunk matches cause searching in non-SCTP packets, too. Avoid this potential mis-interpretation of packet data by checking pkt->tprot. Fixes: `133dc203d7` ("netfilter: nft_exthdr: Support SCTP chunks") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-16 22:25:01 +02:00
Guvenc Gulce	194730a9be	net/smc: Make SMC statistics network namespace aware Make the gathered SMC statistics network namespace aware, for each namespace collect an own set of statistic information. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:54:02 -07:00
Guvenc Gulce	f0dd7bf5e3	net/smc: Add netlink support for SMC fallback statistics Add support to collect more detailed SMC fallback reason statistics and provide these statistics to user space on the netlink interface. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:54:02 -07:00
Guvenc Gulce	8c40602b4b	net/smc: Add netlink support for SMC statistics Add the netlink function which collects the statistics information and delivers it to the userspace. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:54:02 -07:00
Guvenc Gulce	e0e4b8fa53	net/smc: Add SMC statistics support Add the ability to collect SMC statistics information. Per-cpu variables are used to collect the statistic information for better performance and for reducing concurrency pitfalls. The code that is collecting statistic data is implemented in macros to increase code reuse and readability. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:54:02 -07:00
Eric Dumazet	a494bd642d	net/af_unix: fix a data-race in unix_dgram_sendmsg / unix_release_sock While unix_may_send(sk, osk) is called while osk is locked, it appears unix_release_sock() can overwrite unix_peer() after this lock has been released, making KCSAN unhappy. Changing unix_release_sock() to access/change unix_peer() before lock is released should fix this issue. BUG: KCSAN: data-race in unix_dgram_sendmsg / unix_release_sock write to 0xffff88810465a338 of 8 bytes by task 20852 on cpu 1: unix_release_sock+0x4ed/0x6e0 net/unix/af_unix.c:558 unix_release+0x2f/0x50 net/unix/af_unix.c:859 __sock_release net/socket.c:599 [inline] sock_close+0x6c/0x150 net/socket.c:1258 __fput+0x25b/0x4e0 fs/file_table.c:280 ____fput+0x11/0x20 fs/file_table.c:313 task_work_run+0xae/0x130 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x156/0x190 kernel/entry/common.c:209 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:302 do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88810465a338 of 8 bytes by task 20888 on cpu 0: unix_may_send net/unix/af_unix.c:189 [inline] unix_dgram_sendmsg+0x923/0x1610 net/unix/af_unix.c:1712 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0xffff888167905400 -> 0x0000000000000000 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 20888 Comm: syz-executor.0 Not tainted 5.13.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:51:55 -07:00
Eric Dumazet	e032f7c9c7	net/packet: annotate accesses to po->ifindex Like prior patch, we need to annotate lockless accesses to po->ifindex For instance, packet_getname() is reading po->ifindex (twice) while another thread is able to change po->ifindex. KCSAN reported: BUG: KCSAN: data-race in packet_do_bind / packet_getname write to 0xffff888143ce3cbc of 4 bytes by task 25573 on cpu 1: packet_do_bind+0x420/0x7e0 net/packet/af_packet.c:3191 packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255 __sys_bind+0x200/0x290 net/socket.c:1637 __do_sys_bind net/socket.c:1648 [inline] __se_sys_bind net/socket.c:1646 [inline] __x64_sys_bind+0x3d/0x50 net/socket.c:1646 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888143ce3cbc of 4 bytes by task 25578 on cpu 0: packet_getname+0x5b/0x1a0 net/packet/af_packet.c:3525 __sys_getsockname+0x10e/0x1a0 net/socket.c:1887 __do_sys_getsockname net/socket.c:1902 [inline] __se_sys_getsockname net/socket.c:1899 [inline] __x64_sys_getsockname+0x3e/0x50 net/socket.c:1899 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 25578 Comm: syz-executor.5 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:48:18 -07:00
Eric Dumazet	c7d2ef5dd4	net/packet: annotate accesses to po->bind tpacket_snd(), packet_snd(), packet_getname() and packet_seq_show() can read po->num without holding a lock. This means other threads can change po->num at the same time. KCSAN complained about this known fact [1] Add READ_ONCE()/WRITE_ONCE() to address the issue. [1] BUG: KCSAN: data-race in packet_do_bind / packet_sendmsg write to 0xffff888131a0dcc0 of 2 bytes by task 24714 on cpu 0: packet_do_bind+0x3ab/0x7e0 net/packet/af_packet.c:3181 packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255 __sys_bind+0x200/0x290 net/socket.c:1637 __do_sys_bind net/socket.c:1648 [inline] __se_sys_bind net/socket.c:1646 [inline] __x64_sys_bind+0x3d/0x50 net/socket.c:1646 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888131a0dcc0 of 2 bytes by task 24719 on cpu 1: packet_snd net/packet/af_packet.c:2899 [inline] packet_sendmsg+0x317/0x3570 net/packet/af_packet.c:3040 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmsg+0x1ed/0x270 net/socket.c:2433 __do_sys_sendmsg net/socket.c:2442 [inline] __se_sys_sendmsg net/socket.c:2440 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2440 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0x1200 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 24719 Comm: syz-executor.5 Not tainted 5.13.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:48:18 -07:00
David S. Miller	e82a35aead	linux-can-fixes-for-5.13-20210616 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmDJ2AUTHG1rbEBwZW5n dXRyb25peC5kZQAKCRCpyVqK+u3vqSfnCAClPazQgSljFZSz4KfH72RQvheGnMKN 5HmZXOn8nR7/3LMGVovtqK+ZVhnRaQdILDp1RIGjv7Itti4gc+vdmMdrX3zPBiEA CjsHVAP7Fj7CTIUp7X/1SjHrhgjy0RzbzfreXRmUSsQVHIpIsTc5aqNPTwhhPFRU 5wSzs80PPQP5OZjPyPGc3v6gzbiNQH+XqQX/ipR29uteP7eFQBmRkbaga30Q9CrT 9I4748smzsEvbHTpOn4mcsjZJBGN3qT21GvtS/BsoXVBBdoG92Wwoo0ng33dG/FH j4BG2Mr2KhVgFLSCnQFiM+/SQGBj0KfFdcpX3mISZINmcnksdvkmebld =4ghS -----END PGP SIGNATURE----- Merge tag 'linux-can-fixes-for-5.13-20210616' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2021-06-16 this is a pull request of 4 patches for net/master. The first patch is by Oleksij Rempel and fixes a Use-after-Free found by syzbot in the j1939 stack. The next patch is by Tetsuo Handa and fixes hung task detected by syzbot in the bcm, raw and isotp protocols. Norbert Slusarek's patch fixes a infoleak in bcm's struct bcm_msg_head. Pavel Skripkin's patch fixes a memory leak in the mcba_usb driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:44:11 -07:00
Chengyang Fan	d8e2973029	net: ipv4: fix memory leak in ip_mc_add1_src BUG: memory leak unreferenced object 0xffff888101bc4c00 (size 32): comm "syz-executor527", pid 360, jiffies 4294807421 (age 19.329s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 00 00 00 00 00 00 ac 14 14 bb 00 00 02 00 ................ backtrace: [<00000000f17c5244>] kmalloc include/linux/slab.h:558 [inline] [<00000000f17c5244>] kzalloc include/linux/slab.h:688 [inline] [<00000000f17c5244>] ip_mc_add1_src net/ipv4/igmp.c:1971 [inline] [<00000000f17c5244>] ip_mc_add_src+0x95f/0xdb0 net/ipv4/igmp.c:2095 [<000000001cb99709>] ip_mc_source+0x84c/0xea0 net/ipv4/igmp.c:2416 [<0000000052cf19ed>] do_ip_setsockopt net/ipv4/ip_sockglue.c:1294 [inline] [<0000000052cf19ed>] ip_setsockopt+0x114b/0x30c0 net/ipv4/ip_sockglue.c:1423 [<00000000477edfbc>] raw_setsockopt+0x13d/0x170 net/ipv4/raw.c:857 [<00000000e75ca9bb>] __sys_setsockopt+0x158/0x270 net/socket.c:2117 [<00000000bdb993a8>] __do_sys_setsockopt net/socket.c:2128 [inline] [<00000000bdb993a8>] __se_sys_setsockopt net/socket.c:2125 [inline] [<00000000bdb993a8>] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2125 [<000000006a1ffdbd>] do_syscall_64+0x40/0x80 arch/x86/entry/common.c:47 [<00000000b11467c4>] entry_SYSCALL_64_after_hwframe+0x44/0xae In commit `24803f38a5` ("igmp: do not remove igmp souce list info when set link down"), the ip_mc_clear_src() in ip_mc_destroy_dev() was removed, because it was also called in igmpv3_clear_delrec(). Rough callgraph: inetdev_destroy -> ip_mc_destroy_dev -> igmpv3_clear_delrec -> ip_mc_clear_src -> RCU_INIT_POINTER(dev->ip_ptr, NULL) However, ip_mc_clear_src() called in igmpv3_clear_delrec() doesn't release in_dev->mc_list->sources. And RCU_INIT_POINTER() assigns the NULL to dev->ip_ptr. As a result, in_dev cannot be obtained through inetdev_by_index() and then in_dev->mc_list->sources cannot be released by ip_mc_del1_src() in the sock_close. Rough call sequence goes like: sock_close -> __sock_release -> inet_release -> ip_mc_drop_socket -> inetdev_by_index -> ip_mc_leave_src -> ip_mc_del_src -> ip_mc_del1_src So we still need to call ip_mc_clear_src() in ip_mc_destroy_dev() to free in_dev->mc_list->sources. Fixes: `24803f38a5` ("igmp: do not remove igmp souce list info ...") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Chengyang Fan <cy.fan@huawei.com> Acked-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:41:01 -07:00
George McCollister	c2ae34a7de	net: hsr: don't check sequence number if tag removal is offloaded Don't check the sequence number when deciding when to update time_in in the node table if tag removal is offloaded since the sequence number is part of the tag. This fixes a problem where the times in the node table wouldn't update when 0 appeared to be before or equal to seq_out when tag removal was offloaded. Signed-off-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:13:01 -07:00
Pablo Neira Ayuso	52f0f4e178	netfilter: nft_tproxy: restrict support to TCP and UDP transport protocols Add unfront check for TCP and UDP packets before performing further processing. Fixes: `4ed8eb6570` ("netfilter: nf_tables: Add native tproxy support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-16 20:51:51 +02:00
Pablo Neira Ayuso	8f518d43f8	netfilter: nft_osf: check for TCP packet before further processing The osf expression only supports for TCP packets, add a upfront sanity check to skip packet parsing if this is not a TCP packet. Fixes: `b96af92d6e` ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-16 20:51:50 +02:00
Pablo Neira Ayuso	cdd73cc545	netfilter: nft_exthdr: check for IPv6 packet before further processing ipv6_find_hdr() does not validate that this is an IPv6 packet. Add a sanity check for calling ipv6_find_hdr() to make sure an IPv6 packet is passed for parsing. Fixes: `96518518cc` ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-06-16 20:51:50 +02:00
Norbert Slusarek	5e87ddbe39	can: bcm: fix infoleak in struct bcm_msg_head On 64-bit systems, struct bcm_msg_head has an added padding of 4 bytes between struct members count and ival1. Even though all struct members are initialized, the 4-byte hole will contain data from the kernel stack. This patch zeroes out struct bcm_msg_head before usage, preventing infoleaks to userspace. Fixes: `ffd980f976` ("[CAN]: Add broadcast manager (bcm) protocol") Link: https://lore.kernel.org/r/trinity-7c1b2e82-e34f-4885-8060-2cd7a13769ce-1623532166177@3c-app-gmx-bs52 Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Norbert Slusarek <nslusarek@gmx.net> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-16 12:52:18 +02:00
Tetsuo Handa	8d0caedb75	can: bcm/raw/isotp: use per module netdevice notifier syzbot is reporting hung task at register_netdevice_notifier() [1] and unregister_netdevice_notifier() [2], for cleanup_net() might perform time consuming operations while CAN driver's raw/bcm/isotp modules are calling {register,unregister}_netdevice_notifier() on each socket. Change raw/bcm/isotp modules to call register_netdevice_notifier() from module's __init function and call unregister_netdevice_notifier() from module's __exit function, as with gw/j1939 modules are doing. Link: https://syzkaller.appspot.com/bug?id=391b9498827788b3cc6830226d4ff5be87107c30 [1] Link: https://syzkaller.appspot.com/bug?id=1724d278c83ca6e6df100a2e320c10d991cf2bce [2] Link: https://lore.kernel.org/r/54a5f451-05ed-f977-8534-79e7aa2bcc8f@i-love.sakura.ne.jp Cc: linux-stable <stable@vger.kernel.org> Reported-by: syzbot <syzbot+355f8edb2ff45d5f95fa@syzkaller.appspotmail.com> Reported-by: syzbot <syzbot+0f1827363a305f74996f@syzkaller.appspotmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Tested-by: syzbot <syzbot+355f8edb2ff45d5f95fa@syzkaller.appspotmail.com> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-16 12:52:18 +02:00
Oleksij Rempel	2030043e61	can: j1939: fix Use-after-Free, hold skb ref while in use This patch fixes a Use-after-Free found by the syzbot. The problem is that a skb is taken from the per-session skb queue, without incrementing the ref count. This leads to a Use-after-Free if the skb is taken concurrently from the session queue due to a CTS. Fixes: `9d71dd0c70` ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/r/20210521115720.7533-1-o.rempel@pengutronix.de Cc: Hillf Danton <hdanton@sina.com> Cc: linux-stable <stable@vger.kernel.org> Reported-by: syzbot+220c1a29987a9a490903@syzkaller.appspotmail.com Reported-by: syzbot+45199c1b73b4013525cf@syzkaller.appspotmail.com Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-06-16 12:52:18 +02:00
Jakub Kicinski	4d1fb7cde0	ethtool: add a stricter length check There has been a few errors in the ethtool reply size calculations, most of those are hard to trigger during basic testing because of skb size rounding up and netdev names being shorter than max. Add a more precise check. This change will affect the value of payload length displayed in case of -EMSGSIZE but that should be okay, "payload length" isn't a well defined term here. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 00:40:44 -07:00
Maciej Żenczykowski	1b3fc77176	inet_diag: add support for tw_mark Timewait sockets have included mark since approx 4.18. Cc: Eric Dumazet <edumazet@google.com> Cc: Jon Maxwell <jmaxwell37@gmail.com> Fixes: `0048369055` ("tcp: Add mark for TIMEWAIT sockets") Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 00:38:59 -07:00
Florian Westphal	30ad6a84f6	xfrm: avoid compiler warning when ipv6 is disabled with CONFIG_IPV6=n: xfrm_output.c:140:12: warning: 'xfrm6_hdr_offset' defined but not used Fixes: `9acf4d3b9e` ("xfrm: ipv6: add xfrm6_hdr_offset helper") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-16 08:32:14 +02:00
Liu Shixin	b8f6b0522c	netlabel: Fix memory leak in netlbl_mgmt_add_common Hulk Robot reported memory leak in netlbl_mgmt_add_common. The problem is non-freed map in case of netlbl_domhsh_add() failed. BUG: memory leak unreferenced object 0xffff888100ab7080 (size 96): comm "syz-executor537", pid 360, jiffies 4294862456 (age 22.678s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 ................ backtrace: [<0000000008b40026>] netlbl_mgmt_add_common.isra.0+0xb2a/0x1b40 [<000000003be10950>] netlbl_mgmt_add+0x271/0x3c0 [<00000000c70487ed>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 [<000000001f2ff614>] genl_rcv_msg+0x2bf/0x4f0 [<0000000089045792>] netlink_rcv_skb+0x134/0x3d0 [<0000000020e96fdd>] genl_rcv+0x24/0x40 [<0000000042810c66>] netlink_unicast+0x4a0/0x6a0 [<000000002e1659f0>] netlink_sendmsg+0x789/0xc70 [<000000006e43415f>] sock_sendmsg+0x139/0x170 [<00000000680a73d7>] ____sys_sendmsg+0x658/0x7d0 [<0000000065cbb8af>] ___sys_sendmsg+0xf8/0x170 [<0000000019932b6c>] __sys_sendmsg+0xd3/0x190 [<00000000643ac172>] do_syscall_64+0x37/0x90 [<000000009b79d6dc>] entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: `63c4168874` ("netlabel: Add network address selectors to the NetLabel/LSM domain mapping") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-15 11:19:04 -07:00
Changbin Du	e34492dea6	net: inline function get_net_ns_by_fd if NET_NS is disabled The function get_net_ns_by_fd() could be inlined when NET_NS is not enabled. Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-15 11:00:45 -07:00
Boris Sukholitko	0dca2c7404	net/sched: cls_flower: Remove match on n_proto The following flower filters fail to match packets: tc filter add dev eth0 ingress protocol 0x8864 flower \ action simple sdata hi64 tc filter add dev eth0 ingress protocol 802.1q flower \ vlan_ethtype 0x8864 action simple sdata "hi vlan" The protocol 0x8864 (ETH_P_PPP_SES) is a tunnel protocol. As such, it is being dissected by __skb_flow_dissect and it's internal protocol is being set as key->basic.n_proto. IOW, the existence of ETH_P_PPP_SES tunnel is transparent to the callers of __skb_flow_dissect. OTOH, in the filters above, cls_flower configures its key->basic.n_proto to the ETH_P_PPP_SES value configured by the user. Matching on this key fails because of __skb_flow_dissect "transparency" mentioned above. In the following, I would argue that the problem lies with cls_flower, unnessary attempting key->basic.n_proto match. There are 3 close places in fl_set_key in cls_flower setting up mask->basic.n_proto. They are (in reverse order of appearance in the code) due to: (a) No vlan is given: use TCA_FLOWER_KEY_ETH_TYPE parameter (b) One vlan tag is given: use TCA_FLOWER_KEY_VLAN_ETH_TYPE (c) Two vlans are given: use TCA_FLOWER_KEY_CVLAN_ETH_TYPE The match in case (a) is unneeded because flower has no its own eth_type parameter. It was removed by Jamal Hadi Salim in commit 488b41d020fb06428b90289f70a41210718f52b7 in iproute2. For TCA_FLOWER_KEY_ETH_TYPE the userspace uses the generic tc filter protocol field. Therefore the match for the case (a) is done by tc itself. The matches in cases (b), (c) are unneeded because the protocol will appear in and will be matched by flow_dissector_key_vlan.vlan_tpid. Therefore in the best case, key->basic.n_proto will try to repeat vlan key match again. The below patch removes mask->basic.n_proto setting and resets it to 0 in case (c). Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-15 10:26:51 -07:00
Kuniyuki Iwashima	d5e4ddaeb6	bpf: Support socket migration by eBPF. This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit `aac3fc320d` ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp	2021-06-15 18:01:06 +02:00
Kuniyuki Iwashima	e061047684	bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT. We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing to select a new listener. We can currently get a unique ID of each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the pointer of sk available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp	2021-06-15 18:01:06 +02:00
Kuniyuki Iwashima	d4f2c86b2b	tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK. This patch also changes the code to call reuseport_migrate_sock() and inet_reqsk_clone(), but unlike the other cases, we do not call inet_reqsk_clone() right after reuseport_migrate_sock(). Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener has three kinds of refcnt: (A) for listener itself (B) carried by reuqest_sock (C) sock_hold() in tcp_v[46]_rcv() While processing the req, (A) may disappear by close(listener). Also, (B) can disappear by accept(listener) once we put the req into the accept queue. So, we have to hold another refcnt (C) for the listener to prevent use-after-free. For socket migration, we call reuseport_migrate_sock() to select a listener with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv(). This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv(). Thus we have to take another refcnt (B) for the newly cloned request_sock. In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and try to put the new req into the accept queue. By migrating req after winning the "own_req" race, we can avoid such a worst situation: CPU 1 looks up req1 CPU 2 looks up req1, unhashes it, then CPU 1 loses the race CPU 3 looks up req2, unhashes it, then CPU 2 loses the race ... Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp	2021-06-15 18:01:06 +02:00
Kuniyuki Iwashima	c905dee622	tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs. As with the preceding patch, this patch changes reqsk_timer_handler() to call reuseport_migrate_sock() and inet_reqsk_clone() to migrate in-flight requests at retransmitting SYN+ACKs. If we can select a new listener and clone the request, we resume setting the SYN+ACK timer for the new req. If we can set the timer, we call inet_ehash_insert() to unhash the old req and put the new req into ehash. The noteworthy point here is that by unhashing the old req, another CPU processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and drop the final ACK packet. However, the new timer will recover this situation. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-7-kuniyu@amazon.co.jp	2021-06-15 18:01:06 +02:00
Kuniyuki Iwashima	54b92e8419	tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues. When we call close() or shutdown() for listening sockets, each child socket in the accept queue are freed at inet_csk_listen_stop(). If we can get a new listener by reuseport_migrate_sock() and clone the request by inet_reqsk_clone(), we try to add it into the new listener's accept queue by inet_csk_reqsk_queue_add(). If it fails, we have to call __reqsk_free() to call sock_put() for its listener and free the cloned request. After putting the full socket into ehash, tcp_v[46]_syn_recv_sock() sets NULL to ireq_opt/pktopts in struct inet_request_sock, but ipv6_opt can be non-NULL. So, we have to set NULL to ipv6_opt of the old request to avoid double free. Note that we do not update req->rsk_listener and instead clone the req to migrate because another path may reference the original request. If we protected it by RCU, we would need to add rcu_read_lock() in many places. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-6-kuniyu@amazon.co.jp	2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima	1cd62c2157	tcp: Add reuseport_migrate_sock() to select a new listener. reuseport_migrate_sock() does the same check done in reuseport_listen_stop_sock(). If the reuseport group is capable of migration, reuseport_migrate_sock() selects a new listener by the child socket hash and increments the listener's sk_refcnt beforehand. Thus, if we fail in the migration, we have to decrement it later. We will support migration by eBPF in the later commits. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp	2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima	333bb73f62	tcp: Keep TCP_CLOSE sockets in the reuseport group. When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb. By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0. When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour: - we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp	2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima	5c040eaf5d	tcp: Add num_closed_socks to struct sock_reuseport. As noted in the following commit, a closed listener has to hold the reference to the reuseport group for socket migration. This patch adds a field (num_closed_socks) to struct sock_reuseport to manage closed sockets within the same reuseport group. Moreover, this and the following commits introduce some helper functions to split socks[] into two sections and keep TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE sockets from the end. TCP_LISTEN----------> <-------TCP_CLOSE +---+---+ --- +---+ --- +---+ --- +---+ \| 0 \| 1 \| ... \| i \| ... \| j \| ... \| k \| +---+---+ --- +---+ --- +---+ --- +---+ i = num_socks - 1 j = max_socks - num_closed_socks k = max_socks - 1 This patch also extends reuseport_add_sock() and reuseport_grow() to support num_closed_socks. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp	2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima	f9ac779f88	net: Introduce net.ipv4.tcp_migrate_req. This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled or eBPF program is attached, we will be able to migrate child sockets from a listener to another in the same reuseport group after close() or shutdown() syscalls. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp	2021-06-15 18:01:05 +02:00
Luiz Augusto von Dentz	995fca15b7	Bluetooth: SMP: Fix crash when receiving new connection when debug is enabled When receiving a new connection pchan->conn won't be initialized so the code cannot use bt_dev_dbg as the pointer to hci_dev won't be accessible. Fixes: `2e1614f7d6` ("Bluetooth: SMP: Convert BT_ERR/BT_DBG to bt_dev_err/bt_dev_dbg") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-06-14 22:16:27 +02:00
Vladimir Oltean	ec13357263	net: flow_dissector: fix RPS on DSA masters After the blamed patch, __skb_flow_dissect() on the DSA master stopped adjusting for the length of the DSA headers. This is because it was told to adjust only if the needed_headroom is zero, aka if there is no DSA header. Of course, the adjustment should be done only if there _is_ a DSA header. Modify the comment too so it is clearer. Fixes: `4e50025129` ("net: dsa: generalize overhead for taggers that use both headers and trailers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 13:15:22 -07:00
Oleksandr Mazur	ddee9dbc3d	net: core: devlink: add dropped stats traps field Whenever query statistics is issued for trap, devlink subsystem would also fill-in statistics 'dropped' field. This field indicates the number of packets HW dropped and failed to report to the device driver, and thus - to the devlink subsystem itself. In case if device driver didn't register callback for hard drop statistics querying, 'dropped' field will be omitted and not filled. Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 13:04:25 -07:00
Pavel Skripkin	ad9d24c942	net: qrtr: fix OOB Read in qrtr_endpoint_post Syzbot reported slab-out-of-bounds Read in qrtr_endpoint_post. The problem was in wrong _size_ type: if (len != ALIGN(size, 4) + hdrlen) goto err; If size from qrtr_hdr is 4294967293 (0xfffffffd), the result of ALIGN(size, 4) will be 0. In case of len == hdrlen and size == 4294967293 in header this check won't fail and skb_put_data(skb, data + hdrlen, size); will read out of bound from data, which is hdrlen allocated block. Fixes: `194ccc8829` ("net: qrtr: Support decoding incoming v2 packets") Reported-and-tested-by: syzbot+1917d778024161609247@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 13:01:26 -07:00
Oleksij Rempel	c916e8e1ea	net: dsa: dsa_slave_phy_connect(): extend phy's flags with port specific phy flags The current get_phy_flags() is only processed when we connect to a PHY via a designed phy-handle property via phylink_of_phy_connect(), but if we fallback on the internal MDIO bus created by a switch and take the dsa_slave_phy_connect() path then we would not be processing that flag and using it at PHY connection time. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 12:54:43 -07:00
Taehee Yoo	ffa85b73c3	mld: avoid unnecessary high order page allocation in mld_newpack() If link mtu is too big, mld_newpack() allocates high-order page. But most mld packets don't need high-order page. So, it might waste unnecessary pages. To avoid this, it makes mld_newpack() try to allocate order-0 page. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 12:46:00 -07:00
Colin Ian King	b5ec0705ff	ipv6: fib6: remove redundant initialization of variable err The variable err is being initialized with a value that is never read, the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 12:42:26 -07:00
David Ahern	b87b04f501	ipv4: Fix device used for dst_alloc with local routes Oliver reported a use case where deleting a VRF device can hang waiting for the refcnt to drop to 0. The root cause is that the dst is allocated against the VRF device but cached on the loopback device. The use case (added to the selftests) has an implicit VRF crossing due to the ordering of the FIB rules (lookup local is before the l3mdev rule, but the problem occurs even if the FIB rules are re-ordered with local after l3mdev because the VRF table does not have a default route to terminate the lookup). The end result is is that the FIB lookup returns the loopback device as the nexthop, but the ingress device is in a VRF. The mismatch causes the dst alloc against the VRF device but then cached on the loopback. The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu): pick the dst alloc device based the fib lookup result but with checks that the result has a nexthop device (e.g., not an unreachable or prohibit entry). Fixes: `f5a0aab84b` ("net: ipv4: dst for local input routes should use l3mdev if relevant") Reported-by: Oliver Herms <oliver.peter.herms@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 12:30:53 -07:00
Jakub Kicinski	e175aef902	ethtool: strset: fix message length calculation Outer nest for ETHTOOL_A_STRSET_STRINGSETS is not accounted for. This may result in ETHTOOL_MSG_STRSET_GET producing a warning like: calculated message payload length (684) not sufficient WARNING: CPU: 0 PID: 30967 at net/ethtool/netlink.c:369 ethnl_default_doit+0x87a/0xa20 and a splat. As usually with such warnings three conditions must be met for the warning to trigger: - there must be no skb size rounding up (e.g. reply_size of 684); - string set must be per-device (so that the header gets populated); - the device name must be at least 12 characters long. all in all with current user space it looks like reading priv flags is the only place this could potentially happen. Or with syzbot :) Reported-by: syzbot+59aa77b92d06cd5a54f2@syzkaller.appspotmail.com Fixes: `71921690f9` ("ethtool: provide string sets with STRSET_GET request") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 12:14:24 -07:00
Tyson Moore	4f667b8e04	sch_cake: revise docs for RFC 8622 LE PHB support Commit `b8392808eb` ("sch_cake: add RFC 8622 LE PHB support to CAKE diffserv handling") added the LE mark to the Bulk tin. Update the comments to reflect the change. Signed-off-by: Tyson Moore <tyson@tyson.me> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-14 12:10:57 -07:00
Steffen Klassert	6fd06963fa	xfrm: Fix error reporting in xfrm_state_construct. When memory allocation for XFRMA_ENCAP or XFRMA_COADDR fails, the error will not be reported because the -ENOMEM assignment to the err variable is overwritten before. Fix this by moving these two in front of the function so that memory allocation failures will be reported. Reported-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-06-14 11:58:43 +02:00
Johannes Berg	88b710532e	wwan: add interface creation support Add support to create (and destroy) interfaces via a new rtnetlink kind "wwan". The responsible driver has to use the new wwan_register_ops() to make this possible. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-12 13:16:45 -07:00
Johannes Berg	00e77ed8e6	rtnetlink: add IFLA_PARENT_[DEV\|DEV_BUS]_NAME In some cases, for example in the upcoming WWAN framework changes, there's no natural "parent netdev", so sometimes dummy netdevs are created or similar. IFLA_PARENT_DEV_NAME is a new attribute intended to contain a device (sysfs, struct device) name that can be used instead when creating a new netdev, if the rtnetlink family implements it. As suggested by Parav Pandit, we also introduce IFLA_PARENT_DEV_BUS_NAME attribute in order to uniquely identify a device on the system (with bus/name pair). ip-link(8) support for the generic parent device attributes will help us avoid code duplication, so no other link type will require a custom code to handle the parent name attribute. E.g. the WWAN interface creation command will looks like this: $ ip link add wwan0-1 parent-dev wwan0 type wwan channel-id 1 So, some future subsystem (or driver) FOO will have an interface creation command that looks like this: $ ip link add foo1-3 parent-dev foo1 type foo bar-id 3 baz-type Y Below is an example of dumping link info of a random device with these new attributes: $ ip --details link show wlp0s20f3 4: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000 ... parent_bus pci parent_dev 0000:00:14.3 Co-developed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Co-developed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Suggested-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-12 13:16:45 -07:00
Johannes Berg	8c713dc93c	rtnetlink: add alloc() method to rtnl_link_ops In order to make rtnetlink ops that can create different kinds of devices, like what we want to add to the WWAN framework, the priv_size and setup parameters aren't quite sufficient. Make this easier to manage by allowing ops to allocate their own netdev via an @alloc method that gets the tb netlink data. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-12 13:16:45 -07:00
Changbin Du	ea6932d70e	net: make get_net_ns return error if NET_NS is disabled There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_operations is not implemented in this case. [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [7.670268] pgd = 32b54000 [7.670544] [00000010] *pgd=00000000 [7.671861] Internal error: Oops: 5 [#1] SMP ARM [7.672315] Modules linked in: [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16 [7.673309] Hardware name: Generic DT based system [7.673642] PC is at nsfs_evict+0x24/0x30 [7.674486] LR is at clear_inode+0x20/0x9c The same to tun SIOCGSKNS command. To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is disabled. Meanwhile move it to right place net/core/net_namespace.c. Signed-off-by: Changbin Du <changbin.du@gmail.com> Fixes: `c62cce2cae` ("net: add an ioctl to get a socket network namespace") Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-12 13:13:08 -07:00
Julian Wiedmann	87c272c618	net/af_iucv: clean up some forward declarations The forward declarations for the iucv_handler callbacks are causing various compile warnings with gcc-11. Reshuffle the code to get rid of these prototypes. Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-12 13:06:33 -07:00
Arseny Krasnov	6e90a57795	vsock/loopback: enable SEQPACKET for transport Add SEQPACKET ops for loopback transport and 'seqpacket_allow()' callback. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	53efbba12c	virtio/vsock: enable SEQPACKET for transport To make transport work with SOCK_SEQPACKET add two things: 1) SOCK_SEQPACKET ops for virtio transport and 'seqpacket_allow()' callback. 2) Handling of SEQPACKET bit: guest tries to negotiate it with vhost, so feature will be enabled only if bit is negotiated with device. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	9ac841f5e9	virtio/vsock: rest of SOCK_SEQPACKET support Small updates to make SOCK_SEQPACKET work: 1) Send SHUTDOWN on socket close for SEQPACKET type. 2) Set SEQPACKET packet type during send. 3) Set 'VIRTIO_VSOCK_SEQ_EOR' bit in flags for last packet of message. 4) Implement data check function for SEQPACKET. 5) Check for max datagram size. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	e4b1ef152f	virtio/vsock: add SEQPACKET receive logic Update current receive logic for SEQPACKET support: performs check for packet and socket types on receive(if mismatch, then reset connection). Increment EOR counter on receive. Also if buffer of new packet was appended to buffer of last packet in rx queue, update flags of last packet with flags of new packet. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	44931195a5	virtio/vsock: dequeue callback for SOCK_SEQPACKET Callback fetches RW packets from rx queue of socket until whole record is copied(if user's buffer is full, user is not woken up). This is done to not stall sender, because if we wake up user and it leaves syscall, nobody will send credit update for rest of record, and sender will wait for next enter of read syscall at receiver's side. So if user buffer is full, we just send credit update and drop data. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	c10844c597	virtio/vsock: simplify credit update function API This function is static and 'hdr' arg was always NULL. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	b93f8877c1	virtio/vsock: set packet's type in virtio_transport_send_pkt_info() There is no need to set type of packet which differs from type of socket, so move passing type of packet from 'info' structure to 'virtio_transport_send_pkt_info()' function. Since at current time only stream type is supported, set it directly in 'virtio_ transport_send_pkt_info()', so callers don't need to set it. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:47 -07:00
Arseny Krasnov	8cb48554ad	af_vsock: update comments for stream sockets Replace 'stream' to 'connection oriented' in comments as SEQPACKET is also connection oriented. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:46 -07:00
Arseny Krasnov	0798e78b10	af_vsock: rest of SEQPACKET support Add socket ops for SEQPACKET type and .seqpacket_allow() callback to query transports if they support SEQPACKET. Also split path for data check for STREAM and SEQPACKET branches. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:46 -07:00
Arseny Krasnov	fbe70c4807	af_vsock: implement send logic for SEQPACKET Update current stream enqueue function for SEQPACKET support: 1) Call transport's seqpacket enqueue callback. 2) Return value from enqueue function is whole record length or error for SOCK_SEQPACKET. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:46 -07:00
Arseny Krasnov	9942c192b2	af_vsock: implement SEQPACKET receive loop Add receive loop for SEQPACKET. It looks like receive loop for STREAM, but there are differences: 1) It doesn't call notify callbacks. 2) It doesn't care about 'SO_SNDLOWAT' and 'SO_RCVLOWAT' values, because there is no sense for these values in SEQPACKET case. 3) It waits until whole record is received. 4) It processes and sets 'MSG_TRUNC' flag. So to avoid extra conditions for two types of socket inside one loop, two independent functions were created. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-11 13:32:46 -07:00

... 2 3 4 5 6 ...

65728 Commits