Table 9-251—Supported VHT-MCS and NSS Set subfields, it has subfield VHT Extended
NSS BW Capable, its definition is:
Indicates whether the STA is capable of interpreting the Extended NSS BW
Support subfield of the VHT Capabilities Information field.
This patch is to add check for the subfield.
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/20210524033624.16993-1-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
offload_flags has been introduced to indicate encap status of each interface.
An interface can encap offload at runtime, or if it has some extra limitations
it can simply override the flags, so it's more flexible to check offload_flags
in Tx path.
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Link: https://lore.kernel.org/r/177785418cf407808bf3a44760302d0647076990.1623961575.git.ryder.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make ieee80211_tx_h_rate_ctrl() get called on dequeue to improve
performance since it reduces the turnaround time for rate control.
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Link: https://lore.kernel.org/r/20210617163113.75815-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This avoids calling back into tx handlers from within the rate control module.
Preparation for deferring rate control until tx dequeue
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210617163113.75815-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we have been keeping per-CPU statistics, consider them
regardless of USES_RSS, because we may not actually fill
those, for example in non-fast-RX cases when the connection
is not compatible with fast-RX. If we didn't fill them, the
additional data will be zero and not affect anything, and
if we did fill them then it's more correct to consider them.
This fixes an issue in mesh mode where some statistics are
not updated due to USES_RSS being set, but fast-RX isn't
used.
Reported-by: Thiraviyam Mariyappan <tmariyap@codeaurora.org>
Link: https://lore.kernel.org/r/20210610220814.13b35f5797c5.I511e9b33c5694e0d6cef4b6ae755c873d7c22124@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In 2G band, a HE sta can only supports HT and HE, but not supports VHT.
In this case, default HE tx bitrate mask isn't filled, when we use iw to
set bitrates without any parameter.
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://lore.kernel.org/r/20210609075944.51130-1-pkshih@realtek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Use newly introduced helper function ieee80211_is_tx_data to check if
frame is a data frame. Takes into account that hardware encapsulation
can be enabled for a frame and therefore no ieee80211 header is present.
Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de>
Link: https://lore.kernel.org/r/20210519122019.92359-4-borgers@mi.fu-berlin.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Data Frames with no ack flag set should be handled by the rate
controler. Make sure we reach the rate controler by returning early
from rate_control_send_low if the frame is a data frame with no ack
flag.
Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de>
Link: https://lore.kernel.org/r/20210519122019.92359-3-borgers@mi.fu-berlin.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We need to differentiate these frames since the ones we
currently put on the skb_queue_tdls_chsw have already
been converted to ethernet format, but now that we've
got a single place to enqueue to the sdata->skb_queue
this isn't hard. Just differentiate based on protocol
and adjust the code to queue the SKBs appropriately.
Link: https://lore.kernel.org/r/20210517230754.17034990abef.I5342f2183c0d246b18d36c511eb3b6be298a6572@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
QoS Data Frames that were sent with a No Ack policy should be ignored by
the minstrel statistics. There will never be an Ack for these frames so
there is no way to draw conclusions about the success of the transmission.
Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de>
Link: https://lore.kernel.org/r/20210517120145.132814-1-borgers@mi.fu-berlin.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The "ap_info->tbtt_info_len" and "length" variables are the same value
but it is confusing how the names are mixed up. Let's use "length"
everywhere for consistency.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YJaMNzZENkYFAYQX@mwanda
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Variable 'ret' is set to -ENODEV but this value is never read as it
is overwritten with a new value later on, hence it is a redundant
assignment and can be removed.
Clean up the following clang-analyzer warning:
net/mac80211/debugfs_netdev.c:60:2: warning: Value stored to 'ret' is
never read [clang-analyzer-deadcode.DeadStores]
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619774483-116805-1-git-send-email-yang.lee@linux.alibaba.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Variable 'ps' is set to wdev->ps but this value is never read as it is
overwritten with a new value later on, hence it is a redundant
assignment and can be removed.
Cleans up the following clang-analyzer warning:
net/wireless/wext-compat.c:1170:7: warning: Value stored to 'ps' during
its initialization is never read [clang-analyzer-deadcode.DeadStores]
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619603945-116891-1-git-send-email-yang.lee@linux.alibaba.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix the following out-of-bounds warning:
net/wireless/wext-spy.c:178:2: warning: 'memcpy' offset [25, 28] from the object at 'threshold' is out of the bounds of referenced subobject 'low' with type 'struct iw_quality' at offset 20 [-Warray-bounds]
The problem is that the original code is trying to copy data into a
couple of struct members adjacent to each other in a single call to
memcpy(). This causes a legitimate compiler warning because memcpy()
overruns the length of &threshold.low and &spydata->spy_thr_low. As
these are just a couple of struct members, fix this by using direct
assignments, instead of memcpy().
This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().
Link: https://github.com/KSPP/linux/issues/109
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210422200032.GA168995@embeddedor
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pablo Neira Ayuso says
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Nicolas Dichtel updates MAINTAINERS file to add Netfilter IRC channel.
2) Skip non-IPv6 packets in nft_exthdr.
3) Skip non-TCP packets in nft_osf.
4) Skip non-TCP/UDP packets in nft_tproxy.
5) Memleak in hardware offload infrastructure when counters are used
for first time in a rule.
6) The VLAN transfer routine must use FLOW_DISSECTOR_KEY_BASIC instead
of FLOW_DISSECTOR_KEY_CONTROL. Moreover, make a more robust check
for 802.1q and 802.1ad to restore simple matching on transport
protocols.
7) Fix bogus EPERM when listing a ruleset when table ownership flag
is set on.
8) Honor table ownership flag when table is referenced by handle.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The inner_ipproto saves the inner IP protocol of the plain
text packet. This allows vendor's IPsec feature making offload
decision at skb's features_check and configuring hardware at
ndo_start_xmit.
For example, ConnectX6-DX IPsec device needs the plaintext's
IP protocol to support partial checksum offload on
VXLAN/GENEVE packet over IPsec transport mode tunnel.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The current cleanup rbuf tries a bit too hard to avoid acquiring
the subflow socket lock. We may end-up delaying the needed ack,
or skip acking a blocked subflow.
Address the above extending the conditions used to trigger the cleanup
to reflect more closely what TCP does and invoking tcp_cleanup_rbuf()
on all the active subflows.
Note that we can't replicate the exact tests implemented in
tcp_cleanup_rbuf(), as MPTCP lacks some of the required info - e.g.
ping-pong mode.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new flag named deny_join_id0 in struct
mptcp_options_received. Set it when MP_CAPABLE with the flag
MPTCP_CAP_DENYJOIN_ID0 is received.
Also add a new flag remote_deny_join_id0 in struct mptcp_pm_data. When the
flag deny_join_id0 is set, set this remote_deny_join_id0 flag.
In mptcp_pm_create_subflow_or_signal_addr, if the remote_deny_join_id0 flag
is set, and the remote address id is zero, stop this connection.
Suggested-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defined a new flag MPTCP_CAP_DENY_JOIN_ID0 for the third bit,
labeled "C" of the MP_CAPABLE option.
Add a new flag allow_join_id0 in struct mptcp_out_options. If this flag is
set, send out the MP_CAPABLE option with the flag MPTCP_CAP_DENY_JOIN_ID0.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new sysctl, named allow_join_initial_addr_port, to
control whether allow peers to send join requests to the IP address and
port number used by the initial subflow.
Suggested-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, sctp over udp was using udp tunnel's icmp err process, which
only does sk lookup on sctp side. However for sctp's icmp error process,
there are more things to do, like syncing assoc pmtu/retransmit packets
for toobig type err, and starting proto_unreach_timer for unreach type
err etc.
Now after adding PLPMTUD, which also requires to process toobig type err
on sctp side. This patch is to process icmp err on sctp side by parsing
the type/code/info in .encap_err_lookup and call sctp's icmp processing
functions. Note as the 'redirect' err process needs to know the outer
ip(v6) header's, we have to leave it to udp(v6)_err to handle it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to extract sctp_v4_err_handle() from sctp_v4_err() to
only handle the icmp err after the sock lookup, and it also makes
the code clearer.
sctp_v4_err_handle() will be used in sctp over udp's err handling
in the following patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to extract sctp_v6_err_handle() from sctp_v6_err() to
only handle the icmp err after the sock lookup, and it also makes
the code clearer.
sctp_v6_err_handle() will be used in sctp over udp's err handling
in the following patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same as in tcp_v6_err() and __udp6_lib_err(), there's no need to
hold idev in sctp_v6_err(), so just call __in6_dev_get() instead.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_transport_pl_reset() is called whenever any of these 3 members in
transport is changed:
- probe_interval
- param_flags & SPP_PMTUD_ENABLE
- state == ACTIVE
If all are true, start the PLPMTUD when it's not yet started. If any of
these is false, stop the PLPMTUD when it's already running.
sctp_transport_pl_update() is called when the transport dst has changed.
It will restart the PLPMTUD probe. Again, the pathmtu won't change but
use the dst's mtu until the Search phase is done.
Note that after using PLPMTUD, the pathmtu is only initialized with the
dst mtu when the transport dst changes. At other time it is updated by
pl.pmtu. So sctp_transport_pmtu_check() will be called only when PLPMTUD
is disabled in sctp_packet_config().
After this patch, the PLPMTUD feature from RFC8899 will be activated
and can be used by users.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PLPMTUD will short-circuit the old process for icmp TOOBIG packets.
This part is described in rfc8899#section-4.6.2 (PL_PTB_SIZE =
PTB_SIZE - other_headers_len). Note that from rfc8899#section-5.2
State Machine, each case below is for some specific states only:
a) PL_PTB_SIZE < MIN_PLPMTU || PL_PTB_SIZE >= PROBED_SIZE,
discard it, for any state
b) MIN_PLPMTU < PL_PTB_SIZE < BASE_PLPMTU,
Base -> Error, for Base state
c) BASE_PLPMTU <= PL_PTB_SIZE < PLPMTU,
Search -> Base or Complete -> Base, for Search and Complete states.
d) PLPMTU < PL_PTB_SIZE < PROBED_SIZE,
set pl.probe_size to PL_PTB_SIZE then verify it, for Search state.
The most important one is case d), which will help find the optimal
fast during searching. Like when pathmtu = 1392 for SCTP over IPv4,
the search will be (20 is iphdr_len):
1. probe with 1200 - 20
2. probe with 1232 - 20
3. probe with 1264 - 20
...
7. probe with 1388 - 20
8. probe with 1420 - 20
When sending the probe with 1420 - 20, TOOBIG may come with PL_PTB_SIZE =
1392 - 20. Then it matches case d), and saves some rounds to try with the
1392 - 20 probe. But of course, PLPMTUD doesn't trust TOOBIG packets, and
it will go back to the common searching once the probe with the new size
can't be verified.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As described in rfc8899#section-5.2, when a probe succeeds, there might
be the following state transitions:
- Base -> Search, occurs when probe succeeds with BASE_PLPMTU,
pl.pmtu is not changing,
pl.probe_size increases by SCTP_PL_BIG_STEP,
- Error -> Search, occurs when probe succeeds with BASE_PLPMTU,
pl.pmtu is changed from SCTP_MIN_PLPMTU to SCTP_BASE_PLPMTU,
pl.probe_size increases by SCTP_PL_BIG_STEP.
- Search -> Search Complete, occurs when probe succeeds with the probe
size SCTP_MAX_PLPMTU less than pl.probe_high,
pl.pmtu is not changing, but update *pathmtu* with it,
pl.probe_size is set back to pl.pmtu to double check it.
- Search Complete -> Search, occurs when probe succeeds with the probe
size equal to pl.pmtu,
pl.pmtu is not changing,
pl.probe_size increases by SCTP_PL_MIN_STEP.
So search process can be described as:
1. When it just enters 'Search' state, *pathmtu* is not updated with
pl.pmtu, and probe_size increases by a big step (SCTP_PL_BIG_STEP)
each round.
2. Until pl.probe_high is set when a probe fails, and probe_size
decreases back to pl.pmtu, as described in the last patch.
3. When the probe with the new size succeeds, probe_size changes to
increase by a small step (SCTP_PL_MIN_STEP) due to pl.probe_high
is set.
4. Until probe_size is next to pl.probe_high, the searching finishes and
it goes to 'Complete' state and updates *pathmtu* with pl.pmtu, and
then probe_size is set to pl.pmtu to confirm by once more probe.
5. This probe occurs after "30 * probe_inteval", a much longer time than
that in Search state. Once it is done it goes to 'Search' state again
with probe_size increased by SCTP_PL_MIN_STEP.
As we can see above, during the searching, pl.pmtu changes while *pathmtu*
doesn't. *pathmtu* is only updated when the search finishes by which it
gets an optimal value for it. A big step is used at the beginning until
it gets close to the optimal value, then it changes to a small step until
it has this optimal value.
The small step is also used in 'Complete' until it goes to 'Search' state
again and the probe with 'pmtu + the small step' succeeds, which means a
higher size could be used. Then probe_size changes to increase by a big
step again until it gets close to the next optimal value.
Note that anytime when black hole is detected, it goes directly to 'Base'
state with pl.pmtu set to SCTP_BASE_PLPMTU, as described in the last patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The state transition is described in rfc8899#section-5.2,
PROBE_COUNT == MAX_PROBES means the probe fails for MAX times, and the
state transition includes:
- Base -> Error, occurs when BASE_PLPMTU Confirmation Fails,
pl.pmtu is set to SCTP_MIN_PLPMTU,
probe_size is still SCTP_BASE_PLPMTU;
- Search -> Base, occurs when Black Hole Detected,
pl.pmtu is set to SCTP_BASE_PLPMTU,
probe_size is set back to SCTP_BASE_PLPMTU;
- Search Complete -> Base, occurs when Black Hole Detected
pl.pmtu is set to SCTP_BASE_PLPMTU,
probe_size is set back to SCTP_BASE_PLPMTU;
Note a black hole is encountered when a sender is unaware that packets
are not being delivered to the destination endpoint. So it includes the
probe failures with equal probe_size to pl.pmtu, and definitely not
include that with greater probe_size than pl.pmtu. The later one is the
normal probe failure where probe_size should decrease back to pl.pmtu
and pl.probe_high is set. pl.probe_high would be used on HB ACK recv
path in the next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does exactly what rfc8899#section-6.2.1.2 says:
The SCTP sender needs to be able to determine the total size of a
probe packet. The HEARTBEAT chunk could carry a Heartbeat
Information parameter that includes, besides the information
suggested in [RFC4960], the probe size to help an implementation
associate a HEARTBEAT ACK with the size of probe that was sent. The
sender could also use other methods, such as sending a nonce and
verifying the information returned also contains the corresponding
nonce. The length of the PAD chunk is computed by reducing the
probing size by the size of the SCTP common header and the HEARTBEAT
chunk.
Note that HB ACK chunk will carry back whatever HB chunk carried, including
the probe_size we put it in; We also check hbinfo->probe_size in the HB ACK
against link->pl.probe_size to validate this HB ACK chunk.
v1->v2:
- Remove the unused 'sp' and add static for sctp_packet_bundle_pad().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are 3 timers described in rfc8899#section-5.1.1:
PROBE_TIMER, PMTU_RAISE_TIMER, CONFIRMATION_TIMER
This patches adds a 'probe_timer' in transport, and it works as either
PROBE_TIMER or PMTU_RAISE_TIMER. At most time, it works as PROBE_TIMER
and expires every a 'probe_interval' time to send the HB probe packet.
When transport pl enters COMPLETE state, it works as PMTU_RAISE_TIMER
and expires in 'probe_interval * 30' time to go back to SEARCH state
and do searching again.
SCTP HB is an acknowledged packet, CONFIRMATION_TIMER is not needed.
The timer will start when transport pl enters BASE state and stop
when it enters DISABLED state.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this socket option, users can change probe_interval for
a transport, asoc or sock after it's created.
Note that if the change is for an asoc, also apply the change
to each transport in this asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PLPMTUD can be enabled by doing 'sysctl -w net.sctp.probe_interval=n'.
'n' is the interval for PLPMTUD probe timer in milliseconds, and it
can't be less than 5000 if it's not 0.
All asoc/transport's PLPMTUD in a new socket will be enabled by default.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This chunk is defined in rfc4820#section-3, and used to pad an
SCTP packet. The receiver must discard this chunk and continue
processing the rest of the chunks in the packet.
Add it now, as it will be bundled with a heartbeat chunk to probe
pmtu in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes openvswitch module use the event tracing framework
to log the upcall interface and action execution pipeline. When
using openvswitch as the packet forwarding engine, some types of
debugging are made possible simply by using the ovs-vswitchd's
ofproto/trace command. However, such a command has some
limitations:
1. When trying to trace packets that go through the CT action,
the state of the packet can't be determined, and probably
would be potentially wrong.
2. Deducing problem packets can sometimes be difficult as well
even if many of the flows are known
3. It's possible to use the openvswitch module even without
the ovs-vswitchd (although, not common use).
Introduce the event tracing points here to make it possible for
working through these problems in kernel space. The style is
copied from the mac80211 driver-trace / trace code for
consistency - this creates some checkpatch splats, but the
official 'guide' for adding tracepoints, as well as the existing
examples all add the same splats so it seems acceptable.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Validate the offset to read from module EEPROM as part of the netlink
policy and remove the corresponding check from the code.
This also makes it possible to query the offset range from user space:
$ genl ctrl policy name ethtool
...
ID: 0x14 policy[32]:attr[2]: type=U32 range:[0,255]
...
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Validate the number of bytes to read from the module EEPROM as part of
the netlink policy and remove the corresponding check from the code.
This also makes it possible to query the length range from user space:
$ genl ctrl policy name ethtool
...
ID: 0x14 policy[32]:attr[3]: type=U32 range:[1,128]
...
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'ETHTOOL_A_MODULE_EEPROM_DATA' attribute is not part of the get
request.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return statements are not needed in Void function.
Signed-off-by: gushengxian <gushengxian@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing source address validation, the flowi4 struct used for
fib_lookup should be in the reverse direction to the given skb.
fl4_dport and fl4_sport returned by fib4_rules_early_flow_dissect
should thus be swapped.
Fixes: 5a847a6e14 ("net/ipv4: Initialize proto and ports in flow struct")
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6c11fbf97e ("ip6_tunnel: add MPLS transmit support")
moved assiging inner_ipproto down from ipxip6_tnl_xmit() to
its callee ip6_tnl_xmit(). The latter is also used by GRE.
Since commit 3872035241 ("gre: Use inner_proto to obtain inner
header protocol") GRE had been depending on skb->inner_protocol
during segmentation. It sets it in gre_build_header() and reads
it in gre_gso_segment(). Changes to ip6_tnl_xmit() overwrite
the protocol, resulting in GSO skbs getting dropped.
Note that inner_protocol is a union with inner_ipproto,
GRE uses the former while the change switched it to the latter
(always setting it to just IPPROTO_GRE).
Restore the original location of skb_set_inner_ipproto(),
it is unclear why it was moved in the first place.
Fixes: 6c11fbf97e ("ip6_tunnel: add MPLS transmit support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 7896248983 ("mptcp: add skeleton to sync msk socket
options to subflows") introduced a duplicate declaration of
mptcp_setsockopt(), just drop it.
Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 7896248983 ("mptcp: add skeleton to sync msk socket options to subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The msk socket state is currently updated in a few spots without
owning the msk socket lock itself.
Some of such operations are safe, as they happens before exposing
the msk socket to user-space and can't race with other changes.
A couple of them, at connect time, can actually race with close()
or shutdown(), leaving breaking the socket state machine.
This change addresses the issue moving such update under the msk
socket lock with the usual:
<acquire spinlock>
<check sk lock onwers>
<ev defer to release_cb>
scheme.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/56
Fixes: 8fd738049a ("mptcp: fallback in case of simultaneous connect")
Fixes: c3c123d16c ("net: mptcp: don't hang in mptcp_sendmsg() after TCP fallback")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 32-bit architecture, the result of sizeof() is a 32-bit integer so
the expression becomes the multiplication between 2 32-bit integer which
can potentially leads to integer overflow. As a result,
bpf_map_area_alloc() allocates less memory than needed.
Fix this by casting 1 operand to u64.
Fixes: 0d2c4f9640 ("bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps")
Fixes: 99c51064fb ("devmap: Use bpf_map_area_alloc() for allocating hash buckets")
Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210613143440.71975-1-minhquangbui99@gmail.com
Account this exceptional events for better introspection.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we check the msk state to avoid enqueuing new
skbs at msk shutdown time.
Such test is racy - as we can't acquire the msk socket lock -
and useless, as the caller already checked the subflow
field 'disposable', covering the same scenario in a race
free manner - read and updated under the ssk socket lock.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we don't flush entirely the receive queue, we need set
again such bit later. We can simply avoid clearing it.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a bunch of callsite where the ssk socket
lock is acquired using the full-blown version eligible for
the fast variant. Let's move to the latter.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mentioned cache was introduced to reduce the number of skb
allocation in atomic context, but the required complexity is
excessive.
This change remove the mentioned cache.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nft_table_lookup_byhandle() also needs to validate the netlink PortID
owner when deleting a table by handle.
Fixes: 6001a930ce ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_table_lookup() allows us to obtain the table object by the name and
the family. The netlink portID validation needs to be skipped for the
dump path, since the ownership only applies to commands to update the
given table. Skip validation if the specified netlink PortID is zero
when calling nft_table_lookup().
Fixes: 6001a930ce ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case of xfrm offload, if xdo_dev_state_add() of driver returns
-EOPNOTSUPP, xfrm offload fallback is failed.
In xfrm state_add() both xso->dev and xso->real_dev are initialized to
dev and when err(-EOPNOTSUPP) is returned only xso->dev is set to null.
So in this scenario the condition in func validate_xmit_xfrm(),
if ((x->xso.dev != dev) && (x->xso.real_dev == dev))
return skb;
returns true, due to which skb is returned without calling esp_xmit()
below which has fallback code. Hence the CRYPTO_FALLBACK is failing.
So fixing this with by keeping x->xso.real_dev as NULL when err is
returned in func xfrm_dev_state_add().
Fixes: bdfd2d1fa7 ("bonding/xfrm: use real_dev instead of slave_dev")
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This reverts commit 0dca2c7404.
The commit in question breaks hardware offload of flower filters.
Quoting Vladimir Oltean <olteanv@gmail.com>:
fl_hw_replace_filter() and fl_reoffload() create a struct
flow_cls_offload with a rule->match.mask member derived from the mask
of the software classifier: &f->mask->key - that same mask that is used
for initializing the flow dissector keys, and the one from which Boris
removed the basic.n_proto member because it was bothering him.
Reported-by: Vadym Kochan <vadym.kochan@plvision.eu>
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The client's sk_state will be set to TCP_ESTABLISHED if the server
replay the client's connect request.
However, if the client has pending signal, its sk_state will be set
to TCP_CLOSE without notify the server, so the server will hold the
corrupt connection.
client server
1. sk_state=TCP_SYN_SENT |
2. call ->connect() |
3. wait reply |
| 4. sk_state=TCP_ESTABLISHED
| 5. insert to connected list
| 6. reply to the client
7. sk_state=TCP_ESTABLISHED |
8. insert to connected list |
9. *signal pending* <--------------------- the user kill client
10. sk_state=TCP_CLOSE |
client is exiting... |
11. call ->release() |
virtio_transport_close
if (!(sk->sk_state == TCP_ESTABLISHED ||
sk->sk_state == TCP_CLOSING))
return true; *return at here, the server cannot notice the connection is corrupt*
So the client should notify the peer in this case.
Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Norbert Slusarek <nslusarek@gmx.net>
Cc: Andra Paraschiv <andraprs@amazon.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: David Brazdil <dbrazdil@google.com>
Cc: Alexander Popov <alex.popov@linux.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lkml.org/lkml/2021/5/17/418
Signed-off-by: lixianming <lixianming5@huawei.com>
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify the pr_info content from int to char * in sock_register() and
sock_unregister(), this looks more readable.
Fixed build error in ARCH=sparc64.
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of 32 bit DSN expansion is buggy.
After the previous patch, we can simply reuse the newly
introduced helper to do the expansion safely.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/120
Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving 32 bits DSS ack from the peer, the MPTCP need
to expand them to 64 bits value. The current code is buggy
WRT detecting 32 bits ack wrap-around: when the wrap-around
happens the current unsigned 32 bit ack value is lower than
the previous one.
Additionally check for possible reverse wrap and make the helper
visible, so that we could re-use it for the next patch.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/204
Fixes: cc9d256698 ("mptcp: update per unacked sequence on pkt reception")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VLAN transfer logic should actually check for
FLOW_DISSECTOR_KEY_BASIC, not FLOW_DISSECTOR_KEY_CONTROL. Moreover, do
not fallback to case 2) .n_proto is set to 802.1q or 802.1ad, if
FLOW_DISSECTOR_KEY_BASIC is unset.
Fixes: 783003f3bb ("netfilter: nftables_offload: special ethertype handling for VLAN")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Release flow from the abort path, this is easy to reproduce since
b72920f6e4 ("netfilter: nftables: counter hardware offload support").
If the preparation phase fails, then the abort path is exercised without
releasing the flow rule object.
unreferenced object 0xffff8881f0fa7700 (size 128):
comm "nft", pid 1335, jiffies 4294931120 (age 4163.740s)
hex dump (first 32 bytes):
08 e4 de 13 82 88 ff ff 98 e4 de 13 82 88 ff ff ................
48 e4 de 13 82 88 ff ff 01 00 00 00 00 00 00 00 H...............
backtrace:
[<00000000634547e7>] flow_rule_alloc+0x26/0x80
[<00000000c8426156>] nft_flow_rule_create+0xc9/0x3f0 [nf_tables]
[<0000000075ff8e46>] nf_tables_newrule+0xc79/0x10a0 [nf_tables]
[<00000000ba65e40e>] nfnetlink_rcv_batch+0xaac/0xf90 [nfnetlink]
[<00000000505c614a>] nfnetlink_rcv+0x1bb/0x1f0 [nfnetlink]
[<00000000eb78e1fe>] netlink_unicast+0x34b/0x480
[<00000000a8f72c94>] netlink_sendmsg+0x3af/0x690
[<000000009cb1ddf4>] sock_sendmsg+0x96/0xa0
[<0000000039d06e44>] ____sys_sendmsg+0x3fe/0x440
[<00000000137e82ca>] ___sys_sendmsg+0xd8/0x140
[<000000000c6bf6a6>] __sys_sendmsg+0xb3/0x130
[<0000000043bd6268>] do_syscall_64+0x40/0xb0
[<00000000afdebc2d>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Remove flow rule release from the offload commit path, otherwise error
from the offload commit phase might trigger a double-free due to the
execution of the abort_offload -> abort. After this patch, the abort
path takes care of releasing the flow rule.
This fix also needs to move the nft_flow_rule_create() call before the
transaction object is added otherwise the abort path might find a NULL
pointer to the flow rule object for the NFT_CHAIN_HW_OFFLOAD case.
While at it, rename BASIC-like goto tags to slightly more meaningful
names rather than adding a new "err3" tag.
Fixes: 63b48c73ff ("netfilter: nf_tables_offload: undo updates if transaction fails")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The kernel version of snprintf() can't return negatives. The
"ret > (int)sizeof(sym)" check is off by one because and it should be
>=. Finally, we need to set a negative error code.
Fixes: e2cf17d377 ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With MRP hardware assist being supported only by the ocelot switch
family, which by design does not support cross-chip bridging, the
current match functions are at best a guess and have not been confirmed
in any way to do anything relevant in a multi-switch topology.
Drop the code and make the notifiers match only on the targeted switch
port.
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dsa_slave_change_mtu() calls dsa_port_mtu_change() twice:
- it sends a cross-chip notifier with the MTU of the CPU port which is
used to update the DSA links.
- it sends one targeted MTU notifier which is supposed to only match the
user port on which we are changing the MTU. The "propagate_upstream"
variable is used here to bypass the cross-chip notifier system from
switch.c
But due to a mistake, the second, targeted notifier matches not only on
the user port, but also on the DSA link which is a member of the same
switch, if that exists.
And because the DSA links of the entire dst were programmed in a
previous round to the largest_mtu via a "propagate_upstream == true"
notification, then the dsa_port_mtu_change(propagate_upstream == false)
call that is immediately upcoming will break the MTU on the one DSA link
which is chip-wise local to the dp whose MTU is changing right now.
Example given this daisy chain topology:
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ cpu ] [ user ] [ user ] [ dsa ] [ user ]
[ x ] [ ] [ ] [ x ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
ip link set sw0p1 mtu 9000
ip link set sw1p1 mtu 9000 # at this stage, sw0p1 and sw1p1 can talk
# to one another using jumbo frames
ip link set sw0p2 mtu 1500 # this programs the sw0p3 DSA link first to
# the largest_mtu of 9000, then reprograms it to
# 1500 with the "propagate_upstream == false"
# notifier, breaking communication between
# sw0p1 and sw1p1
To escape from this situation, make the targeted match really match on a
single port - the user port, and rename the "propagate_upstream"
variable to "targeted_match" to clarify the intention and avoid future
issues.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we have a cross-chip topology like this:
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ cpu ] [ user ] [ user ] [ dsa ] [ user ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
and we issue the following commands:
1. ip link set sw0p1 mtu 1700
2. ip link set sw1p1 mtu 1600
we notice the following happening:
Command 1. emits a non-targeted MTU notifier for the CPU port (sw0p0)
with the largest_mtu calculated across switch 0, of 1700. This matches
sw0p0, sw0p3 and sw1p4 (all CPU ports and DSA links).
Then, it emits a targeted MTU notifier for the user port (sw0p1), again
with MTU 1700 (this doesn't matter).
Command 2. emits a non-targeted MTU notifier for the CPU port (sw0p0)
with the largest_mtu calculated across switch 1, of 1600. This matches
the same group of ports as above, and decreases the MTU for the CPU port
and the DSA links from 1700 to 1600.
As a result, the sw0p1 user port can no longer communicate with its CPU
port at MTU 1700.
To address this, we should calculate the largest_mtu across all switches
that may share a CPU port, and only emit MTU notifiers with that value.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the notifier for adding a multicast MAC address matches on
the targeted port and on all DSA links in the system, be they upstream
or downstream links.
This leads to a considerable amount of useless traffic.
Consider this daisy chain topology, and a MDB add notifier emitted on
sw0p0. It matches on sw0p0, sw0p3, sw1p3 and sw2p4.
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ x ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ x ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
But switch 0 has no reason to send the multicast traffic for that MAC
address on sw0p3, which is how it reaches switches 1 and 2. Those
switches don't expect, according to the user configuration, to receive
this multicast address from switch 1, and they will drop it anyway,
because the only valid destination is the port they received it on.
They only need to configure themselves to deliver that multicast address
_towards_ switch 1, where the MDB entry is installed.
Similarly, switch 1 should not send this multicast traffic towards
sw1p3, because that is how it reaches switch 2.
With this change, the heat map for this MDB notifier changes as follows:
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Now the mdb notifier behaves the same as the fdb notifier.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The difference between dsa_is_user_port and dsa_port_is_user is that the
former needs to look up the list of ports of the DSA switch tree in
order to find the struct dsa_port, while the latter directly receives it
as an argument.
dsa_is_user_port is already in widespread use and has its place, so
there isn't any chance of converting all callers to a single form.
But being able to do:
dsa_port_is_user(dp)
instead of
dsa_is_user_port(dp->ds, dp->index)
is much more efficient too, especially when the "dp" comes from an
iterator over the DSA switch tree - this reduces the complexity from
quadratic to linear.
Move these helpers from dsa2.c to include/net/dsa.h so that others can
use them too.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cross-chip notifiers work by comparing each ds->index against the
info->sw_index value from the notifier. The ds->index is retrieved from
the device tree dsa,member property.
If a single tree cross-chip topology does not declare unique switch IDs,
this will result in hard-to-debug issues/voodoo effects such as the
cross-chip notifier for one switch port also matching the port with the
same number from another switch.
Check in dsa_switch_parse_member_of() whether the DSA switch tree
contains a DSA switch with the index we're preparing to add, before
actually adding it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We got multiple reports that multi_chunk_sendfile test
case from tls selftest fails. This was sort of expected,
as the original fix was never applied (see it in the first
Link:). The test in question uses sendfile() with count
larger than the size of the underlying file. This will
make splice set MSG_MORE on all sendpage calls, meaning
TLS will never close and flush the last partial record.
Eric seem to have addressed a similar problem in
commit 35f9c09fe9 ("tcp: tcp_sendpages() should call tcp_push() once")
by introducing MSG_SENDPAGE_NOTLAST. Unlike MSG_MORE
MSG_SENDPAGE_NOTLAST is not set on the last call
of a "pipefull" of data (PIPE_DEF_BUFFERS == 16,
so every 16 pages or whenever we run out of data).
Having a break every 16 pages should be fine, TLS
can pack exactly 4 pages into a record, so for
aligned reads there should be no difference,
unaligned may see one extra record per sendpage().
Sticking to TCP semantics seems preferable to modifying
splice, but we can revisit it if real life scenarios
show a regression.
Reported-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Link: https://lore.kernel.org/netdev/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/
Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only care about exclusive or of those, so pass that directly.
Makes life simpler for callers as well...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can do that more or less safely, since the parent is
held locked all along. Yes, somebody might observe the
object via dcache, only to have it disappear afterwards,
but there's really no good way to prevent that. It won't
race with other bind(2) or attempts to move the sucker
elsewhere, or put something else in its place - locked
parent prevents that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Final preparations for doing unlink on failure past the successful
mknod. We can't hold ->bindlock over ->mknod() or ->unlink(), since
either might do sb_start_write() (e.g. on overlayfs). However, we
can do it while holding filesystem and VFS locks - doing
kern_path_create()
vfs_mknod()
grab ->bindlock
if u->addr had been set
drop ->bindlock
done_path_create
return -EINVAL
else
assign the address to socket
drop ->bindlock
done_path_create
return 0
would be deadlock-free. Here we massage unix_bind_bsd() to that
form. We are still doing equivalent transformations.
Next commit will *not* be an equivalent transformation - it will
add a call of vfs_unlink() before done_path_create() in "alread bound"
case.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
unix_bind_bsd() and unix_bind_abstract() respectively.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do get some duplication that way, but it's minor compared to
parts that are different. What we get is an ability to change
locking in BSD case without making failure exits very hard to
follow.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
makes it easier to massage; we do pay for that by extra work
(kmalloc+memcpy+kfree) in some error cases, but those are not
on the hot paths anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Duplicated logics in all bind variants (autobind, bind-to-path,
bind-to-abstract) gets taken into a common helper.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmDOZ38THG1rbEBwZW5n
dXRyb25peC5kZQAKCRCpyVqK+u3vqf9CCACcVoZsa+47buGmvQmRpyhbA+/a3LFF
h5kBpts+igtZ2HnB5ODSu+SeqphYtE+eJeLLxbaw8riig0Vz+ogNJUMoalodYIwx
B1jUTKYHg6wxDq6cAqZrG2KpOdIucXEFcugccPF0tjRthet0vZyWxbx66XWzFrp8
+UGK5H/diGihaRqguJwN3P9Mw3SYw4VWo2J2iYQ8WkGT1sy1UO4XuO9U6KqNBHmA
3a48VrgtwC4yZI7+Ar36SNMnL9P3qArE6UlqtpYDudmqpSCX08A4itM5rJ4UGSKk
PwetFCvjhjsxFs081ILaBe2Ktu3fl3i+FsZF/hwv99p45l4OCaYBEwfw
=xYM1
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-5.13-20210619' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2021-06-19
this is a pull request of 5 patches for net/master.
The first patch is by Thadeu Lima de Souza Cascardo and fixes a
potential use-after-free in the CAN broadcast manager socket, by
delaying the release of struct bcm_op after synchronize_rcu().
Oliver Hartkopp's patch fixes a similar potential user-after-free in
the CAN gateway socket by synchronizing RCU operations before removing
gw job entry.
Another patch by Oliver Hartkopp fixes a potential use-after-free in
the ISOTP socket by omitting unintended hrtimer restarts on socket
release.
Oleksij Rempel's patch for the j1939 socket fixes a potential
use-after-free by setting the SOCK_RCU_FREE flag on the socket.
The last patch is by Pavel Skripkin and fixes a use-after-free in the
ems_usb CAN driver.
All patches are intended for stable and have stable@v.k.o on Cc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions return negative ENODATA but the minus sign was left out
in the tests.
Fixes: f0dd7bf5e3 ("net/smc: Add netlink support for SMC fallback statistics")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The preempt disable around do_xdp_generic() has been introduced in
commit
bbbe211c29 ("net: rcu lock and preempt disable missing around generic xdp")
For BPF it is enough to use migrate_disable() and the code was updated
as it can be seen in commit
3c58482a38 ("bpf: Provide bpf_prog_run_pin_on_cpu() helper")
This is a leftover which was not converted.
Use migrate_disable() before invoking do_xdp_generic().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is hard to observe packet drops without increasing relevant
drop counters, here we should increase sk->sk_drops which is
a protocol-independent counter. Fortunately psock is always
associated with a struct sock, we can just use psock->sk.
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-9-xiyou.wangcong@gmail.com
sk_psock_skb_redirect() only takes skb as a parameter, we
will need to know where this skb is from, so just pass
the source psock to this function as a new parameter.
This patch prepares for the next one.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-8-xiyou.wangcong@gmail.com
Currently sk_psock_verdict_apply() is void, but it handles some
error conditions too. Its caller is impossible to learn whether
it succeeds or fails, especially sk_psock_verdict_recv().
Make it return int to indicate error cases and propagate errors
to callers properly.
Fixes: ef5659280e ("bpf, sockmap: Allow skipping sk_skb parser program")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-7-xiyou.wangcong@gmail.com
If the dest psock does not set SK_PSOCK_TX_ENABLED,
the skb can't be queued anywhere so must be dropped.
This one is found during code review.
Fixes: 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-6-xiyou.wangcong@gmail.com
When we drop skb inside sk_psock_skb_redirect(), we have to clear
its skb->_sk_redir pointer too, otherwise kfree_skb() would
misinterpret it as a valid skb->_skb_refdst and dst_release()
would eventually complain.
Fixes: e3526bb92a ("skmsg: Move sk_redir from TCP_SKB_CB to skb")
Reported-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-5-xiyou.wangcong@gmail.com
sk_psock_verdict_recv() clones the skb and uses the clone
afterward, so udp_read_sock() should free the skb after using
it, regardless of error or not.
This fixes a real kmemleak.
Fixes: d7f571188e ("udp: Implement ->read_sock() for sockmap")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-4-xiyou.wangcong@gmail.com
I tried to reuse sk_msg_wait_data() for different protocols,
but it turns out it can not be simply reused. For example,
UDP actually uses two queues to receive skb:
udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have
to check both of them to know whether we have received any
packet.
Also, UDP does not lock the sock during BH Rx path, it makes
no sense for its ->recvmsg() to lock the sock. It is always
possible for ->recvmsg() to be called before packets actually
arrive in the receive queue, we just use best effort to make
it accurate here.
Fixes: 1f5be6b3b0 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com
This replaces the overflow indirection with the new xfrm_replay_overflow
helper. After this, the 'repl' pointer in xfrm_state is no longer
needed and can be removed as well.
xfrm_replay_overflow() is added in two incarnations, one is used
when the kernel is compiled with xfrm hardware offload support enabled,
the other when its disabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adds new xfrm_replay_recheck() helper and calls it from
xfrm input path instead of the indirection.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Similar to other patches: add a new helper to avoid
an indirection.
v2: fix 'net/xfrm/xfrm_replay.c:519:13: warning: 'seq' may be used
uninitialized in this function' warning.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
replay protection is implemented using a callback structure and then
called via
x->repl->notify(), x->repl->recheck(), and so on.
all the differect functions are always built-in, so this could be direct
calls instead.
This first patch prepares for removal of the x->repl structure.
Add an enum with the three available replay modes to the xfrm_state
structure and then replace all x->repl->notify() calls by the new
xfrm_replay_notify() helper.
The helper checks the enum internally to adapt behaviour as needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Set SOCK_RCU_FREE to let RCU to call sk_destruct() on completion.
Without this patch, we will run in to j1939_can_recv() after priv was
freed by j1939_sk_release()->j1939_sk_sock_destruct()
Fixes: 25fe97cb76 ("can: j1939: move j1939_priv_put() into sk_destruct callback")
Link: https://lore.kernel.org/r/20210617130623.12705-1-o.rempel@pengutronix.de
Cc: linux-stable <stable@vger.kernel.org>
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reported-by: syzbot+bdf710cfc41c186fdff3@syzkaller.appspotmail.com
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When closing the isotp socket, the potentially running hrtimers are
canceled before removing the subscription for CAN identifiers via
can_rx_unregister().
This may lead to an unintended (re)start of a hrtimer in
isotp_rcv_cf() and isotp_rcv_fc() in the case that a CAN frame is
received by isotp_rcv() while the subscription removal is processed.
However, isotp_rcv() is called under RCU protection, so after calling
can_rx_unregister, we may call synchronize_rcu in order to wait for
any RCU read-side critical sections to finish. This prevents the
reception of CAN frames after hrtimer_cancel() and therefore the
unintended (re)start of the hrtimers.
Link: https://lore.kernel.org/r/20210618173713.2296-1-socketcan@hartkopp.net
Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can_can_gw_rcv() is called under RCU protection, so after calling
can_rx_unregister(), we have to call synchronize_rcu in order to wait
for any RCU read-side critical sections to finish before removing the
kmem_cache entry with the referenced gw job entry.
Link: https://lore.kernel.org/r/20210618173645.2238-1-socketcan@hartkopp.net
Fixes: c1aabdf379 ("can-gw: add netlink based CAN routing")
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can_rx_register() callbacks may be called concurrently to the call to
can_rx_unregister(). The callbacks and callback data, though, are
protected by RCU and the struct sock reference count.
So the callback data is really attached to the life of sk, meaning
that it should be released on sk_destruct. However, bcm_remove_op()
calls tasklet_kill(), and RCU callbacks may be called under RCU
softirq, so that cannot be used on kernels before the introduction of
HRTIMER_MODE_SOFT.
However, bcm_rx_handler() is called under RCU protection, so after
calling can_rx_unregister(), we may call synchronize_rcu() in order to
wait for any RCU read-side critical sections to finish. That is,
bcm_rx_handler() won't be called anymore for those ops. So, we only
free them, after we do that synchronize_rcu().
Fixes: ffd980f976 ("[CAN]: Add broadcast manager (bcm) protocol")
Link: https://lore.kernel.org/r/20210619161813.2098382-1-cascardo@canonical.com
Cc: linux-stable <stable@vger.kernel.org>
Reported-by: syzbot+0f7e7e5e2f4f40fa89c0@syzkaller.appspotmail.com
Reported-by: Norbert Slusarek <nslusarek@gmx.net>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh
scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bluetooth, netfilter and can.
Current release - regressions:
- mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
to fix modifying offloaded qdiscs
- lantiq: net: fix duplicated skb in rx descriptor ring
- rtnetlink: fix regression in bridge VLAN configuration, empty info
is not an error, bot-generated "fix" was not needed
- libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix
umem creation
Current release - new code bugs:
- ethtool: fix NULL pointer dereference during module EEPROM dump via
the new netlink API
- mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue
should not be visible to the stack
- mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs
- mlx5e: verify dev is present in get devlink port ndo, avoid a panic
Previous releases - regressions:
- neighbour: allow NUD_NOARP entries to be force GCed
- further fixes for fallout from reorg of WiFi locking
(staging: rtl8723bs, mac80211, cfg80211)
- skbuff: fix incorrect msg_zerocopy copy notifications
- mac80211: fix NULL ptr deref for injected rate info
- Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs
Previous releases - always broken:
- bpf: more speculative execution fixes
- netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local
- udp: fix race between close() and udp_abort() resulting in a panic
- fix out of bounds when parsing TCP options before packets
are validated (in netfilter: synproxy, tc: sch_cake and mptcp)
- mptcp: improve operation under memory pressure, add missing wake-ups
- mptcp: fix double-lock/soft lookup in subflow_error_report()
- bridge: fix races (null pointer deref and UAF) in vlan tunnel egress
- ena: fix DMA mapping function issues in XDP
- rds: fix memory leak in rds_recvmsg
Misc:
- vrf: allow larger MTUs
- icmp: don't send out ICMP messages with a source address of 0.0.0.0
- cdc_ncm: switch to eth%d interface naming
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDNP7EACgkQMUZtbf5S
IrvTmxAAgOAM9MdRl9wnYtqXKPXJ1JJtenozwt1yX6b6OG+Ns7cm6YYafU3KoZWR
KlzpvP90vRrER3RqksbMngHzvGjZKDS4LWRur7sRlJ1TBQoLrQCIbriAh07d7wlU
0nnS4J8mczTCKx78QCUYy1QBIX5TQrUbx0JQZDPoIPBjFeILW+Gx/Ghg5tUR4mhf
6icYqwIPocTXO37ZmWOzezZNVOXJF4kaQUZeuOHNe5hOtm6EeIpZbW1Xx3DIr5bd
80a/uNU7nVyos0n7jxnfVE/oelTnYbT5scZeV/PPVqZ4U113f7uex2QP23/XhGSX
lK1EhwPqPOyaNhQoihLM6Xzd4o7aZOcmF8NY96xqjC+DqdN+juvfJU+ClCZojGIj
H4bwCSaj3y2PiimfQdBiIKvYMc5d4zBdw/Dpk/gLDp4d5N638TAtuunK4Mj+TEuT
QF1qkBLIB4HFtLS0M35/twk93md/5GUdSTij2GB3fOkAWRu2m266P5m+4DigW/TB
Xm8FgKdetvxVP0Qv/p49nPEn24Ny8wCafH1x1wVTmoda2qi6j1EXMuSa0PlCdz70
Sl5FrlxdEkOpC4p+Aoc8APSoBXnOriAlpU+z/EVb8Co4JR/+Ge5zBWpsiZDVD0/K
Ay0FW3I87iyn9tw1H1Fzr9GBlVl5vWRauZFHjzl90fWakCrCzJE=
=xxUe
-----END PGP SIGNATURE-----
Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
bluetooth, netfilter and can.
Current release - regressions:
- mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
to fix modifying offloaded qdiscs
- lantiq: net: fix duplicated skb in rx descriptor ring
- rtnetlink: fix regression in bridge VLAN configuration, empty info
is not an error, bot-generated "fix" was not needed
- libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem
creation
Current release - new code bugs:
- ethtool: fix NULL pointer dereference during module EEPROM dump via
the new netlink API
- mlx5e: don't update netdev RQs with PTP-RQ, the special purpose
queue should not be visible to the stack
- mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs
- mlx5e: verify dev is present in get devlink port ndo, avoid a panic
Previous releases - regressions:
- neighbour: allow NUD_NOARP entries to be force GCed
- further fixes for fallout from reorg of WiFi locking (staging:
rtl8723bs, mac80211, cfg80211)
- skbuff: fix incorrect msg_zerocopy copy notifications
- mac80211: fix NULL ptr deref for injected rate info
- Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs
Previous releases - always broken:
- bpf: more speculative execution fixes
- netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local
- udp: fix race between close() and udp_abort() resulting in a panic
- fix out of bounds when parsing TCP options before packets are
validated (in netfilter: synproxy, tc: sch_cake and mptcp)
- mptcp: improve operation under memory pressure, add missing
wake-ups
- mptcp: fix double-lock/soft lookup in subflow_error_report()
- bridge: fix races (null pointer deref and UAF) in vlan tunnel
egress
- ena: fix DMA mapping function issues in XDP
- rds: fix memory leak in rds_recvmsg
Misc:
- vrf: allow larger MTUs
- icmp: don't send out ICMP messages with a source address of 0.0.0.0
- cdc_ncm: switch to eth%d interface naming"
* tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits)
net: ethernet: fix potential use-after-free in ec_bhf_remove
selftests/net: Add icmp.sh for testing ICMP dummy address responses
icmp: don't send out ICMP messages with a source address of 0.0.0.0
net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
net: ll_temac: Fix TX BD buffer overwrite
net: ll_temac: Add memory-barriers for TX BD access
net: ll_temac: Make sure to free skb when it is completely used
MAINTAINERS: add Guvenc as SMC maintainer
bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
bnxt_en: Fix TQM fastpath ring backing store computation
bnxt_en: Rediscover PHY capabilities after firmware reset
cxgb4: fix wrong shift.
mac80211: handle various extensible elements correctly
mac80211: reset profile_periodicity/ema_ap
cfg80211: avoid double free of PMSR request
cfg80211: make certificate generation more robust
mac80211: minstrel_ht: fix sample time check
net: qed: Fix memcpy() overflow of qed_dcbx_params()
net: cdc_eem: fix tx fixup skb leak
net: hamradio: fix memory leak in mkiss_close
...
Modify the pr_info content from int to char *, this looks more readable.
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When memcpy_to_msg() fails in virtio_transport_seqpacket_do_dequeue(),
we already set `dequeued_len` with the negative error value returned
by memcpy_to_msg().
So we can directly check `dequeued_len` value instead of using a
dedicated flag variable to skip the copy path for the rest of
fragments.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vsock_wait_data() is used only by STREAM and SEQPACKET sockets,
so let's rename it to vsock_connectible_wait_data(), using the same
nomenclature (connectible) used in other functions after the
introduction of SEQPACKET.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vsock_has_data() is used only by STREAM and SEQPACKET sockets,
so let's rename it to vsock_connectible_has_data(), using the same
nomenclature (connectible) used in other functions after the
introduction of SEQPACKET.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* a minstrel HT sample check fix
* peer measurement could double-free on races
* certificate file generation at build time could
sometimes hang
* some parameters weren't reset between connections
in mac80211
* some extensible elements were treated as non-
extensible, possibly causuing bad connections
(or failures) if the AP adds data
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmDMgxsACgkQB8qZga/f
l8QsDBAAhY5zN2LFdxkqyfPd8DJs2KpnE1osSi1qjmPOItn7K7H6hD6jN/UaaysQ
uC1ngAyuiRMtO5JgAtj58NlnDNM3IYvYxt909PnG/NAuNGW9RDebEf2H8JGKzCTR
sFW6QKOj4CkVyLwjRwu3VziI0WOaF0kNoNW2ZSr4DEHSS9siMe5svv5fLqoNxNCP
9fhS1T5xgDZfcGVdedXzilH1waqsEzPeRYY7TKGr/TZwDPksYmNsFU7mETqzKV14
OuGan7eolZ6Q869FydkKs+J9NDiHXEBVM4vt6K/2I+qHXAUUsui01l+l1oV4+XzW
Jh3eS7t72uov1UV5jVvLjrFvKOWBu1RpsO+8XfUqnTa7AvDdC5jrBTWzFYaATmqm
OtfVy3JSkd8d9eMX6Yg3/K/f9WoNPIyrR1BbbOCpWN3tHvE2xc8fWsRmS3o6VnpP
DZ/+Za4csLKl5/D1x3cqYnIaLwQdD75WNGJU10UvvyPyNsKLsw4UxfSm49gWXXBm
/fqXGS2SJX39GiHysZAnQlpRy9x03E/qkWaPZWx+xYP4zkr5MNecM5kmiINZINBA
eJPjO8Ex2ODkNf/BAmzHhIyPilRw0ypDa8K5NS/KCp2WBA01lEgyglRD0Rnz5vjD
MSP+cV38SjFoOxxiN1qtB1bSyN0EN5MdFwyrerJjmDRp/sqA5xE=
=Nh7Q
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2021-06-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A couple of straggler fixes:
* a minstrel HT sample check fix
* peer measurement could double-free on races
* certificate file generation at build time could
sometimes hang
* some parameters weren't reset between connections
in mac80211
* some extensible elements were treated as non-
extensible, possibly causuing bad connections
(or failures) if the AP adds data
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When constructing ICMP response messages, the kernel will try to pick a
suitable source address for the outgoing packet. However, if no IPv4
addresses are configured on the system at all, this will fail and we end up
producing an ICMP message with a source address of 0.0.0.0. This can happen
on a box routing IPv4 traffic via v6 nexthops, for instance.
Since 0.0.0.0 is not generally routable on the internet, there's a good
chance that such ICMP messages will never make it back to the sender of the
original packet that the ICMP message was sent in response to. This, in
turn, can create connectivity and PMTUd problems for senders. Fortunately,
RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails.
Below is a quick example reproducing this issue with network namespaces:
ip netns add ns0
ip l add type veth peer netns ns0
ip l set dev veth0 up
ip a add 10.0.0.1/24 dev veth0
ip a add fc00:dead:cafe:42::1/64 dev veth0
ip r add 10.1.0.0/24 via inet6 fc00:dead:cafe:42::2
ip -n ns0 l set dev veth0 up
ip -n ns0 a add fc00:dead:cafe:42::2/64 dev veth0
ip -n ns0 r add 10.0.0.0/24 via inet6 fc00:dead:cafe:42::1
ip netns exec ns0 sysctl -w net.ipv4.icmp_ratelimit=0
ip netns exec ns0 sysctl -w net.ipv4.ip_forward=1
tcpdump -tpni veth0 -c 2 icmp &
ping -w 1 10.1.0.1 > /dev/null
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on veth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 29, seq 1, length 64
IP 0.0.0.0 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92
2 packets captured
2 packets received by filter
0 packets dropped by kernel
With this patch the above capture changes to:
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 31127, seq 1, length 64
IP 192.0.0.8 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Juliusz Chroboczek <jch@irif.fr>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The continue statement at the end of a for-loop has no effect,
invert the if expression and remove the continue.
Addresses-Coverity: ("Continue has no effect")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify the label out_err to out to avoid the meanless kfree.
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently UDP tunnel devices on top of VLANs lose the ability
to offload UDP GSO. Widen the pass thru features from TSO
to all GSO_SOFTWARE.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new sysctl, named checksum_enabled, to control
whether DSS checksum can be enabled.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added the mib for the data checksum, MPTCP_MIB_DATACSUMERR.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the MPTCP-level checksum is enabled, on re-injections we
must spool a complete DSS, or the receive side will not be
able to compute the csum and process any data.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added three new members named data_csum, csum_len and
map_csum in struct mptcp_subflow_context, implemented a new function
named mptcp_validate_data_checksum().
If the current mapping is valid and csum is enabled traverse the later
pending skbs and compute csum incrementally till the whole mapping has
been covered. If not enough data is available in the rx queue, return
MAPPING_EMPTY - that is, no data.
Next subflow_data_ready invocation will trigger again csum computation.
When the full DSS is available, validate the csum and return to the
caller an appropriate error code, to trigger subflow reset of fallback
as required by the RFC.
Additionally:
- if the csum prevence in the DSS don't match the negotiated value e.g.
csum present, but not requested, return invalid mapping to trigger
subflow reset.
- keep some csum state, to avoid re-compute the csum on the same data
when multiple rx queue traversal are required.
- clean-up the uncompleted mapping from the receive queue on close, to
allow proper subflow disposal
Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In mptcp_parse_option, adjust the expected_opsize, and always parse the
data checksum value from the receiving DSS regardless of csum presence.
Then save it in mp_opt->csum.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new member named csum in struct mptcp_options_received.
When parsing the MP_CAPABLE with data, if the checksum is enabled,
adjust the expected_opsize. If the receiving option length matches the
length with the data checksum, get the checksum value and save it in
mp_opt->csum. And in mptcp_incoming_options, pass it to mpext->csum.
We always parse any csum/nocsum combination and delay the presence check
to later code, to allow reset if missing.
Additionally, in the TX path, use the newly introduce ext field to avoid
MPTCP csum recomputation on TCP retransmission and unneeded csum update
on when setting the data fin_flag.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new flag csum_reqd in struct mptcp_options_received, if
the flag MPTCP_CAP_CHECKSUM_REQD is set in the receiving MP_CAPABLE
suboption, set this flag.
In mptcp_sk_clone and subflow_finish_connect, if the csum_reqd flag is set,
enable the msk->csum_enabled flag.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new parameter name sk in mptcp_get_options().
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In mptcp_write_options, if the checksum is enabled, adjust the option
length and send out the data checksum with DSS suboption.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the checksum is enabled, send out the data checksum with the
MP_CAPABLE suboption with data.
In mptcp_established_options_mp, save the data checksum in
opts->ext_copy.csum. In mptcp_write_options, adjust the option length and
send it out with the MP_CAPABLE suboption.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new member csum_reqd in struct mptcp_out_options and
struct mptcp_subflow_request_sock. Initialized it with the helper
function mptcp_is_checksum_enabled().
In mptcp_write_options, if this field is enabled, send out the MP_CAPABLE
suboption with the MPTCP_CAP_CHECKSUM_REQD flag.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new member named csum in struct mptcp_ext, implemented
a new function named mptcp_generate_data_checksum().
Generate the data checksum in mptcp_sendmsg_frag, save it in mpext->csum.
Note that we must generate the csum for zero window probe, too.
Do the csum update incrementally, to avoid multiple csum computation
when the data is appended to existing skb.
Note that in a later patch we will skip unneeded csum related operation.
Changes not included here to keep the delta small.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new member named csum_enabled in struct mptcp_sock,
used a dummy mptcp_is_checksum_enabled() helper to initialize it.
Also added a new member named mptcpi_csum_enabled in struct mptcp_info
to expose the csum_enabled flag.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IETF RFC 8986 [1] includes the definition of SRv6 End.DT4, End.DT6, and
End.DT46 Behaviors.
The current SRv6 code in the Linux kernel only implements End.DT4 and
End.DT6 which can be used respectively to support IPv4-in-IPv6 and
IPv6-in-IPv6 VPNs. With End.DT4 and End.DT6 it is not possible to create a
single SRv6 VPN tunnel to carry both IPv4 and IPv6 traffic.
The proposed End.DT46 implementation is meant to support the decapsulation
of IPv4 and IPv6 traffic coming from a single SRv6 tunnel.
The implementation of the SRv6 End.DT46 Behavior in the Linux kernel
greatly simplifies the setup and operations of SRv6 VPNs.
The SRv6 End.DT46 Behavior leverages the infrastructure of SRv6 End.DT{4,6}
Behaviors implemented so far, because it makes use of a VRF device in
order to force the routing lookup into the associated routing table.
To make the End.DT46 work properly, it must be guaranteed that the routing
table used for routing lookup operations is bound to one and only one VRF
during the tunnel creation. Such constraint has to be enforced by enabling
the VRF strict_mode sysctl parameter, i.e.:
$ sysctl -wq net.vrf.strict_mode=1
Note that the same approach is used for the SRv6 End.DT4 Behavior and for
the End.DT6 Behavior in VRF mode.
The command used to instantiate an SRv6 End.DT46 Behavior is
straightforward, i.e.:
$ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100.
[1] https://www.rfc-editor.org/rfc/rfc8986.html#name-enddt46-decapsulation-and-s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Performance and impact of SRv6 End.DT46 Behavior on the SRv6 Networking
=======================================================================
This patch aims to add the SRv6 End.DT46 Behavior with minimal impact on
the performance of SRv6 End.DT4 and End.DT6 Behaviors.
In order to verify this, we tested the performance of the newly introduced
SRv6 End.DT46 Behavior and compared it with the performance of SRv6
End.DT{4,6} Behaviors, considering both the patched kernel and the kernel
before applying the End.DT46 patch (referred to as vanilla kernel).
In details, the following decapsulation scenarios were considered:
1.a) IPv6 traffic in SRv6 End.DT46 Behavior on patched kernel;
1.b) IPv4 traffic in SRv6 End.DT46 Behavior on patched kernel;
2.a) SRv6 End.DT6 Behavior (VRF mode) on patched kernel;
2.b) SRv6 End.DT4 Behavior on patched kernel;
3.a) SRv6 End.DT6 Behavior (VRF mode) on vanilla kernel (without the
End.DT46 patch);
3.b) SRv6 End.DT4 Behavior on vanilla kernel (without the End.DT46 patch).
All tests were performed on a testbed deployed on the CloudLab [2]
facilities. We considered IPv{4,6} traffic handled by a single core (at 2.4
GHz on a Xeon(R) CPU E5-2630 v3) on kernel 5.13-rc1 using packets of size
~ 100 bytes.
Scenario (1.a): average 684.70 kpps; std. dev. 0.7 kpps;
Scenario (1.b): average 711.69 kpps; std. dev. 1.2 kpps;
Scenario (2.a): average 690.70 kpps; std. dev. 1.2 kpps;
Scenario (2.b): average 722.22 kpps; std. dev. 1.7 kpps;
Scenario (3.a): average 690.02 kpps; std. dev. 2.6 kpps;
Scenario (3.b): average 721.91 kpps; std. dev. 1.2 kpps;
Considering the results for the patched kernel (1.a, 1.b, 2.a, 2.b) we
observe that the performance degradation incurred in using End.DT46 rather
than End.DT6 and End.DT4 respectively for IPv6 and IPv4 traffic is minimal,
around 0.9% and 1.5%. Such very minimal performance degradation is the
price to be paid if one prefers to use a single tunnel capable of handling
both types of traffic (IPv4 and IPv6).
Comparing the results for End.DT4 and End.DT6 under the patched and the
vanilla kernel (2.a, 2.b, 3.a, 3.b) we observe that the introduction of the
End.DT46 patch has no impact on the performance of End.DT4 and End.DT6.
[2] https://www.cloudlab.us
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix broken Tx ring validation for AF_XDP. The commit under the Fixes
tag, fixed an off-by-one error in the validation but introduced
another error. Descriptors are now let through even if they straddle a
chunk boundary which they are not allowed to do in aligned mode. Worse
is that they are let through even if they straddle the end of the umem
itself, tricking the kernel to read data outside the allowed umem
region which might or might not be mapped at all.
Fix this by reintroducing the old code, but subtract the length by one
to fix the off-by-one error that the original patch was
addressing. The test chunk != chunk_end makes sure packets do not
straddle chunk boundraries. Note that packets of zero length are
allowed in the interface, therefore the test if the length is
non-zero.
Fixes: ac31565c21 ("xsk: Fix for xp_aligned_validate_desc() when len == chunk_size")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20210618075805.14412-1-magnus.karlsson@gmail.com
The packet logger backend is unable to provide the incoming (or
outgoing) interface name because that information isn't available.
Pass the hook state, it contains the network namespace, the protocol
family, the network interfaces and other things.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Various elements are parsed with a requirement to have an
exact size, when really we should only check that they have
the minimum size that we need. Check only that and therefore
ignore any additional data that they might carry.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210618133832.cd101f8040a4.Iadf0e9b37b100c6c6e79c7b298cc657c2be9151a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If cfg80211_pmsr_process_abort() moves all the PMSR requests that
need to be freed into a local list before aborting and freeing them.
As a result, it is possible that cfg80211_pmsr_complete() will run in
parallel and free the same PMSR request.
Fix it by freeing the request in cfg80211_pmsr_complete() only if it
is still in the original pmsr list.
Cc: stable@vger.kernel.org
Fixes: 9bb7e0f24e ("cfg80211: add peer measurement with FTM initiator API")
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210618133832.1fbef57e269a.I00294bebdb0680b892f8d1d5c871fd9dbe785a5e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
We need to skip sampling if the next sample time is after jiffies, not before.
This patch fixes an issue where in some cases only very little sampling (or none
at all) is performed, leading to really bad data rates
Fixes: 80d55154b2 ("mac80211: minstrel_ht: significantly redesign the rate probing strategy")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210617103854.61875-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-17
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).
The main changes are:
1) BPF infrastructure to migrate TCP child sockets from a listener to another
in the same reuseport group/map, from Kuniyuki Iwashima.
2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.
3) Streamline error reporting changes in libbpf as planned out in the
'libbpf: the road to v1.0' effort, from Andrii Nakryiko.
4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.
5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.
6) Support new LLVM relocations in libbpf to make them more linker friendly,
also add a doc to describe the BPF backend relocations, from Yonghong Song.
7) Silence long standing KUBSAN complaints on register-based shifts in
interpreter, from Daniel Borkmann and Eric Biggers.
8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
target arch cannot be determined, from Lorenz Bauer.
9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.
10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.
11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
programs to bpf_helpers.h header, from Florent Revest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When nla_put_u32() fails, 'ret' could be 0, it should
return error code in tcf_del_walker().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new optional expression that tells you when last matching on a
given rule / set element element has happened.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo says, tprot_set is only there to detect if tprot was set to
IPPROTO_IP as that evaluates to zero. Therefore, code asserting a
different value in tprot does not need to check tprot_set.
Fixes: 935b7f6430 ("netfilter: nft_exthdr: add TCP option matching")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since user space does not generate a payload dependency, plain sctp
chunk matches cause searching in non-SCTP packets, too. Avoid this
potential mis-interpretation of packet data by checking pkt->tprot.
Fixes: 133dc203d7 ("netfilter: nft_exthdr: Support SCTP chunks")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make the gathered SMC statistics network namespace aware, for each
namespace collect an own set of statistic information.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to collect more detailed SMC fallback reason statistics and
provide these statistics to user space on the netlink interface.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the netlink function which collects the statistics information and
delivers it to the userspace.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to collect SMC statistics information. Per-cpu
variables are used to collect the statistic information for better
performance and for reducing concurrency pitfalls. The code that is
collecting statistic data is implemented in macros to increase code
reuse and readability.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While unix_may_send(sk, osk) is called while osk is locked, it appears
unix_release_sock() can overwrite unix_peer() after this lock has been
released, making KCSAN unhappy.
Changing unix_release_sock() to access/change unix_peer()
before lock is released should fix this issue.
BUG: KCSAN: data-race in unix_dgram_sendmsg / unix_release_sock
write to 0xffff88810465a338 of 8 bytes by task 20852 on cpu 1:
unix_release_sock+0x4ed/0x6e0 net/unix/af_unix.c:558
unix_release+0x2f/0x50 net/unix/af_unix.c:859
__sock_release net/socket.c:599 [inline]
sock_close+0x6c/0x150 net/socket.c:1258
__fput+0x25b/0x4e0 fs/file_table.c:280
____fput+0x11/0x20 fs/file_table.c:313
task_work_run+0xae/0x130 kernel/task_work.c:164
tracehook_notify_resume include/linux/tracehook.h:189 [inline]
exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
exit_to_user_mode_prepare+0x156/0x190 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:302
do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88810465a338 of 8 bytes by task 20888 on cpu 0:
unix_may_send net/unix/af_unix.c:189 [inline]
unix_dgram_sendmsg+0x923/0x1610 net/unix/af_unix.c:1712
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
__do_sys_sendmmsg net/socket.c:2519 [inline]
__se_sys_sendmmsg net/socket.c:2516 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xffff888167905400 -> 0x0000000000000000
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 20888 Comm: syz-executor.0 Not tainted 5.13.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like prior patch, we need to annotate lockless accesses to po->ifindex
For instance, packet_getname() is reading po->ifindex (twice) while
another thread is able to change po->ifindex.
KCSAN reported:
BUG: KCSAN: data-race in packet_do_bind / packet_getname
write to 0xffff888143ce3cbc of 4 bytes by task 25573 on cpu 1:
packet_do_bind+0x420/0x7e0 net/packet/af_packet.c:3191
packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255
__sys_bind+0x200/0x290 net/socket.c:1637
__do_sys_bind net/socket.c:1648 [inline]
__se_sys_bind net/socket.c:1646 [inline]
__x64_sys_bind+0x3d/0x50 net/socket.c:1646
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888143ce3cbc of 4 bytes by task 25578 on cpu 0:
packet_getname+0x5b/0x1a0 net/packet/af_packet.c:3525
__sys_getsockname+0x10e/0x1a0 net/socket.c:1887
__do_sys_getsockname net/socket.c:1902 [inline]
__se_sys_getsockname net/socket.c:1899 [inline]
__x64_sys_getsockname+0x3e/0x50 net/socket.c:1899
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0x00000001
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 25578 Comm: syz-executor.5 Not tainted 5.13.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_snd(), packet_snd(), packet_getname() and packet_seq_show()
can read po->num without holding a lock. This means other threads
can change po->num at the same time.
KCSAN complained about this known fact [1]
Add READ_ONCE()/WRITE_ONCE() to address the issue.
[1] BUG: KCSAN: data-race in packet_do_bind / packet_sendmsg
write to 0xffff888131a0dcc0 of 2 bytes by task 24714 on cpu 0:
packet_do_bind+0x3ab/0x7e0 net/packet/af_packet.c:3181
packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255
__sys_bind+0x200/0x290 net/socket.c:1637
__do_sys_bind net/socket.c:1648 [inline]
__se_sys_bind net/socket.c:1646 [inline]
__x64_sys_bind+0x3d/0x50 net/socket.c:1646
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888131a0dcc0 of 2 bytes by task 24719 on cpu 1:
packet_snd net/packet/af_packet.c:2899 [inline]
packet_sendmsg+0x317/0x3570 net/packet/af_packet.c:3040
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmsg+0x1ed/0x270 net/socket.c:2433
__do_sys_sendmsg net/socket.c:2442 [inline]
__se_sys_sendmsg net/socket.c:2440 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2440
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000 -> 0x1200
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 24719 Comm: syz-executor.5 Not tainted 5.13.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmDJ2AUTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCpyVqK+u3vqSfnCAClPazQgSljFZSz4KfH72RQvheGnMKN
5HmZXOn8nR7/3LMGVovtqK+ZVhnRaQdILDp1RIGjv7Itti4gc+vdmMdrX3zPBiEA
CjsHVAP7Fj7CTIUp7X/1SjHrhgjy0RzbzfreXRmUSsQVHIpIsTc5aqNPTwhhPFRU
5wSzs80PPQP5OZjPyPGc3v6gzbiNQH+XqQX/ipR29uteP7eFQBmRkbaga30Q9CrT
9I4748smzsEvbHTpOn4mcsjZJBGN3qT21GvtS/BsoXVBBdoG92Wwoo0ng33dG/FH
j4BG2Mr2KhVgFLSCnQFiM+/SQGBj0KfFdcpX3mISZINmcnksdvkmebld
=4ghS
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-5.13-20210616' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2021-06-16
this is a pull request of 4 patches for net/master.
The first patch is by Oleksij Rempel and fixes a Use-after-Free found
by syzbot in the j1939 stack.
The next patch is by Tetsuo Handa and fixes hung task detected by
syzbot in the bcm, raw and isotp protocols.
Norbert Slusarek's patch fixes a infoleak in bcm's struct
bcm_msg_head.
Pavel Skripkin's patch fixes a memory leak in the mcba_usb driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
BUG: memory leak
unreferenced object 0xffff888101bc4c00 (size 32):
comm "syz-executor527", pid 360, jiffies 4294807421 (age 19.329s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
01 00 00 00 00 00 00 00 ac 14 14 bb 00 00 02 00 ................
backtrace:
[<00000000f17c5244>] kmalloc include/linux/slab.h:558 [inline]
[<00000000f17c5244>] kzalloc include/linux/slab.h:688 [inline]
[<00000000f17c5244>] ip_mc_add1_src net/ipv4/igmp.c:1971 [inline]
[<00000000f17c5244>] ip_mc_add_src+0x95f/0xdb0 net/ipv4/igmp.c:2095
[<000000001cb99709>] ip_mc_source+0x84c/0xea0 net/ipv4/igmp.c:2416
[<0000000052cf19ed>] do_ip_setsockopt net/ipv4/ip_sockglue.c:1294 [inline]
[<0000000052cf19ed>] ip_setsockopt+0x114b/0x30c0 net/ipv4/ip_sockglue.c:1423
[<00000000477edfbc>] raw_setsockopt+0x13d/0x170 net/ipv4/raw.c:857
[<00000000e75ca9bb>] __sys_setsockopt+0x158/0x270 net/socket.c:2117
[<00000000bdb993a8>] __do_sys_setsockopt net/socket.c:2128 [inline]
[<00000000bdb993a8>] __se_sys_setsockopt net/socket.c:2125 [inline]
[<00000000bdb993a8>] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2125
[<000000006a1ffdbd>] do_syscall_64+0x40/0x80 arch/x86/entry/common.c:47
[<00000000b11467c4>] entry_SYSCALL_64_after_hwframe+0x44/0xae
In commit 24803f38a5 ("igmp: do not remove igmp souce list info when set
link down"), the ip_mc_clear_src() in ip_mc_destroy_dev() was removed,
because it was also called in igmpv3_clear_delrec().
Rough callgraph:
inetdev_destroy
-> ip_mc_destroy_dev
-> igmpv3_clear_delrec
-> ip_mc_clear_src
-> RCU_INIT_POINTER(dev->ip_ptr, NULL)
However, ip_mc_clear_src() called in igmpv3_clear_delrec() doesn't
release in_dev->mc_list->sources. And RCU_INIT_POINTER() assigns the
NULL to dev->ip_ptr. As a result, in_dev cannot be obtained through
inetdev_by_index() and then in_dev->mc_list->sources cannot be released
by ip_mc_del1_src() in the sock_close. Rough call sequence goes like:
sock_close
-> __sock_release
-> inet_release
-> ip_mc_drop_socket
-> inetdev_by_index
-> ip_mc_leave_src
-> ip_mc_del_src
-> ip_mc_del1_src
So we still need to call ip_mc_clear_src() in ip_mc_destroy_dev() to free
in_dev->mc_list->sources.
Fixes: 24803f38a5 ("igmp: do not remove igmp souce list info ...")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't check the sequence number when deciding when to update time_in in
the node table if tag removal is offloaded since the sequence number is
part of the tag. This fixes a problem where the times in the node table
wouldn't update when 0 appeared to be before or equal to seq_out when
tag removal was offloaded.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add unfront check for TCP and UDP packets before performing further
processing.
Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The osf expression only supports for TCP packets, add a upfront sanity
check to skip packet parsing if this is not a TCP packet.
Fixes: b96af92d6e ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ipv6_find_hdr() does not validate that this is an IPv6 packet. Add a
sanity check for calling ipv6_find_hdr() to make sure an IPv6 packet
is passed for parsing.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On 64-bit systems, struct bcm_msg_head has an added padding of 4 bytes between
struct members count and ival1. Even though all struct members are initialized,
the 4-byte hole will contain data from the kernel stack. This patch zeroes out
struct bcm_msg_head before usage, preventing infoleaks to userspace.
Fixes: ffd980f976 ("[CAN]: Add broadcast manager (bcm) protocol")
Link: https://lore.kernel.org/r/trinity-7c1b2e82-e34f-4885-8060-2cd7a13769ce-1623532166177@3c-app-gmx-bs52
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Norbert Slusarek <nslusarek@gmx.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
syzbot is reporting hung task at register_netdevice_notifier() [1] and
unregister_netdevice_notifier() [2], for cleanup_net() might perform
time consuming operations while CAN driver's raw/bcm/isotp modules are
calling {register,unregister}_netdevice_notifier() on each socket.
Change raw/bcm/isotp modules to call register_netdevice_notifier() from
module's __init function and call unregister_netdevice_notifier() from
module's __exit function, as with gw/j1939 modules are doing.
Link: https://syzkaller.appspot.com/bug?id=391b9498827788b3cc6830226d4ff5be87107c30 [1]
Link: https://syzkaller.appspot.com/bug?id=1724d278c83ca6e6df100a2e320c10d991cf2bce [2]
Link: https://lore.kernel.org/r/54a5f451-05ed-f977-8534-79e7aa2bcc8f@i-love.sakura.ne.jp
Cc: linux-stable <stable@vger.kernel.org>
Reported-by: syzbot <syzbot+355f8edb2ff45d5f95fa@syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+0f1827363a305f74996f@syzkaller.appspotmail.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: syzbot <syzbot+355f8edb2ff45d5f95fa@syzkaller.appspotmail.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
There has been a few errors in the ethtool reply size calculations,
most of those are hard to trigger during basic testing because of
skb size rounding up and netdev names being shorter than max.
Add a more precise check.
This change will affect the value of payload length displayed in
case of -EMSGSIZE but that should be okay, "payload length" isn't
a well defined term here.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timewait sockets have included mark since approx 4.18.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jon Maxwell <jmaxwell37@gmail.com>
Fixes: 0048369055 ("tcp: Add mark for TIMEWAIT sockets")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
with CONFIG_IPV6=n:
xfrm_output.c:140:12: warning: 'xfrm6_hdr_offset' defined but not used
Fixes: 9acf4d3b9e ("xfrm: ipv6: add xfrm6_hdr_offset helper")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The function get_net_ns_by_fd() could be inlined when NET_NS is not
enabled.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following flower filters fail to match packets:
tc filter add dev eth0 ingress protocol 0x8864 flower \
action simple sdata hi64
tc filter add dev eth0 ingress protocol 802.1q flower \
vlan_ethtype 0x8864 action simple sdata "hi vlan"
The protocol 0x8864 (ETH_P_PPP_SES) is a tunnel protocol. As such, it is
being dissected by __skb_flow_dissect and it's internal protocol is
being set as key->basic.n_proto. IOW, the existence of ETH_P_PPP_SES
tunnel is transparent to the callers of __skb_flow_dissect.
OTOH, in the filters above, cls_flower configures its key->basic.n_proto
to the ETH_P_PPP_SES value configured by the user. Matching on this key
fails because of __skb_flow_dissect "transparency" mentioned above.
In the following, I would argue that the problem lies with cls_flower,
unnessary attempting key->basic.n_proto match.
There are 3 close places in fl_set_key in cls_flower setting up
mask->basic.n_proto. They are (in reverse order of appearance in the
code) due to:
(a) No vlan is given: use TCA_FLOWER_KEY_ETH_TYPE parameter
(b) One vlan tag is given: use TCA_FLOWER_KEY_VLAN_ETH_TYPE
(c) Two vlans are given: use TCA_FLOWER_KEY_CVLAN_ETH_TYPE
The match in case (a) is unneeded because flower has no its own
eth_type parameter. It was removed by Jamal Hadi Salim in commit
488b41d020fb06428b90289f70a41210718f52b7 in iproute2. For
TCA_FLOWER_KEY_ETH_TYPE the userspace uses the generic tc filter
protocol field. Therefore the match for the case (a) is done by tc
itself.
The matches in cases (b), (c) are unneeded because the protocol will
appear in and will be matched by flow_dissector_key_vlan.vlan_tpid.
Therefore in the best case, key->basic.n_proto will try to repeat vlan
key match again.
The below patch removes mask->basic.n_proto setting and resets it to 0
in case (c).
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
to check if the attached eBPF program is capable of migrating sockets. When
the eBPF program is attached, we run it for socket migration if the
expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
net.ipv4.tcp_migrate_req is enabled.
Currently, the expected_attach_type is not enforced for the
BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
earlier idea in the commit aac3fc320d ("bpf: Post-hooks for sys_bind") to
fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().
Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to
select a new listener based on the child socket. migrating_sk varies
depending on if it is migrating a request in the accept queue or during
3WHS.
- accept_queue : sock (ESTABLISHED/SYN_RECV)
- 3WHS : request_sock (NEW_SYN_RECV)
In the eBPF program, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.
- SK_PASS with selected_sk, select it as a new listener
- SK_PASS with selected_sk NULL, fallbacks to the random selection
- SK_DROP, cancel the migration.
There is a noteworthy point. We select a listening socket in three places,
but we do not have struct skb at closing a listener or retransmitting a
SYN+ACK. On the other hand, some helper functions do not expect skb is NULL
(e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer()
in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb
temporarily before running the eBPF program.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing to select a new
listener.
We can currently get a unique ID of each listener in the userspace by
calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
This patch makes the pointer of sk available in sk_reuseport_md so that we
can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp
This patch also changes the code to call reuseport_migrate_sock() and
inet_reqsk_clone(), but unlike the other cases, we do not call
inet_reqsk_clone() right after reuseport_migrate_sock().
Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
has three kinds of refcnt:
(A) for listener itself
(B) carried by reuqest_sock
(C) sock_hold() in tcp_v[46]_rcv()
While processing the req, (A) may disappear by close(listener). Also, (B)
can disappear by accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.
For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.
In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating req after
winning the "own_req" race, we can avoid such a worst situation:
CPU 1 looks up req1
CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
As with the preceding patch, this patch changes reqsk_timer_handler() to
call reuseport_migrate_sock() and inet_reqsk_clone() to migrate in-flight
requests at retransmitting SYN+ACKs. If we can select a new listener and
clone the request, we resume setting the SYN+ACK timer for the new req. If
we can set the timer, we call inet_ehash_insert() to unhash the old req and
put the new req into ehash.
The noteworthy point here is that by unhashing the old req, another CPU
processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and
drop the final ACK packet. However, the new timer will recover this
situation.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-7-kuniyu@amazon.co.jp
When we call close() or shutdown() for listening sockets, each child socket
in the accept queue are freed at inet_csk_listen_stop(). If we can get a
new listener by reuseport_migrate_sock() and clone the request by
inet_reqsk_clone(), we try to add it into the new listener's accept queue
by inet_csk_reqsk_queue_add(). If it fails, we have to call __reqsk_free()
to call sock_put() for its listener and free the cloned request.
After putting the full socket into ehash, tcp_v[46]_syn_recv_sock() sets
NULL to ireq_opt/pktopts in struct inet_request_sock, but ipv6_opt can be
non-NULL. So, we have to set NULL to ipv6_opt of the old request to avoid
double free.
Note that we do not update req->rsk_listener and instead clone the req to
migrate because another path may reference the original request. If we
protected it by RCU, we would need to add rcu_read_lock() in many places.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-6-kuniyu@amazon.co.jp
reuseport_migrate_sock() does the same check done in
reuseport_listen_stop_sock(). If the reuseport group is capable of
migration, reuseport_migrate_sock() selects a new listener by the child
socket hash and increments the listener's sk_refcnt beforehand. Thus, if we
fail in the migration, we have to decrement it later.
We will support migration by eBPF in the later commits.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.
The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.
Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.
This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.
By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.
When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:
- we cannot carry over the eBPF prog of shutdowned sockets
- we cannot attach/detach an eBPF prog to/from listening sockets via
shutdowned sockets
Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
As noted in the following commit, a closed listener has to hold the
reference to the reuseport group for socket migration. This patch adds a
field (num_closed_socks) to struct sock_reuseport to manage closed sockets
within the same reuseport group. Moreover, this and the following commits
introduce some helper functions to split socks[] into two sections and keep
TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended
queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE
sockets from the end.
TCP_LISTEN----------> <-------TCP_CLOSE
+---+---+ --- +---+ --- +---+ --- +---+
| 0 | 1 | ... | i | ... | j | ... | k |
+---+---+ --- +---+ --- +---+ --- +---+
i = num_socks - 1
j = max_socks - num_closed_socks
k = max_socks - 1
This patch also extends reuseport_add_sock() and reuseport_grow() to
support num_closed_socks.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled or eBPF program is attached, we will be able to migrate
child sockets from a listener to another in the same reuseport group after
close() or shutdown() syscalls.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
When receiving a new connection pchan->conn won't be initialized so the
code cannot use bt_dev_dbg as the pointer to hci_dev won't be
accessible.
Fixes: 2e1614f7d6 ("Bluetooth: SMP: Convert BT_ERR/BT_DBG to bt_dev_err/bt_dev_dbg")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
After the blamed patch, __skb_flow_dissect() on the DSA master stopped
adjusting for the length of the DSA headers. This is because it was told
to adjust only if the needed_headroom is zero, aka if there is no DSA
header. Of course, the adjustment should be done only if there _is_ a
DSA header.
Modify the comment too so it is clearer.
Fixes: 4e50025129 ("net: dsa: generalize overhead for taggers that use both headers and trailers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever query statistics is issued for trap, devlink subsystem
would also fill-in statistics 'dropped' field. This field indicates
the number of packets HW dropped and failed to report to the device driver,
and thus - to the devlink subsystem itself.
In case if device driver didn't register callback for hard drop
statistics querying, 'dropped' field will be omitted and not filled.
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported slab-out-of-bounds Read in
qrtr_endpoint_post. The problem was in wrong
_size_ type:
if (len != ALIGN(size, 4) + hdrlen)
goto err;
If size from qrtr_hdr is 4294967293 (0xfffffffd), the result of
ALIGN(size, 4) will be 0. In case of len == hdrlen and size == 4294967293
in header this check won't fail and
skb_put_data(skb, data + hdrlen, size);
will read out of bound from data, which is hdrlen allocated block.
Fixes: 194ccc8829 ("net: qrtr: Support decoding incoming v2 packets")
Reported-and-tested-by: syzbot+1917d778024161609247@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current get_phy_flags() is only processed when we connect to a PHY
via a designed phy-handle property via phylink_of_phy_connect(), but if
we fallback on the internal MDIO bus created by a switch and take the
dsa_slave_phy_connect() path then we would not be processing that flag
and using it at PHY connection time.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If link mtu is too big, mld_newpack() allocates high-order page.
But most mld packets don't need high-order page.
So, it might waste unnecessary pages.
To avoid this, it makes mld_newpack() try to allocate order-0 page.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable err is being initialized with a value that is never read, the
assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver reported a use case where deleting a VRF device can hang
waiting for the refcnt to drop to 0. The root cause is that the dst
is allocated against the VRF device but cached on the loopback
device.
The use case (added to the selftests) has an implicit VRF crossing
due to the ordering of the FIB rules (lookup local is before the
l3mdev rule, but the problem occurs even if the FIB rules are
re-ordered with local after l3mdev because the VRF table does not
have a default route to terminate the lookup). The end result is
is that the FIB lookup returns the loopback device as the nexthop,
but the ingress device is in a VRF. The mismatch causes the dst
alloc against the VRF device but then cached on the loopback.
The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu):
pick the dst alloc device based the fib lookup result but with checks
that the result has a nexthop device (e.g., not an unreachable or
prohibit entry).
Fixes: f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Reported-by: Oliver Herms <oliver.peter.herms@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Outer nest for ETHTOOL_A_STRSET_STRINGSETS is not accounted for.
This may result in ETHTOOL_MSG_STRSET_GET producing a warning like:
calculated message payload length (684) not sufficient
WARNING: CPU: 0 PID: 30967 at net/ethtool/netlink.c:369 ethnl_default_doit+0x87a/0xa20
and a splat.
As usually with such warnings three conditions must be met for the warning
to trigger:
- there must be no skb size rounding up (e.g. reply_size of 684);
- string set must be per-device (so that the header gets populated);
- the device name must be at least 12 characters long.
all in all with current user space it looks like reading priv flags
is the only place this could potentially happen. Or with syzbot :)
Reported-by: syzbot+59aa77b92d06cd5a54f2@syzkaller.appspotmail.com
Fixes: 71921690f9 ("ethtool: provide string sets with STRSET_GET request")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b8392808eb ("sch_cake: add RFC 8622 LE PHB support to CAKE
diffserv handling") added the LE mark to the Bulk tin. Update the
comments to reflect the change.
Signed-off-by: Tyson Moore <tyson@tyson.me>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
When memory allocation for XFRMA_ENCAP or XFRMA_COADDR fails,
the error will not be reported because the -ENOMEM assignment
to the err variable is overwritten before. Fix this by moving
these two in front of the function so that memory allocation
failures will be reported.
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Add support to create (and destroy) interfaces via a new
rtnetlink kind "wwan". The responsible driver has to use
the new wwan_register_ops() to make this possible.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases, for example in the upcoming WWAN framework changes,
there's no natural "parent netdev", so sometimes dummy netdevs are
created or similar. IFLA_PARENT_DEV_NAME is a new attribute intended to
contain a device (sysfs, struct device) name that can be used instead
when creating a new netdev, if the rtnetlink family implements it.
As suggested by Parav Pandit, we also introduce IFLA_PARENT_DEV_BUS_NAME
attribute in order to uniquely identify a device on the system (with
bus/name pair).
ip-link(8) support for the generic parent device attributes will help
us avoid code duplication, so no other link type will require a custom
code to handle the parent name attribute. E.g. the WWAN interface
creation command will looks like this:
$ ip link add wwan0-1 parent-dev wwan0 type wwan channel-id 1
So, some future subsystem (or driver) FOO will have an interface
creation command that looks like this:
$ ip link add foo1-3 parent-dev foo1 type foo bar-id 3 baz-type Y
Below is an example of dumping link info of a random device with these
new attributes:
$ ip --details link show wlp0s20f3
4: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
state UP mode DORMANT group default qlen 1000
...
parent_bus pci parent_dev 0000:00:14.3
Co-developed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Co-developed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Suggested-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make rtnetlink ops that can create different
kinds of devices, like what we want to add to the WWAN
framework, the priv_size and setup parameters aren't quite
sufficient. Make this easier to manage by allowing ops to
allocate their own netdev via an @alloc method that gets
the tb netlink data.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled.
The reason is that nsfs tries to access ns->ops but the proc_ns_operations
is not implemented in this case.
[7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[7.670268] pgd = 32b54000
[7.670544] [00000010] *pgd=00000000
[7.671861] Internal error: Oops: 5 [#1] SMP ARM
[7.672315] Modules linked in:
[7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16
[7.673309] Hardware name: Generic DT based system
[7.673642] PC is at nsfs_evict+0x24/0x30
[7.674486] LR is at clear_inode+0x20/0x9c
The same to tun SIOCGSKNS command.
To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
disabled. Meanwhile move it to right place net/core/net_namespace.c.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Fixes: c62cce2cae ("net: add an ioctl to get a socket network namespace")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The forward declarations for the iucv_handler callbacks are causing
various compile warnings with gcc-11. Reshuffle the code to get rid
of these prototypes.
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SEQPACKET ops for loopback transport and 'seqpacket_allow()'
callback.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make transport work with SOCK_SEQPACKET add two things:
1) SOCK_SEQPACKET ops for virtio transport and 'seqpacket_allow()'
callback.
2) Handling of SEQPACKET bit: guest tries to negotiate it with vhost,
so feature will be enabled only if bit is negotiated with device.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small updates to make SOCK_SEQPACKET work:
1) Send SHUTDOWN on socket close for SEQPACKET type.
2) Set SEQPACKET packet type during send.
3) Set 'VIRTIO_VSOCK_SEQ_EOR' bit in flags for last
packet of message.
4) Implement data check function for SEQPACKET.
5) Check for max datagram size.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update current receive logic for SEQPACKET support: performs
check for packet and socket types on receive(if mismatch, then
reset connection). Increment EOR counter on receive. Also if
buffer of new packet was appended to buffer of last packet in
rx queue, update flags of last packet with flags of new packet.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Callback fetches RW packets from rx queue of socket until whole record
is copied(if user's buffer is full, user is not woken up). This is done
to not stall sender, because if we wake up user and it leaves syscall,
nobody will send credit update for rest of record, and sender will wait
for next enter of read syscall at receiver's side. So if user buffer is
full, we just send credit update and drop data.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is static and 'hdr' arg was always NULL.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to set type of packet which differs from type
of socket, so move passing type of packet from 'info' structure
to 'virtio_transport_send_pkt_info()' function. Since at current
time only stream type is supported, set it directly in 'virtio_
transport_send_pkt_info()', so callers don't need to set it.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace 'stream' to 'connection oriented' in comments as
SEQPACKET is also connection oriented.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add socket ops for SEQPACKET type and .seqpacket_allow() callback
to query transports if they support SEQPACKET. Also split path
for data check for STREAM and SEQPACKET branches.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update current stream enqueue function for SEQPACKET
support:
1) Call transport's seqpacket enqueue callback.
2) Return value from enqueue function is whole record length or error
for SOCK_SEQPACKET.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add receive loop for SEQPACKET. It looks like receive loop for
STREAM, but there are differences:
1) It doesn't call notify callbacks.
2) It doesn't care about 'SO_SNDLOWAT' and 'SO_RCVLOWAT' values, because
there is no sense for these values in SEQPACKET case.
3) It waits until whole record is received.
4) It processes and sets 'MSG_TRUNC' flag.
So to avoid extra conditions for two types of socket inside one loop, two
independent functions were created.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>