There's no need to open-code container_of() when we have
bss_from_pub(). Use it.
Change-Id: I074723717909ba211a40e6499f0c36df0e2ba4be
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The call to ieee80211_do_stop() right after will also do
synchronize_rcu() to ensure the SDATA_STATE_RUNNING bit
is cleared, so we don't need to synchronize_net() here.
Change-Id: Id9f9ffcf195002013e5d9fde288877d219780864
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_get_txq() can only be called with vif != NULL.
Remove not needed NULL test in function.
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/r/20221107161328.2883-1-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In regulatory_init_db(), when it's going to return a error, reg_pdev
should be unregistered. When load_builtin_regdb_keys() fails it doesn't
do it and makes cfg80211 can't be reload with report:
sysfs: cannot create duplicate filename '/devices/platform/regulatory.0'
...
<TASK>
dump_stack_lvl+0x79/0x9b
sysfs_warn_dup.cold+0x1c/0x29
sysfs_create_dir_ns+0x22d/0x290
kobject_add_internal+0x247/0x800
kobject_add+0x135/0x1b0
device_add+0x389/0x1be0
platform_device_add+0x28f/0x790
platform_device_register_full+0x376/0x4b0
regulatory_init+0x9a/0x4b2 [cfg80211]
cfg80211_init+0x84/0x113 [cfg80211]
...
Fixes: 90a53e4432 ("cfg80211: implement regdb signature checking")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Link: https://lore.kernel.org/r/20221109090237.214127-1-chenzhongjin@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the "channel->freq_offset" comparison is omitted in cmp_bss(),
BSS with different kHz units cannot be distinguished in the S1G Band.
So "freq_offset" should also be included in the comparison.
Signed-off-by: JUN-KYU SHIN <jk.shin@newratek.com>
Link: https://lore.kernel.org/r/20221111023301.6395-1-jk.shin@newratek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In ieee80211_lookup_key, the variable named `local` is unused if
compiled without lockdep, getting this warning:
net/mac80211/cfg.c: In function ‘ieee80211_lookup_key’:
net/mac80211/cfg.c:542:26: error: unused variable ‘local’ [-Werror=unused-variable]
struct ieee80211_local *local = sdata->local;
^~~~~
Fix it with __maybe_unused.
Fixes: 8cbf0c2ab6 ("wifi: mac80211: refactor some key code")
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Link: https://lore.kernel.org/r/20221111153622.29016-1-ihuguet@redhat.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For ACKs generated inside the I/O thread, transmit the ACK at the point of
generation. Where the ACK is generated outside of the I/O thread, it's
offloaded to the I/O thread to transmit it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Fold __rxrpc_unuse_local() into rxrpc_unuse_local() as the latter is now
the only user of the former.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When we've gone for >1RTT without transmitting a packet, we should reduce
the ssthresh and cut the cwnd by half (as suggested in RFC2861 sec 3.1).
However, we may receive ACK packets in a batch and the first of these may
cut the cwnd, preventing further transmission, and each subsequent one cuts
the cwnd yet further, reducing it to the floor and killing performance.
Fix this by moving the cwnd reset to after doing the transmission and
resetting the base time such that we don't cut the cwnd by half again for
at least another RTT.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add a tracepoint to log when a cwnd reset occurs due to lack of
transmission on a call.
Add stat counters to count transmission underflows (ie. when we have tx
window space, but sendmsg doesn't manage to keep up), cwnd resets and
transmission failures.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
None of the spinlocks in rxrpc need a _bh annotation now as the RCU
callback routines no longer take spinlocks and the bulk of the packet
wrangling code is now run in the I/O thread, not softirq context.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move the functions from the call->processor and local->processor work items
into the domain of the I/O thread.
The call event processor, now called from the I/O thread, then takes over
the job of cranking the call state machine, processing incoming packets and
transmitting DATA, ACK and ABORT packets. In a future patch,
rxrpc_send_ACK() will transmit the ACK on the spot rather than queuing it
for later transmission.
The call event processor becomes purely received-skb driven. It only
transmits things in response to events. We use "pokes" to queue a dummy
skb to make it do things like start/resume transmitting data. Timer expiry
also results in pokes.
The connection event processor, becomes similar, though crypto events, such
as dealing with CHALLENGE and RESPONSE packets is offloaded to a work item
to avoid doing crypto in the I/O thread.
The local event processor is removed and VERSION response packets are
generated directly from the packet parser. Similarly, ABORTs generated in
response to protocol errors will be transmitted immediately rather than
being pushed onto a queue for later transmission.
Changes:
========
ver #2)
- Fix a couple of introduced lock context imbalances.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Extract the peer address from an incoming packet earlier, at the beginning
of rxrpc_input_packet() and thence pass a pointer to it to various
functions that use it as part of the lookup rather than doing it on several
separate paths.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Shrink the region of rxrpc_input_packet() that is covered by the RCU read
lock so that it only covers the connection and call lookup. This means
that the bits now outside of that can call sleepable functions such as
kmalloc and sendmsg.
Also take a ref on the conn or call we're going to use before we drop the
RCU read lock.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
A received skbuff needs a ref when it gets put on a call data queue or conn
packet queue, and rxrpc_input_packet() and co. jump through a lot of hoops
to avoid double-dropping the skbuff ref so that we can avoid getting a ref
when we queue the packet.
Change this so that the skbuff ref is unconditionally dropped by the caller
of rxrpc_input_packet(). An additional ref is then taken on the packet if
it is pushed onto a queue.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove the RCU requirements from the peer's list of error targets so that
the error distributor can call sleeping functions.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move DATA transmission into the call processor work item. In a future
patch, this will be called from the I/O thread rather than being itsown
work item.
This will allow DATA transmission to be driven directly by incoming ACKs,
pokes and timers as those are processed.
The Tx queue is also split: The queue of packets prepared by sendmsg is now
places in call->tx_sendmsg and the packet dispatcher decants the packets
into call->tx_buffer as space becomes available in the transmission
window. This allows sendmsg to run ahead of the available space to try and
prevent an underflow in transmission.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Copy client call parameters into rxrpc_call earlier so that that can be
used to convey them to the connection code - which can then be offloaded to
the I/O thread.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Provide a means by which an event notification can be sent to a call such
that the I/O thread can process it rather than it being done in a separate
workqueue. This will allow a lot of locking to be removed.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Don't use sk->sk_receive_queue.lock to guard socket state changes as the
socket mutex is sufficient.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove call->input_lock as it was only necessary to serialise access to the
state stored in the rxrpc_call struct by simultaneous softirq handlers
presenting received packets. They now dump the packets in a queue and a
single process-context handler now processes them.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move the processing of error packets into the local endpoint I/O thread,
leaving the handover from UDP to merely transfer them into the local
endpoint queue.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Split the packet input handler to make the softirq side just dump the
received packet into the local endpoint receive queue and then call the
remainder of the input function from the I/O thread.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Create a per-local receive queue to which, in a future patch, all incoming
packets will be directed and an I/O thread that will process those packets
and perform all transmission of packets.
Destruction of the local endpoint is also moved from the local processor
work item (which will be absorbed) to the thread.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Split the code that handles packet reception in softirq mode as a prelude
to moving all the packet processing beyond routing to the appropriate call
and setting up of a new call out into process context.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Currently, rxrpc gives the connection's work item a ref on the connection
when it queues it - and this is called from the timer expiration function.
The problem comes when queue_work() fails (ie. the work item is already
queued): the timer routine must put the ref - but this may cause the
cleanup code to run.
This has the unfortunate effect that the cleanup code may then be run in
softirq context - which means that any spinlocks it might need to touch
have to be guarded to disable softirqs (ie. they need a "_bh" suffix).
(1) Don't give a ref to the work item.
(2) Simplify handling of service connections by adding a separate active
count so that the refcount isn't also used for this.
(3) Connection destruction for both client and service connections can
then be cleaned up by putting rxrpc_put_connection() out of line and
making a tidy progression through the destruction code (offloaded to a
workqueue if put from softirq or processor function context). The RCU
part of the cleanup then only deals with the freeing at the end.
(4) Make rxrpc_queue_conn() return immediately if it sees the active count
is -1 rather then queuing the connection.
(5) Make sure that the cleanup routine waits for the work item to
complete.
(6) Stash the rxrpc_net pointer in the conn struct so that the rcu free
routine can use it, even if the local endpoint has been freed.
Unfortunately, neither the timer nor the work item can simply get around
the problem by just using refcount_inc_not_zero() as the waits would still
have to be done, and there would still be the possibility of having to put
the ref in the expiration function.
Note the connection work item is mostly going to go away with the main
event work being transferred to the I/O thread, so the wait in (6) will
become obsolete.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Currently, rxrpc gives the call timer a ref on the call when it starts it
and this is passed along to the workqueue by the timer expiration function.
The problem comes when queue_work() fails (ie. the work item is already
queued): the timer routine must put the ref - but this may cause the
cleanup code to run.
This has the unfortunate effect that the cleanup code may then be run in
softirq context - which means that any spinlocks it might need to touch
have to be guarded to disable softirqs (ie. they need a "_bh" suffix).
Fix this by:
(1) Don't give a ref to the timer.
(2) Making the expiration function not do anything if the refcount is 0.
Note that this is more of an optimisation.
(3) Make sure that the cleanup routine waits for timer to complete.
However, this has a consequence that timer cannot give a ref to the work
item. Therefore the following fixes are also necessary:
(4) Don't give a ref to the work item.
(5) Make the work item return asap if it sees the ref count is 0.
(6) Make sure that the cleanup routine waits for the work item to
complete.
Unfortunately, neither the timer nor the work item can simply get around
the problem by just using refcount_inc_not_zero() as the waits would still
have to be done, and there would still be the possibility of having to put
the ref in the expiration function.
Note the call work item is going to go away with the work being transferred
to the I/O thread, so the wait in (6) will become obsolete.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the sk_buff tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add a tracepoint for the rxrpc_bundle refcounting.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_call tracepoint
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_conn tracepoint
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_peer tracepoint
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_local tracepoint
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Extract the code from a received rx ABORT packet much earlier and in a
single place and harmonise the responses to malformed ABORT packets.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove the rxrpc_conn_parameters struct from the rxrpc_connection and
rxrpc_bundle structs and emplace the members directly. These are going to
get filled in from the rxrpc_call struct in future.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove the _net() and knet() debugging macros in favour of tracepoints.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove the kproto() and _proto() debugging macros in preference to using
tracepoints for this.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
We should not now see duplicate packets in the recvmsg_queue. At one
point, jumbo packets that overlapped with already queued data would be
added to the queue and dealt with in recvmsg rather than in the softirq
input code, but now jumbo packets are split/cloned before being processed
by the input code and the subpackets can be discarded individually.
So remove the recvmsg-side code for handling this.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When retransmitting a packet, rxrpc_resend() shouldn't be attaching a ref
to the call to the txbuf as that pins the call and prevents the call from
clearing the packet buffer.
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: d57a3a1516 ("rxrpc: Save last ACK's SACK table rather than marking txbufs")
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Implement an in-kernel rxperf server to allow kernel-based rxrpc services
to be tested directly, unlike with AFS where they're accessed by the
fileserver when the latter decides it wants to.
This is implemented as a module that, if loaded, opens UDP port 7009
(afs3-rmtsys) and listens on it for incoming calls. Calls can be generated
using the rxperf command shipped with OpenAFS, for example.
Changes
=======
ver #2)
- Use min_t() instead of min().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: Jakub Kicinski <kuba@kernel.org>
Fix the following checker warning:
../net/rxrpc/key.c:692:9: error: subtraction of different types can't work (different address spaces)
Checker is wrong in this case, but cast the pointers to unsigned long to
avoid the warning.
Whilst we're at it, reduce the assertions to WARN_ON() and return an error.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
As the nla_nest_start() may fail with NULL returned, the return value needs
to be checked.
Fixes: ce08cd344a ("wifi: nl80211: expose link information for interfaces")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Link: https://lore.kernel.org/r/20221129014211.56558-1-yuancan@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ping_lookup() does not acquire the table spinlock, so iteration should
use hlist_nulls_for_each_entry_rcu().
Spotted during code review.
Fixes: dbca1596bb ("ping: convert to RCU lookups, get rid of rwlock")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20221129140644.28525-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Check for interval validity in all concatenation fields in
nft_set_pipapo, from Stefano Brivio.
2) Missing preemption disabled in conntrack and flowtable stat
updates, from Xin Long.
3) Fix compilation warning when CONFIG_NF_CONNTRACK_MARK=n.
Except for 3) which was a bug introduced in a recent fix in 6.1-rc
- anything else, broken for several releases.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ctnetlink: fix compilation warning after data race fixes in ct mark
netfilter: conntrack: fix using __this_cpu_add in preemptible
netfilter: flowtable_offload: fix using __this_cpu_add in preemptible
netfilter: nft_set_pipapo: Actually validate intervals in fields after the first one
====================
Link: https://lore.kernel.org/r/20221130121934.1125-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some drivers only reported the driver name in their
devlink_ops::info_get() callback. Now that the core provides this
information, the callback became empty. For such drivers, just
removing the callback would prevent the core from executing
devlink_nl_info_fill() meaning that "devlink dev info" would not
return anything.
Make the callback function optional by executing
devlink_nl_info_fill() even if devlink_ops::info_get() is NULL.
N.B.: the drivers with devlink support which previously did not
implement devlink_ops::info_get() will now also be able to report
the driver name.
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver name is available in device_driver::name. Right now,
drivers still have to report this piece of information themselves in
their devlink_ops::info_get callback function.
In order to factorize code, make devlink_nl_info_fill() add the driver
name attribute.
Now that the core sets the driver name attribute, drivers are not
supposed to call devlink_info_driver_name_put() anymore. Remove
devlink_info_driver_name_put() and clean-up all the drivers using this
function in their callback.
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Tested-by: Ido Schimmel <idosch@nvidia.com> # mlxsw
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To read from a region, user space must currently request a new snapshot of
the region and then read from that snapshot. This can sometimes be overkill
if user space only reads a tiny portion. They first create the snapshot,
then request a read, then destroy the snapshot.
For regions which have a single underlying "contents", it makes sense to
allow supporting direct reading of the region data.
Extend the DEVLINK_CMD_REGION_READ to allow direct reading from a region if
requested via the new DEVLINK_ATTR_REGION_DIRECT. If this attribute is set,
then perform a direct read instead of using a snapshot. Direct read is
mutually exclusive with DEVLINK_ATTR_REGION_SNAPSHOT_ID, and care is taken
to ensure that we reject commands which provide incorrect attributes.
Regions must enable support for direct read by implementing the .read()
callback function. If a region does not support such direct reads, a
suitable extended error message is reported.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The devlink_nl_region_read_snapshot_fill is used to copy the contents of
a snapshot into a message for reporting to userspace via the
DEVLINK_CMG_REGION_READ netlink message.
A future change is going to add support for directly reading from
a region. Almost all of the logic for this new capability is identical.
To help reduce code duplication and make this logic more generic,
refactor the function to take a cb and cb_priv pointer for doing the
actual copy.
Add a devlink_region_snapshot_fill implementation that will simply copy
the relevant chunk of the region. This does require allocating some
storage for the chunk as opposed to simply passing the correct address
forward to the devlink_nl_cmg_region_read_chunk_fill function.
A future change to implement support for directly reading from a region
without a snapshot will provide a separate implementation that calls the
newly added devlink region operation.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The devlink parameter of the devlink_nl_cmd_region_read_chunk_fill
function is not used. Remove it, to simplify the function signature.
Once removed, it is also obvious that the devlink parameter is not
necessary for the devlink_nl_region_read_snapshot_fill either.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The snapshot pointer is obtained inside of the function
devlink_nl_region_read_snapshot_fill. Simplify this function by locating
the snapshot upfront in devlink_nl_cmd_region_read_dumpit instead. This
aligns with how other netlink attributes are handled, and allows us to
exit slightly earlier if an invalid snapshot ID is provided.
It also allows us to pass the snapshot pointer directly to the
devlink_nl_region_read_snapshot_fill, and remove the now unused attrs
parameter.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Report extended error details in the devlink_nl_cmd_region_read_dumpit()
function, by using the extack structure from the netlink_callback.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The calculation for the data_size in the devlink_nl_read_snapshot_fill
function uses an if statement that is better expressed using the min_t
macro.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:
$ make LLVM=1 LLVM_IAS=1 -j
...
ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
cgroup_storage_map_btf_ids:;
^
make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1
In local_storage.c, we have
BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a9 ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.
Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.
Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
Use apply_bytes on ingress redirect, when apply_bytes is less than
the length of msg data, some data may be skipped and lost in
bpf_tcp_ingress().
If there is still data in the scatterlist that has not been consumed,
we cannot move the msg iter.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-4-git-send-email-yangpc@wangsu.com
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
flag from the msg->flags. If apply_bytes is used and it is larger than
the current data being processed, sk_psock_msg_verdict() will not be
called when sendmsg() is called again. At this time, the msg->flags is 0,
and we lost the BPF_F_INGRESS flag.
So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
redirection.
Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com
In tcp_bpf_send_verdict() redirection, the eval variable is assigned to
__SK_REDIRECT after the apply_bytes data is sent, if msg has more_data,
sock_put() will be called multiple times.
We should reset the eval variable to __SK_NONE every time more_data
starts.
This causes:
IPv4: Attempt to release TCP socket in state 1 00000000b4c925d7
------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 5 PID: 4482 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110
Modules linked in:
CPU: 5 PID: 4482 Comm: sockhash_bypass Kdump: loaded Not tainted 6.0.0 #1
Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
<TASK>
__tcp_transmit_skb+0xa1b/0xb90
? __alloc_skb+0x8c/0x1a0
? __kmalloc_node_track_caller+0x184/0x320
tcp_write_xmit+0x22a/0x1110
__tcp_push_pending_frames+0x32/0xf0
do_tcp_sendpages+0x62d/0x640
tcp_bpf_push+0xae/0x2c0
tcp_bpf_sendmsg_redir+0x260/0x410
? preempt_count_add+0x70/0xa0
tcp_bpf_send_verdict+0x386/0x4b0
tcp_bpf_sendmsg+0x21b/0x3b0
sock_sendmsg+0x58/0x70
__sys_sendto+0xfa/0x170
? xfd_validate_state+0x1d/0x80
? switch_fpu_return+0x59/0xe0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: cd9733f5d7 ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-2-git-send-email-yangpc@wangsu.com
icmp conntrack will set icmp redirects as RELATED, but icmpv6 will not
do this.
For icmpv6, only icmp errors (code <= 128) are examined for RELATED state.
ICMPV6 Redirects are part of neighbour discovery mechanism, those are
handled by marking a selected subset (e.g. neighbour solicitations) as
UNTRACKED, but not REDIRECT -- they will thus be flagged as INVALID.
Add minimal support for REDIRECTs. No parsing of neighbour options is
added for simplicity, so this will only check that we have the embeeded
original header (ND_OPT_REDIRECT_HDR), and then attempt to do a flow
lookup for this tuple.
Also extend the existing test case to cover redirects.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Reported-by: Eric Garver <eric@garver.life>
Link: https://github.com/firewalld/firewalld/issues/1046
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, the inetdev_destroy() function waits for an RCU grace period
before decrementing the refcount and freeing memory. This causes a delay
with a new RCU configuration that tries to save power, which results in the
network interface disappearing later than expected. The resulting delay
causes test failures on ChromeOS.
Refactor the code such that the refcount is freed before the grace period
and memory is freed after. With this a ChromeOS network test passes that
does 'ip netns del' and polls for an interface disappearing, now passes.
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In a networking test on ChromeOS, kernels built with the new
CONFIG_RCU_LAZY=y Kconfig option fail a networking test in the teardown
phase.
This failure may be reproduced as follows: ip netns del <name>
The CONFIG_RCU_LAZY=y Kconfig option was introduced by earlier commits
in this series for the benefit of certain battery-powered systems.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
Returning to the test failure, use of ftrace showed that this failure
cause caused by the aadded delays due to this new lazy behavior of
call_rcu() in kernels built with CONFIG_RCU_LAZY=y.
Therefore, make dst_release() use call_rcu_hurry() in order to revert
to the old test-failure-free behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: <netdev@vger.kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Add a new parameter to complement the existing 'netmask' option. The
main difference between netmask and bitmask is that bitmask takes any
arbitrary ip address as input, it does not have to be a valid netmask.
The name of the new parameter is 'bitmask'. This lets us mask out
arbitrary bits in the ip address, for example:
ipset create set1 hash:ip bitmask 255.128.255.0
ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to have distinct functions. After merge, ipv6 can avoid
protooff computation if the connection neither needs sequence adjustment
nor helper invocation -- this is the normal case.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
SCTP conntrack currently assumes that the SCTP endpoints will
probe secondary paths using HEARTBEAT before sending traffic.
But, according to RFC 9260, SCTP endpoints can send any traffic
on any of the confirmed paths after SCTP association is up.
SCTP endpoints that sends INIT will confirm all peer addresses
that upper layer configures, and the SCTP endpoint that receives
COOKIE_ECHO will only confirm the address it sent the INIT_ACK to.
So, we can have a situation where the INIT sender can start to
use secondary paths without the need to send HEARTBEAT. This patch
allows DATA/SACK packets to create new connection tracking entry.
A new state has been added to indicate that a DATA/SACK chunk has
been seen in the original direction - SCTP_CONNTRACK_DATA_SENT.
State transitions mostly follows the HEARTBEAT_SENT, except on
receiving HEARTBEAT/HEARTBEAT_ACK/DATA/SACK in the reply direction.
State transitions in original direction:
- DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks,
except that it remains in DATA_SENT on receving HEARTBEAT,
HEARTBEAT_ACK/DATA/SACK chunks
State transitions in reply direction:
- DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks,
except that it moves to HEARTBEAT_ACKED on receiving
HEARTBEAT/HEARTBEAT_ACK/DATA/SACK chunks
Note: This patch still doesn't solve the problem when the SCTP
endpoint decides to use primary paths for association establishment
but uses a secondary path for association shutdown. We still have
to depend on timeout for connections to expire in such a case.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently in nf_conntrack_hash_check_insert(), when it fails in
nf_ct_ext_valid_pre/post(), NF_CT_STAT_INC() will be called in the
preemptible context, a call trace can be triggered:
BUG: using __this_cpu_add() in preemptible [00000000] code: conntrack/1636
caller is nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
check_preemption_disabled+0xc3/0xf0
nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
ctnetlink_create_conntrack+0x3cd/0x4e0 [nf_conntrack_netlink]
ctnetlink_new_conntrack+0x1c0/0x450 [nf_conntrack_netlink]
nfnetlink_rcv_msg+0x277/0x2f0 [nfnetlink]
netlink_rcv_skb+0x50/0x100
nfnetlink_rcv+0x65/0x144 [nfnetlink]
netlink_unicast+0x1ae/0x290
netlink_sendmsg+0x257/0x4f0
sock_sendmsg+0x5f/0x70
This patch is to fix it by changing to use NF_CT_STAT_INC_ATOMIC() for
nf_ct_ext_valid_pre/post() check in nf_conntrack_hash_check_insert(),
as well as nf_ct_ext_valid_post() in __nf_conntrack_confirm().
Note that nf_ct_ext_valid_pre() check in __nf_conntrack_confirm() is
safe to use NF_CT_STAT_INC(), as it's under local_bh_disable().
Fixes: c56716c69c ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Steffen Klassert says:
====================
ipsec-next 2022-11-26
1) Remove redundant variable in esp6.
From Colin Ian King.
2) Update x->lastused for every packet. It was used only for
outgoing mobile IPv6 packets, but showed to be usefull
to check if the a SA is still in use in general.
From Antony Antony.
3) Remove unused variable in xfrm_byidx_resize.
From Leon Romanovsky.
4) Finalize extack support for xfrm.
From Sabrina Dubroca.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: add extack to xfrm_set_spdinfo
xfrm: add extack to xfrm_alloc_userspi
xfrm: add extack to xfrm_do_migrate
xfrm: add extack to xfrm_new_ae and xfrm_replay_verify_len
xfrm: add extack to xfrm_del_sa
xfrm: add extack to xfrm_add_sa_expire
xfrm: a few coding style clean ups
xfrm: Remove not-used total variable
xfrm: update x->lastused for every packet
esp6: remove redundant variable err
====================
Link: https://lore.kernel.org/r/20221126110303.1859238-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The goal of this socket option is to set different keys per listener,
see commit 1fba70e5b6 ("tcp: socket option to set TCP fast open key")
for more details about this socket option.
The only thing to do here with MPTCP is to relay the request to the
first subflow like it is already done for the other TCP_FASTOPEN* socket
options.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TCP_FASTOPEN socket option is one way for the application to tell
the kernel TFO support has to be enabled for the listener socket.
The only thing to do here with MPTCP is to relay the request to the
first subflow like it is already done for the other TCP_FASTOPEN* socket
options.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The send_synack() needs to be overridden for MPTCP to support TFO for
two reasons:
- There is not be enough space in the TCP options if the TFO cookie has
to be added in the SYN+ACK with other options: MSS (4), SACK OK (2),
Timestamps (10), Window Scale (3+1), TFO (10+2), MP_CAPABLE (12).
MPTCPv1 specs -- RFC 8684, section B.1 [1] -- suggest to drop the TCP
timestamps option in this case.
- The data received in the SYN has to be handled: the SKB can be
dequeued from the subflow sk and transferred to the MPTCP sk. Counters
need to be updated accordingly and the application can be notified at
the end because some bytes have been received.
[1] https://www.rfc-editor.org/rfc/rfc8684.html#section-b.1
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With fastopen in place, the first subflow socket is created before the
MPC handshake completes, and we need to properly initialize the sequence
numbers at MPC ACK reception.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the initial ack sequence is generated on demand whenever
it's requested and the remote key is handy. The relevant code is
scattered in different places and can lead to multiple, unneeded,
crypto operations.
This change consolidates the ack sequence generation code in a single
helper, storing the sequence number at the subflow level.
The above additionally saves a few conditional in fast-path and will
simplify the upcoming fast-open implementation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently in the receive path we don't need to discriminate
between MPC SYN, MPC SYN-ACK and MPC ACK, but soon the fastopen
code will need that info to properly track the fully established
status.
Track the exact MPC suboption type into the receive opt bitmap.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 54f1944ed6 ("mptcp: factor out mptcp_connect()"), all the
infrastructure is now in place to support the MSG_FASTOPEN flag, we
just need to call into the fastopen path in mptcp_sendmsg().
Co-developed-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - new code bugs:
- eth: mlx5e:
- use kvfree() in mlx5e_accel_fs_tcp_create()
- MACsec, fix RX data path 16 RX security channel limit
- MACsec, fix memory leak when MACsec device is deleted
- MACsec, fix update Rx secure channel active field
- MACsec, fix add Rx security association (SA) rule memory leak
Previous releases - regressions:
- wifi: cfg80211: don't allow multi-BSSID in S1G
- stmmac: set MAC's flow control register to reflect current settings
- eth: mlx5:
- E-switch, fix duplicate lag creation
- fix use-after-free when reverting termination table
Previous releases - always broken:
- ipv4: fix route deletion when nexthop info is not specified
- bpf: fix a local storage BPF map bug where the value's spin lock
field can get initialized incorrectly
- tipc: re-fetch skb cb after tipc_msg_validate
- wifi: wilc1000: fix Information Element parsing
- packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
- sctp: fix memory leak in sctp_stream_outq_migrate()
- can: can327: fix potential skb leak when netdev is down
- can: add number of missing netdev freeing on error paths
- aquantia: do not purge addresses when setting the number of rings
- wwan: iosm:
- fix incorrect skb length leading to truncated packet
- fix crash in peek throughput test due to skb UAF
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOGOdYACgkQMUZtbf5S
IrsknQ//SAoOyDOEu15YzOt8hAupLKoF6MM+D0dwwTEQZLf7IVXCjPpkKtVh7Si7
YCBoyrqrDs7vwaUrVoKY19Amwov+EYrHCpdC+c7wdZ7uxTaYfUbJJUGmxYOR179o
lV1+1Aiqg9F9C6CUsmZ5lDN2Yb7/uPDBICIV8LM+VzJAtXjurBVauyMwAxLxPOAr
cgvM+h5xzE7DXMF2z8R/mUq5MSIWoJo9hy2UwbV+f2liMTQuw9rwTbyw3d7+H/6p
xmJcBcVaABjoUEsEhld3NTlYbSEnlFgCQBfDWzf2e4y6jBxO0JepuIc7SZwJFRJY
XBqdsKcGw5RkgKbksKUgxe126XFX0SUUQEp0UkOIqe15k7eC2yO9uj1gRm6OuV4s
J94HKzHX9WNV5OQ790Ed2JyIJScztMZlNFVJ/cz2/+iKR42xJg6kaO6Rt2fobtmL
VC2cH+RfHzLl+2+7xnfzXEDgFePSBlA02Aq1wihU3zB3r7WCFHchEf9T7sGt1QF0
03R+8E3+N2tYqphPAXyDoy6kXQJTPxJHAe1FNHJlwgfieUDEWZi/Pm+uQrKIkDeo
oq9MAV2QBNSD1w4wl7cXfvicO5kBr/OP6YBqwkpsGao2jCSIgkWEX2DRrUaLczXl
5/Z+m/gCO5tAEcVRYfMivxUIon//9EIhbErVpHTlNWpRHk24eS4=
=0Lnw
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and wifi.
Current release - new code bugs:
- eth: mlx5e:
- use kvfree() in mlx5e_accel_fs_tcp_create()
- MACsec, fix RX data path 16 RX security channel limit
- MACsec, fix memory leak when MACsec device is deleted
- MACsec, fix update Rx secure channel active field
- MACsec, fix add Rx security association (SA) rule memory leak
Previous releases - regressions:
- wifi: cfg80211: don't allow multi-BSSID in S1G
- stmmac: set MAC's flow control register to reflect current settings
- eth: mlx5:
- E-switch, fix duplicate lag creation
- fix use-after-free when reverting termination table
Previous releases - always broken:
- ipv4: fix route deletion when nexthop info is not specified
- bpf: fix a local storage BPF map bug where the value's spin lock
field can get initialized incorrectly
- tipc: re-fetch skb cb after tipc_msg_validate
- wifi: wilc1000: fix Information Element parsing
- packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
- sctp: fix memory leak in sctp_stream_outq_migrate()
- can: can327: fix potential skb leak when netdev is down
- can: add number of missing netdev freeing on error paths
- aquantia: do not purge addresses when setting the number of rings
- wwan: iosm:
- fix incorrect skb length leading to truncated packet
- fix crash in peek throughput test due to skb UAF"
* tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
MAINTAINERS: Update maintainer list for chelsio drivers
ionic: update MAINTAINERS entry
sctp: fix memory leak in sctp_stream_outq_migrate()
packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
net/mlx5: Lag, Fix for loop when checking lag
Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path"
net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions
net: tun: Fix use-after-free in tun_detach()
net: mdiobus: fix unbalanced node reference count
net: hsr: Fix potential use-after-free
tipc: re-fetch skb cb after tipc_msg_validate
mptcp: fix sleep in atomic at close time
mptcp: don't orphan ssk in mptcp_close()
dsa: lan9303: Correct stat name
ipv4: Fix route deletion when nexthop info is not specified
net: wwan: iosm: fix incorrect skb length
net: wwan: iosm: fix crash in peek throughput test
net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type
net: wwan: iosm: fix kernel test robot reported error
...
As the nla_nest_start() may fail with NULL returned, the return value
should be checked.
Note that this is not a real bug, nothing will break here.
The next nla_put() will fail as well and we'll bail (and
nla_nest_cancel() can handle NULL). But we keep getting
those "fixes" so whatever.
Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20221129013934.55184-1-yuancan@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CHECKSUM_COMPLETE signals that skb->csum stores the sum over the
entire packet. It does not imply that an embedded l4 checksum
field has been validated.
Fixes: 682f048bd4 ("af_packet: pass checksum validation status to the user")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20221128161812.640098-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add an internal trace when valid beacons are received.
Signed-off-by: David Girault <david.girault@qorvo.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20221129135535.532513-3-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Let's introduce the basics for advertizing discovered PANs and
coordinators, which is:
- A new "scan" netlink message group.
- A couple of netlink command/attribute.
- The main netlink helper to send a netlink message with all the
necessary information to forward the main information to the user.
Two netlink attributes are proactively added to support future UWB
complex channels, but are not actually used yet.
Co-developed-by: David Girault <david.girault@qorvo.com>
Signed-off-by: David Girault <david.girault@qorvo.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20221129135535.532513-2-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
As the return value is not 0 only in case there is no such notifier
block registered, add a WARN_ON_ONCE() to yell about it.
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20221125100255.1786741-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Export
of_get_mac_addr_nvmem()
and rename it to
of_get_mac_address_nvmem()
in order to fit the convention followed by the existing exported helpers
of the same kind.
This way, OF compatible drivers using eg. fwnode_get_mac_address() can
do a direct call to it instead of calling of_get_mac_address() just for
the nvmem step, avoiding to repeat an expensive DT lookup which has
already been done once.
Eventually, fwnode_get_mac_address() should probably be updated to
perform the nvmem lookup directly, but as of today, nvmem cells seem not
to be supported by ACPI yet which would defeat this kind of extension.
Suggested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY4AC5QAKCRDbK58LschI
g1e0AQCfAqduTy7mYd02jDNCV0wLphNp9FbPiP9OrQT37ABpKAEA1ulj1X59bX3d
HnZdDKuatcPZT9MV5hDLM7MFJ9GjOA4=
=fNmM
-----END PGP SIGNATURE-----
Daniel Borkmann says:
====================
bpf-next 2022-11-25
We've added 101 non-merge commits during the last 11 day(s) which contain
a total of 109 files changed, 8827 insertions(+), 1129 deletions(-).
The main changes are:
1) Support for user defined BPF objects: the use case is to allocate own
objects, build own object hierarchies and use the building blocks to
build own data structures flexibly, for example, linked lists in BPF,
from Kumar Kartikeya Dwivedi.
2) Add bpf_rcu_read_{,un}lock() support for sleepable programs,
from Yonghong Song.
3) Add support storing struct task_struct objects as kptrs in maps,
from David Vernet.
4) Batch of BPF map documentation improvements, from Maryam Tahhan
and Donald Hunter.
5) Improve BPF verifier to propagate nullness information for branches
of register to register comparisons, from Eduard Zingerman.
6) Fix cgroup BPF iter infra to hold reference on the start cgroup,
from Hou Tao.
7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted
given it is not the case for them, from Alexei Starovoitov.
8) Improve BPF verifier's realloc handling to better play along with dynamic
runtime analysis tools like KASAN and friends, from Kees Cook.
9) Remove legacy libbpf mode support from bpftool,
from Sahid Orentino Ferdjaoui.
10) Rework zero-len skb redirection checks to avoid potentially breaking
existing BPF test infra users, from Stanislav Fomichev.
11) Two small refactorings which are independent and have been split out
of the XDP queueing RFC series, from Toke Høiland-Jørgensen.
12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen.
13) Documentation on how to run BPF CI without patch submission,
from Daniel Müller.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The skb is delivered to netif_rx() which may free it, after calling this,
dereferencing skb may trigger use-after-free.
Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20221125075724.27912-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As the call trace shows, the original skb was freed in tipc_msg_validate(),
and dereferencing the old skb cb would cause an use-after-free crash.
BUG: KASAN: use-after-free in tipc_crypto_rcv_complete+0x1835/0x2240 [tipc]
Call Trace:
<IRQ>
tipc_crypto_rcv_complete+0x1835/0x2240 [tipc]
tipc_crypto_rcv+0xd32/0x1ec0 [tipc]
tipc_rcv+0x744/0x1150 [tipc]
...
Allocated by task 47078:
kmem_cache_alloc_node+0x158/0x4d0
__alloc_skb+0x1c1/0x270
tipc_buf_acquire+0x1e/0xe0 [tipc]
tipc_msg_create+0x33/0x1c0 [tipc]
tipc_link_build_proto_msg+0x38a/0x2100 [tipc]
tipc_link_timeout+0x8b8/0xef0 [tipc]
tipc_node_timeout+0x2a1/0x960 [tipc]
call_timer_fn+0x2d/0x1c0
...
Freed by task 47078:
tipc_msg_validate+0x7b/0x440 [tipc]
tipc_crypto_rcv_complete+0x4b5/0x2240 [tipc]
tipc_crypto_rcv+0xd32/0x1ec0 [tipc]
tipc_rcv+0x744/0x1150 [tipc]
This patch fixes it by re-fetching the skb cb from the new allocated skb
after calling tipc_msg_validate().
Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1b1cdba762915325bd8ef9a98d0276eb673df2a5.1669398403.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All of the subflows of a msk will be orphaned in mptcp_close(), which
means the subflows are in DEAD state. After then, DATA_FIN will be sent,
and the other side will response with a DATA_ACK for this DATA_FIN.
However, if the other side still has pending data, the data that received
on these subflows will not be passed to the msk, as they are DEAD and
subflow_data_ready() will not be called in tcp_data_ready(). Therefore,
these data can't be acked, and they will be retransmitted again and again,
until timeout.
Fix this by setting ssk->sk_socket and ssk->sk_wq to 'NULL', instead of
orphaning the subflows in __mptcp_close(), as Paolo suggested.
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Biao Jiang <benbjiang@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Third, and hopefully final, set of fixes for v6.1. We are marking the
rsi driver as orphan, have some Information Element parsing fixes to
wilc1000 driver and three small fixes to the stack.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmOEnFsRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZuhVQf/c3MdS++rufVGBcplf7zG5MG8j6pJS9bx
ymuyqsqKoLuPDjLemFIbpcEkkLjt63KYNa8A9wkQ8vSfeg0wPVEym0b234q3XMb0
aAlRBurFAYdIql+aiZI2FxOlKfdLC6d/Sjr/FOe25QfX4H4FxBfZM94nobLawu/d
YBGiI29pauEFqZqBU4I3MAtVbk8/gDIVU318KQavpLuPJZNX5F83M025aokDCVj+
3l69MvphYMVvfOQKS8gLFBBJ41WpSrjIkuH0oQW1fmVc8b6Kiu9wGahOdxDm4JE6
iDiy8LdkvqpXehPMsHPqL/JLMByuJ+q/esZUNoX4vlITordNm/Nj0w==
=iNqa
-----END PGP SIGNATURE-----
Merge tag 'wireless-2022-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.1
Third, and hopefully final, set of fixes for v6.1. We are marking the
rsi driver as orphan, have some Information Element parsing fixes to
wilc1000 driver and three small fixes to the stack.
* tag 'wireless-2022-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac8021: fix possible oob access in ieee80211_get_rate_duration
wifi: cfg80211: don't allow multi-BSSID in S1G
wifi: cfg80211: fix buffer overflow in elem comparison
wifi: wilc1000: validate number of channels
wifi: wilc1000: validate length of IEEE80211_P2P_ATTR_CHANNEL_LIST attribute
wifi: wilc1000: validate length of IEEE80211_P2P_ATTR_OPER_CHANNEL attribute
wifi: wilc1000: validate pairwise and authentication suite offsets
MAINTAINERS: mark rsi wifi driver as orphan
====================
Link: https://lore.kernel.org/r/20221128113513.6F459C433C1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the kernel receives a route deletion request from user space it
tries to delete a route that matches the route attributes specified in
the request.
If only prefix information is specified in the request, the kernel
should delete the first matching FIB alias regardless of its associated
FIB info. However, an error is currently returned when the FIB info is
backed by a nexthop object:
# ip nexthop add id 1 via 192.0.2.2 dev dummy10
# ip route add 198.51.100.0/24 nhid 1
# ip route del 198.51.100.0/24
RTNETLINK answers: No such process
Fix by matching on such a FIB info when legacy nexthop attributes are
not specified in the request. An earlier check already covers the case
where a nexthop ID is specified in the request.
Add tests that cover these flows. Before the fix:
# ./fib_nexthops.sh -t ipv4_fcnal
...
TEST: Delete route when not specifying nexthop attributes [FAIL]
Tests passed: 11
Tests failed: 1
After the fix:
# ./fib_nexthops.sh -t ipv4_fcnal
...
TEST: Delete route when not specifying nexthop attributes [ OK ]
Tests passed: 12
Tests failed: 0
No regressions in other tests:
# ./fib_nexthops.sh
...
Tests passed: 228
Tests failed: 0
# ./fib_tests.sh
...
Tests passed: 186
Tests failed: 0
Cc: stable@vger.kernel.org
Reported-by: Jonas Gorski <jonas.gorski@gmail.com>
Tested-by: Jonas Gorski <jonas.gorski@gmail.com>
Fixes: 493ced1ac4 ("ipv4: Allow routes to use nexthop objects")
Fixes: 6bf92d70e6 ("net: ipv4: fix route with nexthop object delete warning")
Fixes: 61b91eb33a ("ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20221124210932.2470010-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Clean up after commit 22700f3c6d ("SUNRPC: Improve ordering of
transport processing").
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
flow_offload_queue_work() can be called in workqueue without
bh disabled, like the call trace showed in my act_ct testing,
calling NF_FLOW_TABLE_STAT_INC() there would cause a call
trace:
BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u4:0/138560
caller is flow_offload_queue_work+0xec/0x1b0 [nf_flow_table]
Workqueue: act_ct_workqueue tcf_ct_flow_table_cleanup_work [act_ct]
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
check_preemption_disabled+0xc3/0xf0
flow_offload_queue_work+0xec/0x1b0 [nf_flow_table]
nf_flow_table_iterate+0x138/0x170 [nf_flow_table]
nf_flow_table_free+0x140/0x1a0 [nf_flow_table]
tcf_ct_flow_table_cleanup_work+0x2f/0x2b0 [act_ct]
process_one_work+0x6a3/0x1030
worker_thread+0x8a/0xdf0
This patch fixes it by using NF_FLOW_TABLE_STAT_INC_ATOMIC()
instead in flow_offload_queue_work().
Note that for FLOW_CLS_REPLACE branch in flow_offload_queue_work(),
it may not be called in preemptible path, but it's good to use
NF_FLOW_TABLE_STAT_INC_ATOMIC() for all cases in
flow_offload_queue_work().
Fixes: b038177636 ("netfilter: nf_flow_table: count pending offload workqueue tasks")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Embarrassingly, nft_pipapo_insert() checked for interval validity in
the first field only.
The start_p and end_p pointers were reset to key data from the first
field at every iteration of the loop which was supposed to go over
the set fields.
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Both p9_fd_create_tcp() and p9_fd_create_unix() will call
p9_socket_open(). If the creation of p9_trans_fd fails,
p9_fd_create_tcp() and p9_fd_create_unix() will return an
error directly instead of releasing the cscoket, which will
result in a socket leak.
This patch adds sock_release() to fix the leak issue.
Fixes: 6b18662e23 ("9p connect fixes")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
ACKed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In S1G beacon frames there shouldn't be multi-BSSID elements
since that's not supported, remove that to avoid a potential
integer underflow and/or misparsing the frames due to the
different length of the fixed part of the frame.
While at it, initialize non_tx_data so we don't send garbage
values to the user (even if it doesn't seem to matter now.)
Reported-and-tested-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: 9eaffe5078 ("cfg80211: convert S1G beacon to scan results")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For vendor elements, the code here assumes that 5 octets
are present without checking. Since the element itself is
already checked to fit, we only need to check the length.
Reported-and-tested-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: 0b8fb8235b ("cfg80211: Parsing of Multiple BSSID information in scanning")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The dev_uevent() in struct class should not be modifying the device that
is passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russ Weight <russell.h.weight@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Raed Salem <raeds@nvidia.com>
Cc: Chen Zhongjin <chenzhongjin@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Avihai Horon <avihaih@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Wang Yufen <wangyufen@huawei.com>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-pm@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221123122523.1332370-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
del_singleshot_timer_sync() used to be an optimization for deleting timers
which are not rearmed from the timer callback function.
This optimization turned out to be broken and got mapped to
del_timer_sync() about 17 years ago.
Get rid of the undocumented indirection and use del_timer_sync() directly.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.706987932@linutronix.de
The value of an arithmetic expression "n * id.data" is subject
to possible overflow due to a failure to cast operands to a larger data
type before performing arithmetic. Used macro for multiplication instead
operator for avoiding overflow.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20221122122901.22294-1-korotkov.maxim.s@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
ipsec 2022-11-23
1) Fix "disable_policy" on ipv4 early demuxP Packets after
the initial packet in a flow might be incorectly dropped
on early demux if there are no matching policies.
From Eyal Birger.
2) Fix a kernel warning in case XFRM encap type is not
available. From Eyal Birger.
3) Fix ESN wrap around for GSO to avoid a double usage of a
sequence number. From Christian Langrock.
4) Fix a send_acquire race with pfkey_register.
From Herbert Xu.
5) Fix a list corruption panic in __xfrm_state_delete().
Thomas Jarosch.
6) Fix an unchecked return value in xfrm6_init().
Chen Zhongjin.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Fix ignored return value in xfrm6_init()
xfrm: Fix oops in __xfrm_state_delete()
af_key: Fix send_acquire race with pfkey_register
xfrm: replay: Fix ESN wrap around for GSO
xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available
xfrm: fix "disable_policy" on ipv4 early demux
====================
Link: https://lore.kernel.org/r/20221123093117.434274-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix regression in ipset hash:ip with IPv4 range, from Vishwanath Pai.
This is fixing up a bug introduced in the 6.0 release.
2) The "netfilter: ipset: enforce documented limit to prevent allocating
huge memory" patch contained a wrong condition which makes impossible to
add up to 64 clashing elements to a hash:net,iface type of set while it
is the documented feature of the set type. The patch fixes the condition
and thus makes possible to add the elements while keeps preventing
allocating huge memory, from Jozsef Kadlecsik. This has been broken
for several releases.
3) Missing locking when updating the flow block list which might lead
a reader to crash. This has been broken since the introduction of the
flowtable hardware offload support.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: flowtable_offload: add missing locking
netfilter: ipset: restore allowing 64 clashing elements in hash:net,iface
netfilter: ipset: regression in ip_set_hash_ip.c
====================
Link: https://lore.kernel.org/r/20221122212814.63177-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two fixes:
- 9p now uses a variable size for its recv buffer, but every place
hadn't been updated properly to use it and some buffer overflows
have been found and needed fixing.
There's still one place where msize is incorrectly used in a safety
check (p9_check_errors), but all paths leading to it should already
be avoiding overflows and that patch took a bit more time to get
right for zero-copy requests so I'll send it for 6.2
- yet another race condition in p9_conn_cancel introduced by a
fix of a syzbot report in the same place, maybe at some point
we'll get it right without burning it all down...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmN9q0QACgkQq06b7GqY
5nAYfhAAkV4BuDAxkq1J9m2tAr32bNgSWRbwHGoiO3yOgN3OQcl0MRR910VLPaBe
rRMbggIZKp6SjKpL5CQUbIcf+ubvcHZQpglCWdJT6Cch9VAidwKB4y48V0nb9Gos
Z2lAtKCdlczmmbYY+KQ4KrFXKXkHXV7FPH7aLLaVg6VxckbRwunDMAPIQJ2wtUDq
GP+J0OBL6NGb1N1TrAx9uQMzM9Zbu/pfQ8Ik/PdwzcT5fWm6OOlICOe/ImAlFfiN
AU0+jc1c01DhQiGedEytOLWMlvcUtgbN2VoMszAuIdn3MFdRFiklQ/5IYFvEu4z4
vWeSL5ctsrgqhG0i5gNTDspzmX05amBaN7UOyQ8yvCYs9Jgv7x9EghdXA4XMX3UD
l8QcNQgthTVtI2waspAmWOV4anPjr4VJIcoLKapzAXTpd7SCJazzap/0CvXjTINK
IX4Fjeg9eKeQPsHgeLiO00zokM0LPZ+W4X6Xus1rrwzHVtde78bqokrUV/U6wVKF
UT0i06oEYlB79ySOmgqoZuFRbos7VZI6l6hLkbo+XJOPalWzP5MwexGS2JVo4ymS
Xu5DYlqnSdQQxScHdK906Ksbusa3n5jxAURqOIQxJ1UZyC7CPusVy2KA4FiZsL/x
OkYGlSg60tvsOfaNQoXOlziG2u3FJrZFjsurvgBgsgtUWEgVaU0=
=MQOZ
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.1-rc7' of https://github.com/martinetd/linux
Pull 9p fixes from Dominique Martinet:
- 9p now uses a variable size for its recv buffer, but every place
hadn't been updated properly to use it and some buffer overflows have
been found and needed fixing.
There's still one place where msize is incorrectly used in a safety
check (p9_check_errors), but all paths leading to it should already
be avoiding overflows and that patch took a bit more time to get
right for zero-copy requests so I'll send it for 6.2
- yet another race condition in p9_conn_cancel introduced by a fix for
a syzbot report in the same place. Maybe at some point we'll get it
right without burning it all down...
* tag '9p-for-6.1-rc7' of https://github.com/martinetd/linux:
9p/xen: check logical size for buffer size
9p/fd: Use P9_HDRSZ for header size
9p/fd: Fix write overflow in p9_read_work
9p/fd: fix issue of list_del corruption in p9_fd_cancel()
When holding a reader-writer spin lock we cannot sleep. Calling
setup_udp_tunnel_sock() with write lock held violates this rule, because we
end up calling percpu_down_read(), which might sleep, as syzbot reports
[1]:
__might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
Trim the writer-side critical section for sk_callback_lock down to the
minimum, so that it covers only operations on sk_user_data.
Also, when grabbing the sk_callback_lock, we always need to disable BH, as
Eric points out. Failing to do so leads to deadlocks because we acquire
sk_callback_lock in softirq context, which can get stuck waiting on us if:
1) it runs on the same CPU, or
CPU0
----
lock(clock-AF_INET6);
<Interrupt>
lock(clock-AF_INET6);
2) lock ordering leads to priority inversion
CPU0 CPU1
---- ----
lock(clock-AF_INET6);
local_irq_disable();
lock(&tcp_hashinfo.bhash[i].lock);
lock(clock-AF_INET6);
<Interrupt>
lock(&tcp_hashinfo.bhash[i].lock);
... as syzbot reports [2,3]. Use the _bh variants for write_(un)lock.
[1] https://lore.kernel.org/netdev/0000000000004e78ec05eda79749@google.com/
[2] https://lore.kernel.org/netdev/000000000000e38b6605eda76f98@google.com/
[3] https://lore.kernel.org/netdev/000000000000dfa31e05eda76f75@google.com/
v2:
- Check and set sk_user_data while holding sk_callback_lock for both
L2TP encapsulation types (IP and UDP) (Tetsuo)
Cc: Tom Parkin <tparkin@katalix.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: b68777d54f ("l2tp: Serialize access to sk_user_data with sk_callback_lock")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+703d9e154b3b58277261@syzkaller.appspotmail.com
Reported-by: syzbot+50680ced9e98a61f7698@syzkaller.appspotmail.com
Reported-by: syzbot+de987172bb74a381879b@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
trans_xen did not check the data fits into the buffer before copying
from the xen ring, but we probably should.
Add a check that just skips the request and return an error to
userspace if it did not fit
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Link: https://lkml.kernel.org/r/20221118135542.63400-1-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
The last remnants in dsa_priv.h are a netlink-related definition for
which we create a new header, and DSA_MAX_NUM_OFFLOADING_BRIDGES which
is only used from dsa.c, so move it there.
Some inclusions need to be adjusted now that we no longer have headers
included transitively from dsa_priv.h.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tag_8021q definitions are all over the place. Some are exported to
linux/dsa/8021q.h (visible by DSA core, taggers, switch drivers and
everyone else), and some are in dsa_priv.h.
Move the structures that don't need external visibility into tag_8021q.c,
and the ones which don't need the world or switch drivers to see them
into tag_8021q.h.
We also have the tag_8021q.h inclusion from switch.c, which is basically
the entire reason why tag_8021q.c was built into DSA in commit
8b6e638b4b ("net: dsa: build tag_8021q.c as part of DSA core").
I still don't know how to better deal with that, so leave it alone.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are some definitions in dsa_priv.h which are only used from
slave.c. So move them to slave.c.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous change moved the code into the larger file (dsa2.c) to
minimize the delta. Rename that now to dsa.c, and create dsa.h, where
all related definitions from dsa_priv.h go.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no longer a meaningful distinction between what goes into
dsa2.c and what goes into dsa.c. Merge the 2 into a single file.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reduce bloat in dsa_priv.h by moving the cross-chip notifier data
structures to switch.h.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There isn't an intuitive place for these 2 cross-chip notifier functions
according to the function-to-file classification based on names
(dsa_switch_*() goes to switch.c), but I consider these to be part of
the cross-chip notifier handling, therefore part of switch.c. Move them
there to reduce bloat in dsa2.c (the place where all code with no better
place to go goes).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reduce code bloat in dsa_priv.h by moving the prototypes exported by
switch.h into their own header file.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It would be nice if tagging protocol drivers could include just the
header they need, since they are (mostly) data path and isolated from
most of the other DSA core code does.
Create a tag.c and a tag.h file which are meant to support tagging
protocol drivers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Minimize the use of the bloated dsa_priv.h by moving the prototypes
exported by slave.c to their own header file.
This is just approximate to get the code structure right. There are some
interdependencies with static inline code left in dsa_priv.h, so leave
slave.h included from there for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Minimize the use of the bloated dsa_priv.h by moving the prototypes
exported by master.c to their own header file.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Minimize the use of the bloated dsa_priv.h by moving the prototypes
exported by port.c to their own header file.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The code that needed further refactoring into dedicated functions in
dsa2.c was left aside. Move it now to devlink.c, and make dsa2.c stop
including net/devlink.h.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simplify dsa_switch_teardown() to remove the NULL checking for
ds->devlink.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dsa.c and dsa2.c are bloated with too much off-topic code. Identify all
code related to devlink and move it to a new devlink.c file.
Steer clear of the dsa_priv.h dumping ground antipattern and create a
dedicated devlink.h for it, which will be included only by the C files
which need it. Usage of dsa_priv.h will be minimized in later patches.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no reason that I can see why the no-op tagging protocol should
be registered manually, so make it a module and make all drivers which
have any sort of reference to DSA_TAG_PROTO_NONE select it.
Note that I don't know if ksz_get_tag_protocol() really needs this,
or if it's just the logic which is poorly written. All switches seem to
have their own tagging protocol, and DSA_TAG_PROTO_NONE is just a
fallback that never gets used.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dsa.o and dsa2.o are linked into the same dsa_core.o, there is no reason
to export this symbol when its only caller is local.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reporter health_state is set twice to error in devlink_health_report().
Remove second time as it is redundant.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/1668933412-5498-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit caused the following build break when CONFIG_IPV6 was
disabled
net/ipv4/tcp_input.c: In function ‘tcp_syn_flood_action’:
include/net/sock.h:387:37: error: ‘const struct sock_common’ has no member named ‘skc_v6_rcv_saddr’; did you mean ‘skc_rcv_saddr’?
Fix by using inet6_rcv_saddr() macro which handles this situation
nicely.
Fixes: d9282e48c6 ("tcp: Add listening address to SYN flood message")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
CC: Matthieu Baerts <matthieu.baerts@tessares.net>
CC: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221122184158.170798-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In fib_table_insert(), if the alias was already inserted, but node not
exist, the error code should be set before return from error handling path.
Fixes: a6c76c17df ("ipv4: Notify route after insertion to the routing table")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/r/20221120072838.2167047-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we call connect() for a socket bound to a wildcard address, we update
saddr locklessly. However, it could result in a data race; another thread
iterating over bhash might see a corrupted address.
Let's update saddr under the bhash bucket's lock.
Fixes: 3df80d9320 ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we call inet_bhash2_update_saddr(), prev_saddr is always non-NULL.
Let's remove the unnecessary test.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When connect() is called on a socket bound to the wildcard address,
we change the socket's saddr to a local address. If the socket
fails to connect() to the destination, we have to reset the saddr.
However, when an error occurs after inet_hash6?_connect() in
(dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave
the socket bound to the address.
From the user's point of view, whether saddr is reset or not varies
with errno. Let's fix this inconsistent behaviour.
Note that after this patch, the repro [0] will trigger the WARN_ON()
in inet_csk_get_port() again, but this patch is not buggy and rather
fixes a bug papering over the bhash2's bug for which we need another
fix.
For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
by this sequence:
s1 = socket()
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.bind(('127.0.0.1', 10000))
s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
# or s1.connect(('127.0.0.1', 10000))
s2 = socket()
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.bind(('0.0.0.0', 10000))
s2.connect(('127.0.0.1', 10000)) # -EADDRNOTAVAIL
s2.listen(32) # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
[0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
Fixes: 3df80d9320 ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The __nft_expr_type_get() function returns NULL on error. It never
returns error pointers.
Fixes: 3a07327d10 ("netfilter: nft_inner: support for inner tunnel header matching")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_flow_table_block_setup and the driver TC_SETUP_FT call can modify the flow
block cb list while they are being traversed elsewhere, causing a crash.
Add a write lock around the calls to protect readers
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Chad Monroe <chad.monroe@smartrg.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The commit 510841da1f ("netfilter: ipset: enforce documented limit to
prevent allocating huge memory") was too strict and prevented to add up to
64 clashing elements to a hash:net,iface type of set. This patch fixes the
issue and now the type behaves as documented.
Fixes: 510841da1f ("netfilter: ipset: enforce documented limit to prevent allocating huge memory")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The call, kobject_get_ownership(), does not modify the kobject passed
into it, so make it const. This propagates down into the kobj_type
function callbacks so make the kobject passed into them also const,
ensuring that nothing in the kobject is being changed here.
This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20221121094649.1556002-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit f11fe1dae1 ("net/sched: Make NET_ACT_CT depends on NF_NAT"),
it fixed the build failure when NF_NAT is m and NET_ACT_CT is y by
adding depends on NF_NAT for NET_ACT_CT. However, it would also cause
NET_ACT_CT cannot be built without NF_NAT, which is not expected. This
patch fixes it by changing to use "(!NF_NAT || NF_NAT)" as the depend.
Fixes: f11fe1dae1 ("net/sched: Make NET_ACT_CT depends on NF_NAT")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/b6386f28d1ba34721795fb776a91cbdabb203447.1668807183.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When IPv6 module initializing in xfrm6_init(), register_pernet_subsys()
is possible to fail but its return value is ignored.
If IPv6 initialization fails later and xfrm6_fini() is called,
removing uninitialized list in xfrm6_net_ops will cause null-ptr-deref:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 1 PID: 330 Comm: insmod
RIP: 0010:unregister_pernet_operations+0xc9/0x450
Call Trace:
<TASK>
unregister_pernet_subsys+0x31/0x3e
xfrm6_fini+0x16/0x30 [ipv6]
ip6_route_init+0xcd/0x128 [ipv6]
inet6_init+0x29c/0x602 [ipv6]
...
Fix it by catching the error return value of register_pernet_subsys().
Fixes: 8d068875ca ("xfrm: make gc_thresh configurable in all namespaces")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Kernel 5.14 added a new "byseq" index to speed
up xfrm_state lookups by sequence number in commit
fe9f1d8779 ("xfrm: add state hashtable keyed by seq")
While the patch was thorough, the function pfkey_send_new_mapping()
in net/af_key.c also modifies x->km.seq and never added
the current xfrm_state to the "byseq" index.
This leads to the following kernel Ooops:
BUG: kernel NULL pointer dereference, address: 0000000000000000
..
RIP: 0010:__xfrm_state_delete+0xc9/0x1c0
..
Call Trace:
<TASK>
xfrm_state_delete+0x1e/0x40
xfrm_del_sa+0xb0/0x110 [xfrm_user]
xfrm_user_rcv_msg+0x12d/0x270 [xfrm_user]
? remove_entity_load_avg+0x8a/0xa0
? copy_to_user_state_extra+0x580/0x580 [xfrm_user]
netlink_rcv_skb+0x51/0x100
xfrm_netlink_rcv+0x30/0x50 [xfrm_user]
netlink_unicast+0x1a6/0x270
netlink_sendmsg+0x22a/0x480
__sys_sendto+0x1a6/0x1c0
? __audit_syscall_entry+0xd8/0x130
? __audit_syscall_exit+0x249/0x2b0
__x64_sys_sendto+0x23/0x30
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
Exact location of the crash in __xfrm_state_delete():
if (x->km.seq)
hlist_del_rcu(&x->byseq);
The hlist_node "byseq" was never populated.
The bug only triggers if a new NAT traversal mapping (changed IP or port)
is detected in esp_input_done2() / esp6_input_done2(), which in turn
indirectly calls pfkey_send_new_mapping() *if* the kernel is compiled
with CONFIG_NET_KEY and "af_key" is active.
The PF_KEYv2 message SADB_X_NAT_T_NEW_MAPPING is not part of RFC 2367.
Various implementations have been examined how they handle
the "sadb_msg_seq" header field:
- racoon (Android): does not process SADB_X_NAT_T_NEW_MAPPING
- strongswan: does not care about sadb_msg_seq
- openswan: does not care about sadb_msg_seq
There is no standard how PF_KEYv2 sadb_msg_seq should be populated
for SADB_X_NAT_T_NEW_MAPPING and it's not used in popular
implementations either. Herbert Xu suggested we should just
use the current km.seq value as is. This fixes the root cause
of the oops since we no longer modify km.seq itself.
The update of "km.seq" looks like a copy'n'paste error
from pfkey_send_acquire(). SADB_ACQUIRE must indeed assign a unique km.seq
number according to RFC 2367. It has been verified that code paths
involving pfkey_send_acquire() don't cause the same Oops.
PF_KEYv2 SADB_X_NAT_T_NEW_MAPPING support was originally added here:
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
commit cbc3488685b20e7b2a98ad387a1a816aada569d8
Author: Derek Atkins <derek@ihtfp.com>
AuthorDate: Wed Apr 2 13:21:02 2003 -0800
[IPSEC]: Implement UDP Encapsulation framework.
In particular, implement ESPinUDP encapsulation for IPsec
Nat Traversal.
A note on triggering the bug: I was not able to trigger it using VMs.
There is one VPN using a high latency link on our production VPN server
that triggered it like once a day though.
Link: https://github.com/strongswan/strongswan/issues/992
Link: https://lore.kernel.org/netdev/00959f33ee52c4b3b0084d42c430418e502db554.1652340703.git.antony.antony@secunet.com/T/
Link: https://lore.kernel.org/netdev/20221027142455.3975224-1-chenzhihao@meizu.com/T/
Fixes: fe9f1d8779 ("xfrm: add state hashtable keyed by seq")
Reported-by: Roth Mark <rothm@mail.com>
Reported-by: Zhihao Chen <chenzhihao@meizu.com>
Tested-by: Roth Mark <rothm@mail.com>
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Acked-by: Antony Antony <antony.antony@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If skb_linearize() fails in tipc_disc_rcv(), we need to free the skb instead of
handle it.
Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address hash values")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Link: https://lore.kernel.org/r/20221119072832.7896-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
One extra conn_get() is needed in tipc_conn_alloc(), as after
tipc_conn_alloc() is called, tipc_conn_close() may free this
con before deferencing it in tipc_topsrv_accept():
tipc_conn_alloc();
newsk = newsock->sk;
<---- tipc_conn_close();
write_lock_bh(&sk->sk_callback_lock);
newsk->sk_data_ready = tipc_conn_data_ready;
Then an uaf issue can be triggered:
BUG: KASAN: use-after-free in tipc_topsrv_accept+0x1e7/0x370 [tipc]
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
print_report+0x178/0x4b0
kasan_report+0x8c/0x100
kasan_check_range+0x179/0x1e0
tipc_topsrv_accept+0x1e7/0x370 [tipc]
process_one_work+0x6a3/0x1030
worker_thread+0x8a/0xdf0
This patch fixes it by holding it in tipc_conn_alloc(), then after
all accessing in tipc_topsrv_accept() releasing it. Note when does
this in tipc_topsrv_kern_subscr(), as tipc_conn_rcv_sub() returns
0 or -1 only, we don't need to check for "> 0".
Fixes: c5fa7b3cf3 ("tipc: introduce new TIPC server infrastructure")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A crash was reported by Wei Chen:
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:tipc_conn_close+0x12/0x100
Call Trace:
tipc_topsrv_exit_net+0x139/0x320
ops_exit_list.isra.9+0x49/0x80
cleanup_net+0x31a/0x540
process_one_work+0x3fa/0x9f0
worker_thread+0x42/0x5c0
It was caused by !con->sock in tipc_conn_close(). In tipc_topsrv_accept(),
con is allocated in conn_idr then its sock is set:
con = tipc_conn_alloc();
... <----[1]
con->sock = newsock;
If tipc_conn_close() is called in anytime of [1], the null-pointer-def
is triggered by con->sock->sk due to con->sock is not yet set.
This patch fixes it by moving the con->sock setting to tipc_conn_alloc()
under s->idr_lock. So that con->sock can never be NULL when getting the
con from s->conn_idr. It will be also safer to move con->server and flag
CF_CONNECTED setting under s->idr_lock, as they should all be set before
tipc_conn_alloc() is called.
Fixes: c5fa7b3cf3 ("tipc: introduce new TIPC server infrastructure")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I was mistaken how atomic64_try_cmpxchg(&sk_cookie, &res, new)
is working.
I was assuming @res would contain the final sk_cookie value,
regardless of the success of our cmpxchg()
We could do something like:
if (atomic64_try_cmpxchg(&sk_cookie, &res, new)
res = new;
But we can avoid a conditional and read sk_cookie again.
atomic64_cmpxchg(&sk_cookie, res, new);
res = atomic64_read(&sk_cookie);
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527347 ("Error handling issues")
Fixes: 4ebf802cf1 ("net: __sock_gen_cookie() cleanup")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221118043843.3703186-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To avoid potentially breaking existing users.
Both mac/no-mac cases have to be amended; mac_header >= network_header
is not enough (verified with a new test, see next patch).
Fixes: fd18942244 ("bpf: Don't redirect packets with invalid pkt_len")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221121180340.1983627-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This patch introduced a regression: commit 48596a8ddc ("netfilter:
ipset: Fix adding an IPv4 range containing more than 2^31 addresses")
The variable e.ip is passed to adtfn() function which finally adds the
ip address to the set. The patch above refactored the for loop and moved
e.ip = htonl(ip) to the end of the for loop.
What this means is that if the value of "ip" changes between the first
assignement of e.ip and the forloop, then e.ip is pointing to a
different ip address than "ip".
Test case:
$ ipset create jdtest_tmp hash:ip family inet hashsize 2048 maxelem 100000
$ ipset add jdtest_tmp 10.0.1.1/31
ipset v6.21.1: Element cannot be added to the set: it's already added
The value of ip gets updated inside the "else if (tb[IPSET_ATTR_CIDR])"
block but e.ip is still pointing to the old value.
Fixes: 48596a8ddc ("netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses")
Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Endpoint creation can fail for a number of reasons; in case of failure
append the error number to the extended ack message, using a newly
introduced generic helper.
Additionally let mptcp_pm_nl_append_new_local_addr() report different
error reasons.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When endpoint creation fails, we need to free the newly allocated
entry and eventually destroy the paired mptcp listener socket.
Consolidate such action in a single point let all the errors path
reach it.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We assume the correct errno is -EADDRINUSE when sk->sk_prot->get_port()
fails, so some ->get_port() functions return just 1 on failure and the
callers return -EADDRINUSE instead.
However, mptcp_get_port() can return -EINVAL. Let's not ignore the error.
Note the only exception is inet_autobind(), all of whose callers return
-EAGAIN instead.
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zero-length arrays are deprecated[1] and are being replaced with
flexible array members in support of the ongoing efforts to tighten the
FORTIFY_SOURCE routines on memcpy(), correctly instrument array indexing
with UBSAN_BOUNDS, and to globally enable -fstrict-flex-arrays=3.
Replace zero-length array with flexible-array member in struct key_vector.
This results in no differences in binary output.
[1] https://github.com/KSPP/linux/issues/78
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains late Netfilter fixes for net:
1) Use READ_ONCE()/WRITE_ONCE() to update ct->mark, from Daniel Xu.
Not reported by syzbot, but I presume KASAN would trigger post
a splat on this. This is a rather old issue, predating git history.
2) Do not set up extensions for set element with end interval flag
set on. This leads to bogusly skipping this elements as expired
when listing the set/map to userspace as well as increasing
memory consumpton when stateful expressions are used. This issue
has been present since 4.18, when timeout support for rbtree set
was added.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
to the verifier that it should enforce that a BPF program passes it a
"safe", trusted pointer. Currently, "safe" means that the pointer is
either PTR_TO_CTX, or is refcounted. There may be cases, however, where
the kernel passes a BPF program a safe / trusted pointer to an object
that the BPF program wishes to use as a kptr, but because the object
does not yet have a ref_obj_id from the perspective of the verifier, the
program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
kfunc.
The solution is to expand the set of pointers that are considered
trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
with these pointers without getting rejected by the verifier.
There is already a PTR_UNTRUSTED flag that is set in some scenarios,
such as when a BPF program reads a kptr directly from a map
without performing a bpf_kptr_xchg() call. These pointers of course can
and should be rejected by the verifier. Unfortunately, however,
PTR_UNTRUSTED does not cover all the cases for safety that need to
be addressed to adequately protect kfuncs. Specifically, pointers
obtained by a BPF program "walking" a struct are _not_ considered
PTR_UNTRUSTED according to BPF. For example, say that we were to add a
kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
that a task was unsafe to pass to a kfunc, the verifier would mistakenly
allow the following unsafe BPF program to be loaded:
SEC("tp_btf/task_newtask")
int BPF_PROG(unsafe_acquire_task,
struct task_struct *task,
u64 clone_flags)
{
struct task_struct *acquired, *nested;
nested = task->last_wakee;
/* Would not be rejected by the verifier. */
acquired = bpf_task_acquire(nested);
if (!acquired)
return 0;
bpf_task_release(acquired);
return 0;
}
To address this, this patch defines a new type flag called PTR_TRUSTED
which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
passed directly from the kernel as a tracepoint or struct_ops callback
argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
pointer is no longer PTR_TRUSTED. From the example above, the struct
task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
obtained from 'task->last_wakee' is not PTR_TRUSTED.
A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
and then another patch will add selftests to validate.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Keeps traffic sent to the switch within link speed limits
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20221116080734.44013-6-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I've added a flex array to struct nlmsghdr in
commit 738136a0e3 ("netlink: split up copies in the ack construction")
to allow accessing the data easily. It leads to warnings with clang,
if user space wraps this structure into another struct and the flex
array is not at the end of the container.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/all/20221114023927.GA685@u2004-local/
Link: https://lore.kernel.org/r/20221118033903.1651026-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add documentation for BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
including kernel version introduced, usage and examples.
Add documentation that describes XDP_REDIRECT.
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221115144921.165483-1-mtahhan@redhat.com
Elements with an end interval flag set on do not store extensions. The
global set definition is currently setting on the timeout and stateful
expression for end interval elements.
This leads to skipping end interval elements from the set->ops->walk()
path as the expired check bogusly reports true.
Moreover, do not set up stateful expressions for elements with end
interval flag set on since this is never used.
Fixes: 65038428b2 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cleanup hardcoded header sizes to use P9_HDRSZ instead of '7'
Link: https://lkml.kernel.org/r/20221117091159.31533-4-guozihua@huawei.com
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
[Dominique: commit message adjusted to make sense after offset size
adjustment got removed]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
This error was reported while fuzzing:
BUG: KASAN: slab-out-of-bounds in _copy_to_iter+0xd35/0x1190
Write of size 4043 at addr ffff888008724eb1 by task kworker/1:1/24
CPU: 1 PID: 24 Comm: kworker/1:1 Not tainted 6.1.0-rc5-00002-g1adf73218daa-dirty #223
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
Workqueue: events p9_read_work
Call Trace:
<TASK>
dump_stack_lvl+0x4c/0x64
print_report+0x178/0x4b0
kasan_report+0xae/0x130
kasan_check_range+0x179/0x1e0
memcpy+0x38/0x60
_copy_to_iter+0xd35/0x1190
copy_page_to_iter+0x1d5/0xb00
pipe_read+0x3a1/0xd90
__kernel_read+0x2a5/0x760
kernel_read+0x47/0x60
p9_read_work+0x463/0x780
process_one_work+0x91d/0x1300
worker_thread+0x8c/0x1210
kthread+0x280/0x330
ret_from_fork+0x22/0x30
</TASK>
Allocated by task 457:
kasan_save_stack+0x1c/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x7e/0x90
__kmalloc+0x59/0x140
p9_fcall_init.isra.11+0x5d/0x1c0
p9_tag_alloc+0x251/0x550
p9_client_prepare_req+0x162/0x350
p9_client_rpc+0x18d/0xa90
p9_client_create+0x670/0x14e0
v9fs_session_init+0x1fd/0x14f0
v9fs_mount+0xd7/0xaf0
legacy_get_tree+0xf3/0x1f0
vfs_get_tree+0x86/0x2c0
path_mount+0x885/0x1940
do_mount+0xec/0x100
__x64_sys_mount+0x1a0/0x1e0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This BUG pops up when trying to reproduce
https://syzkaller.appspot.com/bug?id=6c7cd46c7bdd0e86f95d26ec3153208ad186f9fa
The callstack is different but the issue is valid and re-producable with
the same re-producer in the link.
The root cause of this issue is that we check the size of the message
received against the msize of the client in p9_read_work. However, it
turns out that capacity is no longer consistent with msize. Thus,
the message size should be checked against sdata capacity.
As the msize is non-consistant with the capacity of the tag and as we
are now checking message size against capacity directly, there is no
point checking message size against msize. So remove it.
Link: https://lkml.kernel.org/r/20221117091159.31533-2-guozihua@huawei.com
Link: https://lkml.kernel.org/r/20221117091159.31533-3-guozihua@huawei.com
Reported-by: syzbot+0f89bd13eaceccc0e126@syzkaller.appspotmail.com
Fixes: 60ece0833b ("net/9p: allocate appropriate reduced message buffers")
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
[Dominique: squash patches 1 & 2 and fix size including header part]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Syz reported the following issue:
kernel BUG at lib/list_debug.c:53!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:__list_del_entry_valid.cold+0x5c/0x72
Call Trace:
<TASK>
p9_fd_cancel+0xb1/0x270
p9_client_rpc+0x8ea/0xba0
p9_client_create+0x9c0/0xed0
v9fs_session_init+0x1e0/0x1620
v9fs_mount+0xba/0xb80
legacy_get_tree+0x103/0x200
vfs_get_tree+0x89/0x2d0
path_mount+0x4c0/0x1ac0
__x64_sys_mount+0x33b/0x430
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
The process is as follows:
Thread A: Thread B:
p9_poll_workfn() p9_client_create()
... ...
p9_conn_cancel() p9_fd_cancel()
list_del() ...
... list_del() //list_del
corruption
There is no lock protection when deleting list in p9_conn_cancel(). After
deleting list in Thread A, thread B will delete the same list again. It
will cause issue of list_del corruption.
Setting req->status to REQ_STATUS_ERROR under lock prevents other
cleanup paths from trying to manipulate req_list.
The other thread can safely check req->status because it still holds a
reference to req at this point.
Link: https://lkml.kernel.org/r/20221110122606.383352-1-shaozhengchao@huawei.com
Fixes: 52f1c45dde ("9p: trans_fd/p9_conn_cancel: drop client lock earlier")
Reported-by: syzbot+9b69b8d10ab4a7d88056@syzkaller.appspotmail.com
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
[Dominique: add description of the fix in commit message]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Previously we leverage NCI_UNREG and the lock inside nci_close_device to
prevent the race condition between opening a device and closing a
device. However, it still has problem because a failed opening command
will erase the NCI_UNREG flag and allow another opening command to
bypass the status checking.
This fix corrects that by making sure the NCI_UNREG is held.
Reported-by: syzbot+43475bf3cfbd6e41f5b7@syzkaller.appspotmail.com
Fixes: 48b71a9e66 ("NFC: add NCI_UNREG flag to eliminate the race")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller of del_timer_sync must prevent restarting of the timer, If
we have no this synchronization, there is a small probability that the
cancellation will not be successful.
And syzbot report the fellowing crash:
==================================================================
BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:929 [inline]
BUG: KASAN: use-after-free in enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
Write at addr f9ff000024df6058 by task syz-fuzzer/2256
Pointer tag: [f9], memory tag: [fe]
CPU: 1 PID: 2256 Comm: syz-fuzzer Not tainted 6.1.0-rc5-syzkaller-00008-
ge01d50cbd6ee #0
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xe0/0xf0 arch/arm64/kernel/stacktrace.c:156
dump_backtrace arch/arm64/kernel/stacktrace.c:162 [inline]
show_stack+0x18/0x40 arch/arm64/kernel/stacktrace.c:163
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x68/0x84 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x1a8/0x4a0 mm/kasan/report.c:395
kasan_report+0x94/0xb4 mm/kasan/report.c:495
__do_kernel_fault+0x164/0x1e0 arch/arm64/mm/fault.c:320
do_bad_area arch/arm64/mm/fault.c:473 [inline]
do_tag_check_fault+0x78/0x8c arch/arm64/mm/fault.c:749
do_mem_abort+0x44/0x94 arch/arm64/mm/fault.c:825
el1_abort+0x40/0x60 arch/arm64/kernel/entry-common.c:367
el1h_64_sync_handler+0xd8/0xe4 arch/arm64/kernel/entry-common.c:427
el1h_64_sync+0x64/0x68 arch/arm64/kernel/entry.S:576
hlist_add_head include/linux/list.h:929 [inline]
enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
mod_timer+0x14/0x20 kernel/time/timer.c:1161
mrp_periodic_timer_arm net/802/mrp.c:614 [inline]
mrp_periodic_timer+0xa0/0xc0 net/802/mrp.c:627
call_timer_fn.constprop.0+0x24/0x80 kernel/time/timer.c:1474
expire_timers+0x98/0xc4 kernel/time/timer.c:1519
To fix it, we can introduce a new active flags to make sure the timer will
not restart.
Reported-by: syzbot+6fd64001c20aa99e34a4@syzkaller.appspotmail.com
Signed-off-by: Schspa Shi <schspa@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmN0mecACgkQ+7dXa6fL
C2sHIg/8Ce7qGGeGBlDXtpHumvLAOKh/Aq45GC68M6ZyScckIOXUYKSHnM+3XWln
lUcuidsTyjHK7YRXzSLYZ56WREbr3GelEF1jh4iTt+UxBUn0gNV5C5PJQBL4KWcR
qU5ZVlnbOHb19XzRsWSMjAhdAulwnG7nhvuKB+Zo1mx7VVLKED9DCQ3A+Mm92Dm9
DjV/skzh0PI1zTBMdM7DolydftizGOO6yiFjhd8ktzIZj0TdifB63bVbMgoasQrO
SO+ZT9F4l/swiv12qgsYUH09SFdp2fdX3gt4Lj1JhwmXq/iSmeiHnvpJdbUW7RiI
jDKLiE0XpXwix29P26gq+Sdsb2pd7Ni3+YY6Qteln7RekIe6g3g2xwOLbkIgpTvc
NcwAbn0CL+ZLLts/udeIKHL5+ux1HZAAaHwftgysCHULLvxP4NrIcWrzVqzOLA9V
SH2MI6fYuOUbpgsoGxgv0+8f7MOrgUW2C9ySHjZfUPAqhAG8DinqX9gdUiYPMVF9
GrqrETmmaJCxuQaFQ8BsWKkP+KLfsi3UfEOwv7HdHjOqvCKSXOg5hHjv6Ctpp5Kv
yTj2BcAHjKB8FtuJ4h30UzVLhF1gquud+lPiO3Gbvbjhp1G1EQPwtcYjUamUre+w
lxZ870Z/jEbqEOrH7Xh1VvoKcgtp8Y9idJeU+8VNLL/r96nCF2E=
=xivU
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-next-20221116' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix oops and missing config conditionals
The patches that were pulled into net-next previously[1] had some issues
that this patchset fixes:
(1) Fix missing IPV6 config conditionals.
(2) Fix an oops caused by calling udpv6_sendmsg() directly on an AF_INET
socket.
(3) Fix the validation of network addresses on entry to socket functions
so that we don't allow an AF_INET6 address if we've selected an
AF_INET transport socket.
Link: https://lore.kernel.org/r/166794587113.2389296.16484814996876530222.stgit@warthog.procyon.org.uk/ [1]
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After rxrpc_unbundle_conn() has removed a connection from a bundle, it
checks to see if there are any conns with available channels and, if not,
removes and attempts to destroy the bundle.
Whilst it does check after grabbing client_bundles_lock that there are no
connections attached, this races with rxrpc_look_up_bundle() retrieving the
bundle, but not attaching a connection for the connection to be attached
later.
There is therefore a window in which the bundle can get destroyed before we
manage to attach a new connection to it.
Fix this by adding an "active" counter to struct rxrpc_bundle:
(1) rxrpc_connect_call() obtains an active count by prepping/looking up a
bundle and ditches it before returning.
(2) If, during rxrpc_connect_call(), a connection is added to the bundle,
this obtains an active count, which is held until the connection is
discarded.
(3) rxrpc_deactivate_bundle() is created to drop an active count on a
bundle and destroy it when the active count reaches 0. The active
count is checked inside client_bundles_lock() to prevent a race with
rxrpc_look_up_bundle().
(4) rxrpc_unbundle_conn() then calls rxrpc_deactivate_bundle().
Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-15975
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: zdi-disclosures@trendmicro.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan reported a new warning after my recent patch:
New smatch warnings:
net/core/dev.c:6409 napi_disable() error: uninitialized symbol 'new'.
Indeed, we must first wait for STATE_SCHED and STATE_NPSVC to be cleared,
to make sure @new variable has been initialized properly.
Fixes: 4ffa1d1c68 ("net: adopt try_cmpxchg() in napi_{enable|disable}()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "pkt" was supposed to have been deleted in a previous patch. It
leads to an uninitialized variable bug.
Fixes: 72f0c6fb05 ("rxrpc: Allocate ACK records at proposal and queue for transmission")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error handling for if skb_copy_bits() fails was accidentally deleted
so the rxkad_decrypt_ticket() function is not called.
Fixes: 5d7edbc923 ("rxrpc: Get rid of the Rx ring")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Second set of patches for v6.2. Only driver patches this time, nothing
really special. Unused platform data support was removed from wl1251
and rtw89 got WoWLAN support.
Major changes:
ath11k
* support configuring channel dwell time during scan
rtw89
* new dynamic header firmware format support
* Wake-over-WLAN support
rtl8xxxu
* enable IEEE80211_HW_SUPPORT_FAST_XMIT
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmN3ZagRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZuk/Af+M9IXyWmis9behHlz2U+4PJ/+JF8VYIck
6Xup+FM0O1w6aHhAL8gNOhLl62jpYB0Exou0YizInGvwjjQk2Gtuq60KshkbN2zL
YK3sbcS66PyE19gfIQKJTW/s+9aBncRuvEb6HTZx5iF96mVFPeBDPduUGQksGs5/
Egf03eAqzdcIIcejKG1SUn2o0fLJ6jds0VftAas+chKI2Z5ADQWPzugvU0vlvXVE
oq3ws7Sg0yGcwOOttFwaXMBPrjmCty5uc/ELMhjvSLAsKULXymjZkrU1m5tanhGu
N1MDqviwCK7PJC/wQ2xo7YKNY/1LUc0xbwfbEXQITfbbPyjOz+ynQQ==
=wKQZ
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2022-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.2
Second set of patches for v6.2. Only driver patches this time, nothing
really special. Unused platform data support was removed from wl1251
and rtw89 got WoWLAN support.
Major changes:
ath11k
* support configuring channel dwell time during scan
rtw89
* new dynamic header firmware format support
* Wake-over-WLAN support
rtl8xxxu
* enable IEEE80211_HW_SUPPORT_FAST_XMIT
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add sysctl net.sctp.l3mdev_accept to allow
users to change the pernet global l3mdev_accept.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch at first adds a pernet global l3mdev_accept to decide if it
accepts the packets from a l3mdev when a SCTP socket doesn't bind to
any interface. It's set to 1 to avoid any possible incompatible issue,
and in next patch, a sysctl will be introduced to allow to change it.
Then similar to inet/udp_sk_bound_dev_eq(), sctp_sk_bound_dev_eq() is
added to check either dif or sdif is equal to sk_bound_dev_if, and to
check sid is 0 or l3mdev_accept is 1 if sk_bound_dev_if is not set.
This function is used to match a association or a endpoint, namely
called by sctp_addrs_lookup_transport() and sctp_endpoint_is_match().
All functions that needs updating are:
sctp_rcv():
asoc:
__sctp_rcv_lookup()
__sctp_lookup_association() -> sctp_addrs_lookup_transport()
__sctp_rcv_lookup_harder()
__sctp_rcv_init_lookup()
__sctp_lookup_association() -> sctp_addrs_lookup_transport()
__sctp_rcv_walk_lookup()
__sctp_rcv_asconf_lookup()
__sctp_lookup_association() -> sctp_addrs_lookup_transport()
ep:
__sctp_rcv_lookup_endpoint() -> sctp_endpoint_is_match()
sctp_connect():
sctp_endpoint_is_peeled_off()
__sctp_lookup_association()
sctp_has_association()
sctp_lookup_association()
__sctp_lookup_association() -> sctp_addrs_lookup_transport()
sctp_diag_dump_one():
sctp_transport_lookup_process() -> sctp_addrs_lookup_transport()
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add skb_sdif function in struct sctp_af to get the enslaved device
for both ipv4 and ipv6 when adding SCTP VRF support in sctp_rcv in
the next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sctp_get_port_local(), when binding to IP and PORT, it should
also check sk_bound_dev_if to match listening sk if it's set by
SO_BINDTOIFINDEX, so that multiple sockets with the same IP and
PORT, but different sk_bound_dev_if can be listened at the same
time.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When binding to an ipv6 address, it calls ipv6_chk_addr() to check if
this address is on any dev. If a socket binds to a l3mdev but no dev
is passed to do this check, all l3mdev and slaves will be skipped and
the check will fail.
This patch is to pass the bound_dev to make sure the devices under the
same l3mdev can be returned in ipv6_chk_addr(). When the bound_dev is
not a l3mdev or l3slave, l3mdev_master_dev_rcu() will return NULL in
__ipv6_chk_addr_and_flags(), it will keep compitable with before when
NULL dev was passed.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After binding to a l3mdev, it should use the route table from the
corresponding VRF to verify the addr when binding to an address.
Note ipv6 doesn't need it, as binding to ipv6 address does not
verify the addr with route lookup.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0ff4eb3d5e ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().
As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.
Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.
Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().
Additionally, use an unsigned integer for the length counter qlen.
Fixes: 0ff4eb3d5e ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move these two macros from net/sctp/sctp.h to linux/sctp.h, so that
it will be enough to include only linux/sctp.h in nft_exthdr.c and
xt_sctp.c. It should not include "net/sctp/sctp.h" if a module does
not have a dependence on SCTP module.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Link: https://lore.kernel.org/r/ef6468a687f36da06f575c2131cd4612f6b7be88.1668526821.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the driver is able to create leaf nodes for the devlink-rate,
but is unable to set parent for them. This wasn't as issue before the
possibility to export hierarchy from the driver. After adding the export
feature, in order for the driver to supply correct hierarchy, it's
necessary for it to be able to supply a parent name to
devl_rate_leaf_create().
Introduce a new parameter 'parent_name' in devl_rate_leaf_create().
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently it's not possible to reassign the parent of the node using one
command. As the previous commit introduced a way to export entire
hierarchy from the driver, being able to modify and reassign parents
become important. This way user might easily change QoS settings without
interrupting traffic.
Example command:
devlink port function rate set pci/0000:4b:00.0/1 parent node_custom_1
This reassigns leaf node parent to node_custom_1.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Intel 100G card internal firmware hierarchy for Hierarchicial QoS is very
rigid and can't be easily removed. This requires an ability to export
default hierarchy to allow user to modify it. Currently the driver is
only able to create the 'leaf' nodes, which usually represent the vport.
This is not enough for HQoS implemented in Intel hardware.
Introduce new function devl_rate_node_create() that allows for creation
of the devlink-rate nodes from the driver.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_weight' needs to be introduced. This attribute allows
for usage of Weighted Fair Queuing arbitration scheme among siblings.
This arbitration scheme can be used simultaneously with the strict
priority.
Introduce new attribute in devlink-rate that will allow for configuration
of Weighted Fair Queueing. New attribute is optional.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_priority' needs to be introduced. This attribute allows
for usage of strict priority arbiter among siblings. This arbitration
scheme attempts to schedule nodes based on their priority as long as the
nodes remain within their bandwidth limit.
Introduce new attribute in devlink-rate that will allow for configuration
of strict priority. New attribute is optional.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Issue a request_module() call when an attempt to change the tagging
protocol is made, either by sysfs or by device tree. In the case of
ocelot (the only driver for which the default and the alternative
tagging protocol are compiled as different modules), the user is now no
longer required to insert tag_ocelot_8021q.ko manually.
In the particular case of ocelot, this solves a problem where
tag_ocelot_8021q.ko is built as module, and this is present in the
device tree:
&mscc_felix_port4 {
dsa-tag-protocol = "ocelot-8021q";
};
&mscc_felix_port5 {
dsa-tag-protocol = "ocelot-8021q";
};
Because no one attempts to load the module into the kernel at boot time,
the switch driver will fail to probe (actually forever defer) until
someone manually inserts tag_ocelot_8021q.ko. This is now no longer
necessary and happens automatically.
Rename dsa_find_tagger_by_name() to denote the change in functionality:
there is now feature parity with dsa_tag_driver_get_by_id(), i.o.w. we
also load the module if it's missing.
Link: https://lore.kernel.org/lkml/20221027113248.420216-1-michael@walle.cc/
Suggested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28 w/ ocelot_8021q
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A future patch will introduce one more way of getting a reference on a
tagging protocl driver (by name). Rename the current method to "by_id".
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, dsa_find_tagger_by_name() uses sysfs_streq() which works both
with strings that contain \n at the end (echo ocelot > .../dsa/tagging)
and with strings that don't (printf ocelot > .../dsa/tagging).
There will be a problem once we'll want to construct the modalias string
based on which we auto-load the protocol kernel module. If the sysfs
buffer ends in a newline, we need to strip it first. This is a
preparatory patch specifically for that.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, tagging protocol drivers have a modalias of
"dsa_tag:id-<number>", where the number is one of DSA_TAG_PROTO_*_VALUE.
This modalias makes it possible for the request_module() call in
dsa_tag_driver_get() to work, given the input it has - an integer
returned by ds->ops->get_tag_protocol().
It is also possible to change tagging protocols at (pseudo-)runtime, via
sysfs or via device tree, and this works via the name string of the
tagging protocol rather than via its id (DSA_TAG_PROTO_*_VALUE).
In the latter case, there is no request_module() call, because there is
no association that the DSA core has between the string name and the ID,
to construct the modalias. The module is simply assumed to have been
inserted. This is actually slightly problematic when the tagging
protocol change should take place at probe time, since it's expected
that the dependency module should get autoloaded.
For this purpose, let's introduce a second modalias, so that the DSA
core can call request_module() by name. There is no reason to make the
modalias by name optional, so just modify the MODULE_ALIAS_DSA_TAG_DRIVER()
macro to take both the ID and the name as arguments, and generate two
modaliases behind the scenes.
Suggested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28 w/ ocelot_8021q
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's autumn cleanup time, and today's target are modaliases.
Michael says that for users of modinfo, "dsa_tag-20" is not the most
suggestive name, and recommends a change to "dsa_tag-id-20".
Andrew points out that other modaliases have a prefix delimited by
colons, so he recommends "dsa_tag:20" instead of "dsa_tag-20".
To satisfy both proposals, Florian recommends "dsa_tag:id-20".
The modaliases are not stable ABI, and the essential information
(protocol ID) is still conveyed in the new string, which
request_module() must be adapted to form.
Link: 20221027210830.3577793-1-vladimir.oltean@nxp.com
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Michael Walle <michael@walle.cc>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The DSA tagging protocol driver macros are in the public include/net/dsa.h
probably because that's also where the DSA_TAG_PROTO_*_VALUE macros are
(MODULE_ALIAS_DSA_TAG_DRIVER hinges on those macro definitions).
But there is no reason to expose these helpers to <net/dsa.h>. That
header is shared between switch drivers (drivers/net/dsa/), tagging
protocol drivers (net/dsa/tag_*.c), the DSA core (net/dsa/ sans tag_*.c),
and the rest of the world (DSA master drivers, network stack, etc).
Too much exposure.
On the other hand, net/dsa/dsa_priv.h is included only by the DSA core
and by DSA tagging protocol drivers (or IOW, "friend" modules). Also a
bit too much exposure - I've contemplated creating a new header which is
only included by tagging protocol drivers, but completely separating a
new dsa_tag_proto.h from dsa_priv.h is not immediately trivial - for
example dsa_slave_to_port() is used both from the fast path and from the
control path.
So for now, move these definitions to dsa_priv.h which at least hides
them from the world.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a user port does not have a label in device tree, and we thus
fall back to the eth%d scheme, the proper constant to use is
NET_NAME_ENUM. See also commit e9f656b7a2 ("net: ethernet: set
default assignment identifier to NET_NAME_ENUM"), which in turn quoted
commit 685343fc3b ("net: add name_assign_type netdev attribute"):
... when the kernel has given the interface a name using global
device enumeration based on order of discovery (ethX, wlanY, etc)
... are labelled NET_NAME_ENUM.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.faineli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a user port has a label in device tree, the corresponding
netdevice is, to quote include/uapi/linux/netdevice.h, "predictably
named by the kernel". This is also explicitly one of the intended use
cases for NET_NAME_PREDICTABLE, quoting 685343fc3b ("net: add
name_assign_type netdev attribute"):
NET_NAME_PREDICTABLE:
The ifname has been assigned by the kernel in a predictable way
[...] Examples include [...] and names deduced from hardware
properties (including being given explicitly by the firmware).
Expose that information properly for the benefit of userspace tools
that make decisions based on the name_assign_type attribute,
e.g. a systemd-udev rule with "kernel" in NamePolicy.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.faineli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following two patches each have a (small) chance of causing
regressions for userspace and will in that case of course need to be
reverted.
In order to prepare for that and make those two patches independent
and individually revertable, refactor the code which sets the names
for user ports by moving the "fall back to eth%d if no label is given
in device tree" to dsa_slave_create().
No functional change (at least none intended).
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.faineli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These cases were done with this Coccinelle:
@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- + E
- - E
)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- - E
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- - E
+ F
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- + E
+ F
- - E
)
And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
These cases were done with this Coccinelle:
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I > E);
+ I = get_random_u32_below(E + 1);
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I >= E);
+ I = get_random_u32_below(E);
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I < E);
+ I = get_random_u32_above(E - 1);
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I <= E);
+ I = get_random_u32_above(E);
@@
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (!I);
+ I = get_random_u32_above(0);
@@
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I == 0);
+ I = get_random_u32_above(0);
@@
expression E;
@@
- E + 1 + get_random_u32_below(U32_MAX - E)
+ get_random_u32_above(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Annotate the lockless read of queue->synflood_warned.
Following xchg() has the needed data-race resolution.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The valptr pointer is of (void *) type, so other pointers need not be
forced to assign values to it.
Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On embedded systems with little memory and no relevant
security concerns, it is beneficial to reduce the size
of the table.
Reducing the size from 2^16 to 2^8 saves 255 KiB
of kernel RAM.
Makes the table size configurable as an expert option.
The size was previously increased from 2^8 to 2^16
in commit 4c2c8f03a5 ("tcp: increase source port perturb table to
2^16").
Signed-off-by: Gleb Mazovetskiy <glex.spb@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_user_data has multiple users, which are not compatible with each
other. Writers must synchronize by grabbing the sk->sk_callback_lock.
l2tp currently fails to grab the lock when modifying the underlying tunnel
socket fields. Fix it by adding appropriate locking.
We err on the side of safety and grab the sk_callback_lock also inside the
sk_destruct callback overridden by l2tp, even though there should be no
refs allowing access to the sock at the time when sk_destruct gets called.
v4:
- serialize write to sk_user_data in l2tp sk_destruct
v3:
- switch from sock lock to sk_callback_lock
- document write-protection for sk_user_data
v2:
- update Fixes to point to origin of the bug
- use real names in Reported/Tested-by tags
Cc: Tom Parkin <tparkin@katalix.com>
Fixes: 3557baabf2 ("[L2TP]: PPP over L2TP driver core")
Reported-by: Haowei Yan <g1042620637@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of code paths in tunnels are lockless (eg NETIF_F_LLTX in tx).
Adopt SMP safe DEV_STATS_INC() to update dev->stats fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of code paths in tunnels are lockless (eg NETIF_F_LLTX in tx).
Adopt SMP safe DEV_STATS_{INC|ADD}() to update dev->stats fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot/KCSAN reported that multiple cpus are updating dev->stats.tx_error
concurrently.
This is because sit tunnels are NETIF_F_LLTX, meaning their ndo_start_xmit()
is not protected by a spinlock.
While original KCSAN report was about tx path, rx path has the same issue.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds an union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adopt atomic64_try_cmpxchg() and remove the loop,
to make the intent more obvious.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adopting atomic_try_cmpxchg() makes the code cleaner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adopt atomic_try_cmpxchg() which is slightly more efficient.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adopt atomic_long_try_cmpxchg() in mm_account_pinned_pages()
as it is slightly more efficient.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 2863 says:
The lowerLayerDown state is also a refinement on the down state.
This new state indicates that this interface runs "on top of" one or
more other interfaces (see ifStackTable) and that this interface is
down specifically because one or more of these lower-layer interfaces
are down.
DSA interfaces are virtual network devices, stacked on top of the DSA
master, but they have a physical MAC, with a PHY that reports a real
link status.
But since DSA (perhaps improperly) uses an iflink to describe the
relationship to its master since commit c084080151 ("dsa: set ->iflink
on slave interfaces to the ifindex of the parent"), default_operstate()
will misinterpret this to mean that every time the carrier of a DSA
interface is not ok, it is because of the master being not ok.
In fact, since commit c0a8a9c274 ("net: dsa: automatically bring user
ports down when master goes down"), DSA cannot even in theory be in the
lowerLayerDown state, because it just calls dev_close_many(), thereby
going down, when the master goes down.
We could revert the commit that creates an iflink between a DSA user
port and its master, especially since now we have an alternative
IFLA_DSA_MASTER which has less side effects. But there may be tooling in
use which relies on the iflink, which has existed since 2009.
We could also probably do something local within DSA to overwrite what
rfc2863_policy() did, in a way similar to hsr_set_operstate(), but this
seems like a hack.
What seems appropriate is to follow the iflink, and check the carrier
status of that interface as well. If that's down too, yes, keep
reporting lowerLayerDown, otherwise just down.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use udp_table directly in most places.
Instead, access it via net->ipv4.udp_table.
The access will be valid only while initialising udp_table
itself and creating/destroying each netns.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global udp_seq_afinfo.udp_table
to fetch a UDP hash table.
Instead, set NULL to udp_seq_afinfo.udp_table for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need udp_seq_afinfo.udp_table for UDP LITE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global sk->sk_prot->h.udp_table
to fetch a UDP hash table.
Instead, set NULL to sk->sk_prot->h.udp_table for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need sk->sk_prot->h.udp_table for UDP LITE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds no functional change and cleans up some functions
that the following patches touch around so that we make them tidy
and easy to review/revert. The change is mainly to keep reverse
christmas tree order.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When built with Control Flow Integrity, function prototypes between
caller and function declaration must match. These mismatches are visible
at compile time with the new -Wcast-function-type-strict in Clang[1].
Fix a total of 73 warnings like these:
drivers/net/wireless/intersil/orinoco/wext.c:1379:27: warning: cast from 'int (*)(struct net_device *, struct iw_request_info *, struct iw_param *, char *)' to 'iw_handler' (aka 'int (*)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') converts to incompatible function type [-Wcast-function-type-strict]
IW_HANDLER(SIOCGIWPOWER, (iw_handler)orinoco_ioctl_getpower),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../net/wireless/wext-compat.c:1607:33: warning: cast from 'int (*)(struct net_device *, struct iw_request_info *, struct iw_point *, char *)' to 'iw_handler' (aka 'int (*)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') converts to incompatible function type [-Wcast-function-type-strict]
[IW_IOCTL_IDX(SIOCSIWGENIE)] = (iw_handler) cfg80211_wext_siwgenie,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/wireless/intersil/orinoco/wext.c:1390:27: error: incompatible function pointer types initializing 'const iw_handler' (aka 'int (*const)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') with an expression of type 'int (struct net_device *, struct iw_request_info *, struct iw_param *, char *)' [-Wincompatible-function-pointer-types]
IW_HANDLER(SIOCGIWRETRY, cfg80211_wext_giwretry),
^~~~~~~~~~~~~~~~~~~~~~
The cfg80211 Wireless Extension handler callbacks (iw_handler) use a
union for the data argument. Actually use the union and perform explicit
member selection in the function body instead of having a function
prototype mismatch. There are no resulting binary differences
before/after changes.
These changes were made partly manually and partly with the help of
Coccinelle.
Link: https://github.com/KSPP/linux/issues/234
Link: https://reviews.llvm.org/D134831 [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/a68822bf8dd587988131bb6a295280cb4293f05d.1667934775.git.gustavoars@kernel.org
Fix network address validation on entry to uapi functions such as connect()
for AF_RXRPC. The check for address compatibility with the transport
socket isn't correct and allows an AF_INET6 address to be given to an
AF_INET socket, resulting in an oops now that rxrpc is calling
udp_sendmsg() directly.
Sample program:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/rxrpc.h>
static unsigned char ctrl[256] =
"\x18\x00\x00\x00\x00\x00\x00\x00\x10\x01\x00\x00\x01";
int main(void)
{
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
.transport_type = SOCK_DGRAM,
.transport_len = 28,
.transport.sin6.sin6_family = AF_INET6,
};
struct mmsghdr vec = {
.msg_hdr.msg_control = ctrl,
.msg_hdr.msg_controllen = 0x18,
};
int s;
s = socket(AF_RXRPC, SOCK_DGRAM, AF_INET);
if (s < 0) {
perror("socket");
exit(1);
}
if (connect(s, (struct sockaddr *)&srx, sizeof(srx)) < 0) {
perror("connect");
exit(1);
}
if (sendmmsg(s, &vec, 1, MSG_NOSIGNAL | MSG_MORE) < 0) {
perror("sendmmsg");
exit(1);
}
return 0;
}
If working properly, connect() should fail with EAFNOSUPPORT.
Fixes: ed472b0c87 ("rxrpc: Call udp_sendmsg() directly")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If rxrpc sees an IPv6 address, it assumes it can call udpv6_sendmsg() on it
- even if it got it on an IPv4 socket. Fix do_udp_sendmsg() to give an
error in such a case.
general protection fault, probably for non-canonical address
0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
...
RIP: 0010:ipv6_addr_v4mapped include/net/ipv6.h:749 [inline]
RIP: 0010:udpv6_sendmsg+0xd0a/0x2c70 net/ipv6/udp.c:1361
...
Call Trace:
do_udp_sendmsg net/rxrpc/output.c:27 [inline]
do_udp_sendmsg net/rxrpc/output.c:21 [inline]
rxrpc_send_abort_packet+0x73b/0x860 net/rxrpc/output.c:367
rxrpc_release_calls_on_socket+0x211/0x300 net/rxrpc/call_object.c:595
rxrpc_release_sock net/rxrpc/af_rxrpc.c:886 [inline]
rxrpc_release+0x263/0x5a0 net/rxrpc/af_rxrpc.c:917
__sock_release+0xcd/0x280 net/socket.c:650
sock_close+0x18/0x20 net/socket.c:1365
__fput+0x27c/0xa90 fs/file_table.c:320
task_work_run+0x16b/0x270 kernel/task_work.c:179
exit_task_work include/linux/task_work.h:38 [inline]
do_exit+0xb35/0x2a20 kernel/exit.c:820
do_group_exit+0xd0/0x2a0 kernel/exit.c:950
__do_sys_exit_group kernel/exit.c:961 [inline]
__se_sys_exit_group kernel/exit.c:959 [inline]
__x64_sys_exit_group+0x3a/0x50 kernel/exit.c:959
Fixes: ed472b0c87 ("rxrpc: Call udp_sendmsg() directly")
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In the initial commit dc452a471d ("net: dsa: introduce tagger-owned
storage for private and shared data"), we had a call to
tag_ops->disconnect(dst) issued from dsa_tree_free(), which is called at
tree teardown time.
There were problems with connecting to a switch tree as a whole, so this
got reworked to connecting to individual switches within the tree. In
this process, tag_ops->disconnect(ds) was made to be called only from
switch.c (cross-chip notifiers emitted as a result of dynamic tag proto
changes), but the normal driver teardown code path wasn't replaced with
anything.
Solve this problem by adding a function that does the opposite of
dsa_switch_setup_tag_protocol(), which is called from the equivalent
spot in dsa_switch_teardown(). The positioning here also ensures that we
won't have any use-after-free in tagging protocol (*rcv) ops, since the
teardown sequence is as follows:
dsa_tree_teardown
-> dsa_tree_teardown_master
-> dsa_master_teardown
-> unsets master->dsa_ptr, making no further packets match the
ETH_P_XDSA packet type handler
-> dsa_tree_teardown_ports
-> dsa_port_teardown
-> dsa_slave_destroy
-> unregisters DSA net devices, there is even a synchronize_net()
in unregister_netdevice_many()
-> dsa_tree_teardown_switches
-> dsa_switch_teardown
-> dsa_switch_teardown_tag_protocol
-> finally frees the tagger-owned storage
Fixes: 7f2973149c ("net: dsa: make tagging protocols connect to individual switches from a tree")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20221114143551.1906361-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As of now, no DSA driver uses a custom link mode validation procedure
anymore. So remove this DSA operation and let phylink determine what is
supported based on config->mac_capabilities (if provided by the driver).
Leave a comment why we left the code that we did, and that there is more
work to do.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Fix sparse warning in the new nft_inner expression, reported
by Jakub Kicinski.
2) Incorrect vlan header check in nft_inner, from Peng Wu.
3) Two patches to pass reset boolean to expression dump operation,
in preparation for allowing to reset stateful expressions in rules.
This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter.
4) Inconsistent indentation in nft_fib, from Jiapeng Chong.
5) Speed up siphash calculation in conntrack, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: conntrack: use siphash_4u64
netfilter: rpfilter/fib: clean up some inconsistent indenting
netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3()
netfilter: nft_payload: use __be16 to store gre version
====================
Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
x25_lapb_receive_frame() using skb_copy() to get a private copy of
skb, the new skb should be freed in the undersized/fragmented skb
error handling path. Otherwise there is a memory leak.
Fixes: cb101ed2c3 ("x25: Handle undersized/fragmented skbs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20221114110519.514538-1-weiyongjun@huaweicloud.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a metadata dst is present with the type METADATA_HW_PORT_MUX on a dsa cpu
port netdev, assume that it carries the port number and that there is no DSA
tag present in the skb data.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare fore this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch fixes a frame size warning, reported by kernel test robot.
>> net/dcb/dcbnl.c:1230:1: warning: the frame size of 1244 bytes is
>> larger than 1024 bytes [-Wframe-larger-than=]
The getapptrust part of dcbnl_ieee_fill is moved to a separate function,
and the selector array is now dynamically allocated, instead of stack
allocated.
Tested on microchip sparx5 driver.
Fixes: 6182d5875c ("net: dcb: add new apptrust attribute")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20221114092950.2490451-1-daniel.machon@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The bridge driver can offload VLANs to the underlying hardware either
via switchdev or the 8021q driver. When the former is used, the VLAN is
marked in the bridge driver with the 'BR_VLFLAG_ADDED_BY_SWITCHDEV'
private flag.
To avoid the memory leaks mentioned in the cited commit, the bridge
driver will try to delete a VLAN via the 8021q driver if the VLAN is not
marked with the previously mentioned flag.
When the VLAN protocol of the bridge changes, switchdev drivers are
notified via the 'SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL' attribute, but
the 8021q driver is also called to add the existing VLANs with the new
protocol and delete them with the old protocol.
In case the VLANs were offloaded via switchdev, the above behavior is
both redundant and buggy. Redundant because the VLANs are already
programmed in hardware and drivers that support VLAN protocol change
(currently only mlx5) change the protocol upon the switchdev attribute
notification. Buggy because the 8021q driver is called despite these
VLANs being marked with 'BR_VLFLAG_ADDED_BY_SWITCHDEV'. This leads to
memory leaks [1] when the VLANs are deleted.
Fix by not calling the 8021q driver for VLANs that were already
programmed via switchdev.
[1]
unreferenced object 0xffff8881f6771200 (size 256):
comm "ip", pid 446855, jiffies 4298238841 (age 55.240s)
hex dump (first 32 bytes):
00 00 7f 0e 83 88 ff ff 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000012819ac>] vlan_vid_add+0x437/0x750
[<00000000f2281fad>] __br_vlan_set_proto+0x289/0x920
[<000000000632b56f>] br_changelink+0x3d6/0x13f0
[<0000000089d25f04>] __rtnl_newlink+0x8ae/0x14c0
[<00000000f6276baf>] rtnl_newlink+0x5f/0x90
[<00000000746dc902>] rtnetlink_rcv_msg+0x336/0xa00
[<000000001c2241c0>] netlink_rcv_skb+0x11d/0x340
[<0000000010588814>] netlink_unicast+0x438/0x710
[<00000000e1a4cd5c>] netlink_sendmsg+0x788/0xc40
[<00000000e8992d4e>] sock_sendmsg+0xb0/0xe0
[<00000000621b8f91>] ____sys_sendmsg+0x4ff/0x6d0
[<000000000ea26996>] ___sys_sendmsg+0x12e/0x1b0
[<00000000684f7e25>] __sys_sendmsg+0xab/0x130
[<000000004538b104>] do_syscall_64+0x3d/0x90
[<0000000091ed9678>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fixes: 279737939a ("net: bridge: Fix VLANs memory leak")
Reported-by: Vlad Buslov <vladbu@nvidia.com>
Tested-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20221114084509.860831-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk->sk_receive_queue is protected by skb queue lock, but for KCM
sockets its RX path takes mux->rx_lock to protect more than just
skb queue. However, kcm_recvmsg() still only grabs the skb queue
lock, so race conditions still exist.
We can teach kcm_recvmsg() to grab mux->rx_lock too but this would
introduce a potential performance regression as struct kcm_mux can
be shared by multiple KCM sockets.
So we have to enforce skb queue lock in requeue_rx_msgs() and handle
skb peek case carefully in kcm_wait_data(). Fortunately,
skb_recv_datagram() already handles it nicely and is widely used by
other sockets, we can just switch to skb_recv_datagram() after
getting rid of the unnecessary sock lock in kcm_recvmsg() and
kcm_splice_read(). Side note: SOCK_DONE is not used by KCM sockets,
so it is safe to get rid of this check too.
I ran the original syzbot reproducer for 30 min without seeing any
issue.
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
Reported-by: shaozhengchao <shaozhengchao@huawei.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20221114005119.597905-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This function is used for every packet, siphash_4u64 is noticeably faster
than using local buffer + siphash:
Before:
1.23% kpktgend_0 [kernel.vmlinux] [k] __siphash_unaligned
0.14% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw
After:
0.79% kpktgend_0 [kernel.vmlinux] [k] siphash_4u64
0.15% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw
In the pktgen test this gives about ~2.4% performance improvement.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful
expressions like counters or quotas. The latter two are the only
consumers, adjust their 'dump' callbacks to respect the parameter
introduced earlier.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a 'reset' flag just like with nft_object_ops::dump. This will be
useful to reset "anonymous stateful objects", e.g. simple rule counters.
No functional change intended.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of having to pass multiple arguments that describe the register,
pass the bpf_reg_state into the btf_struct_access callback. Currently,
all call sites simply reuse the btf and btf_id of the reg they want to
check the access of. The only exception to this pattern is the callsite
in check_ptr_to_map_access, hence for that case create a dummy reg to
simulate PTR_TO_BTF_ID access.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The SYN flood message prints the listening port number, but with many
processes bound to the same port on different IPs, it's impossible to
tell which socket is the problem.
Add the listen IP address to the SYN flood message.
For IPv6 use "[IP]:port" as per RFC-5952 and to provide ease of
copy-paste to "ss" filters. For IPv4 use "IP:port" to match.
Each protcol's "any" address and a host address now look like:
Possible SYN flooding on port 0.0.0.0:9001.
Possible SYN flooding on port 127.0.0.1:9001.
Possible SYN flooding on port [::]:9001.
Possible SYN flooding on port [fc00::1]:9001.
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/4fedab7ce54a389aeadbdc639f6b4f4988e9d2d7.1668386107.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are multi-generational drivers like mv88e6xxx which have code like
this:
int mv88e6xxx_port_hwtstamp_get(struct dsa_switch *ds, int port,
struct ifreq *ifr)
{
if (!chip->info->ptp_support)
return -EOPNOTSUPP;
...
}
DSA wants to deny PTP timestamping on the master if the switch supports
timestamping too. However it currently relies on the presence of the
port_hwtstamp_get() callback to determine PTP capability, and this
clearly does not work in that case (method is present but returns
-EOPNOTSUPP).
We should not deny PTP on the DSA master for those switches which truly
do not support hardware timestamping.
Create a dsa_port_supports_hwtstamp() method which actually probes for
support by calling port_hwtstamp_get() and seeing whether that returned
-EOPNOTSUPP or not.
Fixes: f685e609a3 ("net: dsa: Deny PTP on master if switch supports it")
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20221110124345.3901389-1-festevam@gmail.com/
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Steffen Bätz <steffen@innosonix.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new flow_rule_match_arp function that allows drivers
to be able to dissect ARP frames.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The value of 'st->state' has been verified as "TCP_SEQ_STATE_LISTENING",
it's unnecessary to assign TCP_SEQ_STATE_LISTENING to it, so we can remove it.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
When connecting to client timeout, disconnect client for twice in
chnl_net_open(). Remove one. Compile tested only.
Fixes: 2aa40aef9d ("caif: Use link layer MTU instead of fixed MTU")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix rxrpc_encap_err_rcv() to make the call to ipv6_icmp_error conditional
on IPV6 support being enabled.
Fixes: b6c66c4324 ("rxrpc: Use the core ICMP/ICMP6 parsers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Use try_cmpxchg() (instead of cmpxchg()) in a more readable way.
oval = smp_load_acquire(&sk->sk_tsq_flags);
do {
...
} while (!try_cmpxchg(&sk->sk_tsq_flags, &oval, nval));
Reduce indentation level.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221110190239.3531280-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
try_cmpxchg() is slighly more efficient (at least on x86),
and smp_load_acquire(&sk->sk_tsq_flags) could avoid a KCSAN report.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221110174829.3403442-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No changes in generated code.
Reported-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20221110085422.521059-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We kept getting initial patches from new contributors to remove a
duplicate 'the' (since grammar checking scripts flag it), but submitters
never followed up after code review.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use '(struct sock *)msk' to get 'sk' from 'msk' in a more direct way
instead of using '&msk->sk.icsk_inet.sk'.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>