Commit Graph

984334 Commits

Author SHA1 Message Date
Loic Poulain e6ec3ccd4e net: mhi: Get RX queue size from MHI core
The RX queue size can be determined at runtime by retrieving the
number of available transfer descriptors.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 19:41:59 -08:00
Jakub Kicinski 2bca263cda Merge branch 'mhi-net-immutable' of https://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi
Needed by mhi-net patches.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 19:40:52 -08:00
Jan Luebbe 5daf83846c docs: networking: timestamping: fix section title markup
This section was missed during the conversion to ReST, so convert it in the
same style as the surrounding section titles.

Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lore.kernel.org/r/20210128111930.29473-1-jlu@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 19:14:23 -08:00
dingsenjie afa4f675aa net/ethernet: convert to use module_platform_driver in octeon_mgmt.c
Simplify the code by using module_platform_driver macro
for octeon_mgmt.

Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210128035330.17676-1-dingsenjie@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 19:02:43 -08:00
Emil Renner Berthing a58745979c net: atm: pppoatm: use new API for wakeup tasklet
This converts the driver to use the new tasklet API introduced in
commit 12cc923f1c ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-2-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:24:05 -08:00
Emil Renner Berthing a5b88632fc net: atm: pppoatm: use tasklet_init to initialize wakeup tasklet
Previously a temporary tasklet structure was initialized on the stack
using DECLARE_TASKLET_OLD() and then copied over and modified. Nothing
else in the kernel seems to use this pattern, so let's just call
tasklet_init() like everyone else.

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:24:05 -08:00
Jakub Kicinski 810e754c7b Merge branch 'net-sched-cls_flower-add-support-for-matching-on-ct_state-reply-flag'
Paul Blakey says:

====================
net/sched: cls_flower: Add support for matching on ct_state reply flag

This patchset adds software match support and offload of flower
match ct_state reply flag (+/-rpl).

The first patch adds the definition for the flag and match to flower.

Second patch gives the direction of the connection to the offloading
drivers via ct_metadata flow offload action.

The last patch does offload of this new ct_state by using the supplied
connection's direction.
====================

Link: https://lore.kernel.org/r/1611757967-18236-1-git-send-email-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:32 -08:00
Paul Blakey 6895cb3a95 net/mlx5: CT: Add support for matching on ct_state reply flag
Add support for matching on ct_state reply flag.

Example:
$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est+rpl \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est-rpl \
  action mirred egress redirect dev ens1f0_0

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:30 -08:00
Paul Blakey 941eff5aea net: flow_offload: Add original direction flag to ct_metadata
Give offloading drivers the direction of the offloaded ct flow,
this will be used for matches on direction (ct_state +/-rpl).

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:30 -08:00
Paul Blakey 8c85d18ce6 net/sched: cls_flower: Add match on the ct_state reply flag
Add match on the ct_state reply flag.

Example:
$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est+rpl \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est-rpl \
  action mirred egress redirect dev ens1f0_0

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:30 -08:00
Jakub Kicinski cf3c7c7b37 Merge branch 'add-nci-suit-and-virtual-nci-device-driver'
Bongsu Jeon says:

====================
Add nci suit and virtual nci device driver

1/2 is the Virtual NCI device driver.
2/2 is the NCI selftest suite
====================

Link: https://lore.kernel.org/r/20210127130829.4026-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:03:37 -08:00
Bongsu Jeon f595cf1242 selftests: Add nci suite
This is the NCI test suite. It tests the NFC/NCI module using virtual NCI
device. Test cases consist of making the virtual NCI device on/off and
controlling the device's polling for NCI1.0 and NCI2.0 version.

Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:03:33 -08:00
Bongsu Jeon e624e6c3e7 nfc: Add a virtual nci device driver
NCI virtual device simulates a NCI device to the user. It can be used to
validate the NCI module and applications. This driver supports
communication between the virtual NCI device and NCI module.

Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:03:33 -08:00
Menglong Dong 8c22475148 net: packet: make pkt_sk() inline
It's better make 'pkt_sk()' inline here, as non-inline function
shouldn't occur in headers. Besides, this function is simple
enough to be inline.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210127123302.29842-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 17:38:55 -08:00
Andrea Parri (Microsoft) 0ba35fe91c hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Pointers to receive-buffer packets sent by Hyper-V are used within the
guest VM.  Hyper-V can send packets with erroneous values or modify
packet fields after they are processed by the guest.  To defend against
these scenarios, copy (sections of) the incoming packet after validating
their length and offset fields in netvsc_filter_receive().  In this way,
the packet can no longer be modified by the host.

Reported-by: Juan Vazquez <juvazq@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20210126162907.21056-1-parri.andrea@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 16:44:07 -08:00
Bjorn Helgaas 46eb3c108f octeontx2-af: Fix 'physical' typos
Fix misspellings of "physical".

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210127181359.3008316-1-helgaas@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 21:24:47 -08:00
dingsenjie 1d3f9bb1be linux/qed: fix spelling typo in qed_chain.h
allocted -> allocated

Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210127022801.8028-1-dingsenjie@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 21:24:40 -08:00
Jakub Kicinski 67d25ce891 Merge branch 'nexthop-preparations-for-resilient-next-hop-groups'
Petr Machata says:

====================
nexthop: Preparations for resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
	      under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes an issue that packets from a single
flow suddenly end up arriving at a server that does not expect them, which
may lead to TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way causes the packets to arrive in wrong order.

Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore when a next hop is deleted
the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	                      v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
	      under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows are ideally
kept being forwarded to the same endpoints through the same paths as before
the next-hop group change.

This patchset prepares the next-hop code for eventual introduction of
resilient hashing groups.

- Patches #1-#4 carry otherwise disjoint changes that just remove certain
  assumptions in the next-hop code.

- Patches #5-#6 extend the in-kernel next-hop notifiers to support more
  next-hop group types.

- Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
  will introduce a new logical object, a hash table bucket. It turns out
  that handling bucket-related messages is similar to how next-hop messages
  are handled. These patches extract the commonalities into reusable
  components.

The plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (this patchset)
3) Implementation of resilient next hop group
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================

Link: https://lore.kernel.org/r/cover.1611836479.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:59 -08:00
Petr Machata 0bccf8ed8a nexthop: Extract a helper for validation of get/del RTNL requests
Validation of messages for get / del of a next hop is the same as will be
validation of messages for get of a resilient next hop group bucket. The
difference is that policy for resilient next hop group buckets is a
superset of that used for next-hop get.

It is therefore possible to reuse the code that validates the nhmsg fields,
extracts the next-hop ID, and validates that. To that end, extract from
nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just
that.

Make the nlh argument const so that the function can be called from the
dump context, which only has a const nlh. Propagate the constness to
nh_valid_get_del_req().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:54 -08:00
Petr Machata e948217d25 nexthop: Add a callback parameter to rtm_dump_walk_nexthops()
In order to allow different handling for next-hop tree dumper and for
bucket dumper, parameterize the next-hop tree walker with a callback. Add
rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
dumping.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata cbee18071e nexthop: Extract a helper for walking the next-hop tree
Extract from rtm_dump_nexthop() a helper to walk the next hop tree. A
separate function for this will be reusable from the bucket dumper.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata a6fbbaa64c nexthop: Strongly-type context of rtm_dump_nexthop()
The dump operations need to keep state from one invocation to another. A
scratch area is dedicated for this purpose in the passed-in argument, cb,
namely via two aliased arrays, struct netlink_callback.args and .ctx.

Dumping of buckets will end up having to iterate over next hops as well,
and it would be nice to be able to reuse the iteration logic with the NH
dumper. The fact that the logic currently relies on fixed index to the
.args array, and the indices would have to be coordinated between the two
dumpers, makes this somewhat awkward.

To make the access patters clearer, introduce a helper struct with a NH
index, and instead of using the .args array directly, use it through this
structure.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata b9ebea1276 nexthop: Extract a common helper for parsing dump attributes
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. However, they
have different policies. To allow reuse of this code, extract a
policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
function into a thin wrapper around it.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata 56450ec6b7 nexthop: Extract dump filtering parameters into a single structure
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. In order to make
reuse of this code simpler, convert the code to use a single structure with
filtering configuration instead of passing around the parameters one by
one.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:52 -08:00
Petr Machata da230501f2 nexthop: Dispatch notifier init()/fini() by group type
After there are several next-hop group types, initialization and
finalization of notifier type needs to reflect the actual type. Transform
nh_notifier_grp_info_init() and _fini() to make extending them easier.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:52 -08:00
Ido Schimmel 09ad6becf5 nexthop: Use enum to encode notification type
Currently there are only two types of in-kernel nexthop notification.
The two are distinguished by the 'is_grp' boolean field in 'struct
nh_notifier_info'.

As more notification types are introduced for more next-hop group types, a
boolean is not an easily extensible interface. Instead, convert it to an
enum.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:52 -08:00
Petr Machata 720ccd9a72 nexthop: Assert the invariant that a NH group is of only one type
Most of the code that deals with nexthop groups relies on the fact that the
group is of exactly one well-known type. Currently there is only one type,
"mpath", but as more next-hop group types come, it becomes desirable to
have a central place where the setting is validated. Introduce such place
into nexthop_create_group(), such that the check is done before the code
that relies on that invariant is invoked.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
Petr Machata b9bae61be4 nexthop: Introduce to struct nh_grp_entry a per-type union
The values that a next-hop group needs to keep track of depend on the group
type. Introduce a union to separate fields specific to the mpath groups
from fields specific to other group types.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
Petr Machata 79bc55e3fe nexthop: Dispatch nexthop_select_path() by group type
The logic for selecting path depends on the next-hop group type. Adapt the
nexthop_select_path() to dispatch according to the group type.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
David Ahern 5d1f0f09b5 nexthop: Rename nexthop_free_mpath
nexthop_free_mpath really should be nexthop_free_group. Rename it.

Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
Jakub Kicinski 4915a40437 Merge branch 'net-iucv-updates-2021-01-28'
Julian Wiedmann says:

====================
net/iucv: updates 2021-01-28

This reworks & simplifies the TX notification path in af_iucv, so that we
can send out SG skbs over TRANS_HIPER sockets. Also remove a noisy
WARN_ONCE() in the RX path.
====================

Link: https://lore.kernel.org/r/20210128114108.39409-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:25 -08:00
Julian Wiedmann 2c3b4456c8 net/af_iucv: build SG skbs for TRANS_HIPER sockets
The TX path no longer falls apart when some of its SG skbs are later
linearized by lower layers of the stack. So enable the use of SG skbs
in iucv_sock_sendmsg() again.

This effectively reverts
commit dc5367bcc5 ("net/af_iucv: don't use paged skbs for TX on HiperSockets").

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:22 -08:00
Julian Wiedmann 80bc97aa0a net/af_iucv: don't track individual TX skbs for TRANS_HIPER sockets
Stop maintaining the skb_send_q list for TRANS_HIPER sockets.

Not only is it extra overhead, but keeping around a list of skb clones
means that we later also have to match the ->sk_txnotify() calls
against these clones and free them accordingly.
The current matching logic (comparing the skbs' shinfo location) is
frustratingly fragile, and breaks if the skb's head is mangled in any
sort of way while passing from dev_queue_xmit() to the device's
HW queue.

Also adjust the interface for ->sk_txnotify(), to make clear that we
don't actually care about any skb internals.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Julian Wiedmann ef6af7bdb9 net/af_iucv: count packets in the xmit path
The TX code keeps track of all skbs that are in-flight but haven't
actually been sent out yet. For native IUCV sockets that's not a huge
deal, but with TRANS_HIPER sockets it would be much better if we
didn't need to maintain a list of skb clones.

Note that we actually only care about the _count_ of skbs in this stage
of the TX pipeline. So as prep work for removing the skb tracking on
TRANS_HIPER sockets, keep track of the skb count in a separate variable
and pair any list {enqueue, unlink} with a count {increment, decrement}.

Then replace all occurences where we currently look at the skb list's
fill level.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Julian Wiedmann c464444fa2 net/af_iucv: don't lookup the socket on TX notification
Whoever called iucv_sk(sk)->sk_txnotify() must already know that they're
dealing with an af_iucv socket.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Alexander Egorenkov 27e9c1de52 net/af_iucv: remove WARN_ONCE on malformed RX packets
syzbot reported the following finding:

AF_IUCV failed to receive skb, len=0
WARNING: CPU: 0 PID: 522 at net/iucv/af_iucv.c:2039 afiucv_hs_rcv+0x174/0x190 net/iucv/af_iucv.c:2039
CPU: 0 PID: 522 Comm: syz-executor091 Not tainted 5.10.0-rc1-syzkaller-07082-g55027a88ec9f #0
Hardware name: IBM 3906 M04 701 (KVM/Linux)
Call Trace:
 [<00000000b87ea538>] afiucv_hs_rcv+0x178/0x190 net/iucv/af_iucv.c:2039
([<00000000b87ea534>] afiucv_hs_rcv+0x174/0x190 net/iucv/af_iucv.c:2039)
 [<00000000b796533e>] __netif_receive_skb_one_core+0x13e/0x188 net/core/dev.c:5315
 [<00000000b79653ce>] __netif_receive_skb+0x46/0x1c0 net/core/dev.c:5429
 [<00000000b79655fe>] netif_receive_skb_internal+0xb6/0x220 net/core/dev.c:5534
 [<00000000b796ac3a>] netif_receive_skb+0x42/0x318 net/core/dev.c:5593
 [<00000000b6fd45f4>] tun_rx_batched.isra.0+0x6fc/0x860 drivers/net/tun.c:1485
 [<00000000b6fddc4e>] tun_get_user+0x1c26/0x27f0 drivers/net/tun.c:1939
 [<00000000b6fe0f00>] tun_chr_write_iter+0x158/0x248 drivers/net/tun.c:1968
 [<00000000b4f22bfa>] call_write_iter include/linux/fs.h:1887 [inline]
 [<00000000b4f22bfa>] new_sync_write+0x442/0x648 fs/read_write.c:518
 [<00000000b4f238fe>] vfs_write.part.0+0x36e/0x5d8 fs/read_write.c:605
 [<00000000b4f2984e>] vfs_write+0x10e/0x148 fs/read_write.c:615
 [<00000000b4f29d0e>] ksys_write+0x166/0x290 fs/read_write.c:658
 [<00000000b8dc4ab4>] system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
Last Breaking-Event-Address:
 [<00000000b8dc64d4>] __s390_indirect_jump_r14+0x0/0xc

Malformed RX packets shouldn't generate any warnings because
debugging info already flows to dropmon via the kfree_skb().

Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Jakub Kicinski 14a6daf3a4 Merge branch 's390-qeth-updates-2021-01-28'
Julian Wiedmann says:

====================
s390/qeth: updates 2021-01-28

Nothing special, mostly fine-tuning and follow-on cleanups for earlier fixes.
====================

Link: https://lore.kernel.org/r/20210128112551.18780-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:01 -08:00
Julian Wiedmann d6e5150315 s390/qeth: don't fake a TX completion interrupt after TX error
When do_qdio() returns with an unexpected error, qeth_flush_buffers()
kicks off a recovery action.

In such a case there's no point in starting TX completion processing,
the device gets torn down anyway. So take a closer look at do_qdio()'s
return value, and skip the TX completion processing accordingly.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:35:58 -08:00
Julian Wiedmann a667fee181 s390/qeth: make cast type selection for af_iucv skbs robust
As part of the TX queue selection for af_iucv skbs,
qeth_l3_get_cast_type_rcu() ends up calling qeth_get_ether_cast_type().
Which is rather fragile, since such skbs don't have a proper ETH header
and we rely on it being zeroed out in the right places. Add a separate
case for ETH_P_AF_IUCV instead that does the right thing.

When later building the HW header for such skbs, don't hard-code the
cast type but follow the same path as for other protocol types. Here
the cast type should naturally come from the skb's queue mapping.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:35:58 -08:00
Julian Wiedmann c61dff3c1e s390/qeth: pass proto to qeth_l3_get_cast_type()
qeth_l3_hard_start_xmit() already determined the skb's proto. Avoid
doing so a second time when it calls qeth_l3_get_cast_type().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:35:58 -08:00
Julian Wiedmann 17f3a8b5f5 s390/qeth: remove qeth_get_ip_version()
Replace our home-grown helper with the more robust vlan_get_protocol().
This is pretty much a 1:1 replacement, we just need to pass around a
proper ETH_P_* everyhwere and convert the old value range.

For readability also convert the protocol checks in
qeth_l3_hard_start_xmit() to a switch statement.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:35:58 -08:00
Julian Wiedmann ea12f1b3c8 s390/qeth: clean up load/remove code for disciplines
We have two usage patterns:
1. get & ->setup() a new discipline, or
2. ->remove() & put the currently loaded one.

Add corresponding helpers that hide the internals & error handling.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:35:57 -08:00
Jakub Kicinski 699e4bc8c3 Merge branch 'net-ipa-hardware-pipeline-cleanup-fixes'
Alex Elder says:

====================
net: ipa: hardware pipeline cleanup fixes

There is a procedure currently referred to as a "tag process" that
is performed to clear the IPA hardware pipeline--either at the time
of a modem crash, or when suspending modem GSI channels.

One thing done in this procedure is issuing a command that sends a
data packet originating from the AP->command TX endpoint, destined
for the AP<-LAN RX (default) endpoint.  And although we currently
wait for the send to complete, we do *not* wait for the packet to be
received.  But the pipeline can't be assumed clear until we have
actually received this packet.

This series addresses this by detecting when the pipeline-clearing
packet has been received, and using a completion to allow a waiter
to know when that has happened.  This uses the IPA status capability
(which sends an extra status buffer for certain packets).  It also
uses the ability to supply a "tag" with a packet, which will be
delivered with the packet's status buffer.  We tag the data packet
that's sent to clear the pipeline, and use the receipt of a status
buffer associated with a tagged packet to determine when that packet
has arrived.

"Tag status" just desribes one aspect of this procedure, so some
symbols are renamed to be more like "pipeline clear" so they better
describe the larger purpose.  Finally, two functions used in this
code don't use their arguments, so those arguments are removed.
====================

Link: https://lore.kernel.org/r/20210126185703.29087-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:30 -08:00
Alex Elder 070740d389 net: ipa: don't pass size to ipa_cmd_transfer_add()
The only time we transfer data (rather than issuing a command) out
of the AP->command TX endpoint is when we're clearing the hardware
pipeline.  All that's needed is a "small" data buffer, and its
contents aren't even important.

For convenience, we just transfer a command structure in this case
(it's already mapped for DMA).  The TRE is added to a transaction
using ipa_cmd_ip_tag_status_add(), but we ignore the size value
provided to that function.  So just get rid of the size argument.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:26 -08:00
Alex Elder 792b75b147 net: ipa: don't pass tag value to ipa_cmd_ip_tag_status_add()
We only send a tagged packet from the AP->command TX endpoint when
we're clearing the hardware pipeline.  And when we receive the
tagged packet we don't care what the actual tag value is.

Stop passing a tag value to ipa_cmd_ip_tag_status_add(), and just
encode 0 as the tag sent.  Fix the function that encodes the tag so
it uses the proper byte ordering.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:26 -08:00
Alex Elder 51c48ce264 net: ipa: signal when tag transfer completes
There are times, such as when the modem crashes, when we issue
commands to clear the IPA hardware pipeline.  These commands include
a data transfer command that delivers a small packet directly to the
default (AP<-LAN RX) endpoint.

The places that do this wait for the transactions that contain these
commands to complete, but the pipeline can't be assumed clear until
the sent packet has been *received*.

The small transfer will be delivered with a status structure, and
that status will indicate its tag is valid.  This is the only place
we send a tagged packet, so we use the tag to determine when the
pipeline clear packet has arrived.

Add a completion to the IPA structure to to be used to signal
the receipt of a pipeline clear packet.  Create a new function
ipa_cmd_pipeline_clear_wait() that will wait for that completion.

Reinitialize the completion whenever pipeline clear commands are
added to a transaction.  Extend ipa_endpoint_status_tag() to check
whether a packet whose status contains a valid tag was sent from the
AP->command TX endpoint, and if so, signal the new IPA completion.

Have all callers of ipa_cmd_pipeline_clear_add() wait for the
pipeline clear indication after the transaction that clears the
pipeline has completed.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:26 -08:00
Alex Elder f6aba7b519 net: ipa: drop packet if status has valid tag
Introduce ipa_endpoint_status_tag(), which returns true if received
status indicates its tag field is valid.  The endpoint parameter is
not yet used.

Call this from ipa_status_drop_packet(), and drop the packet if the
status indicates the tag was valid.  Pass the endpoint pointer to
ipa_status_drop_packet(), and rename it ipa_endpoint_status_drop().
The endpoint will be used in the next patch.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:26 -08:00
Alex Elder 162fbc6f45 net: ipa: minor update to handling of packet with status
Rearrange some comments and assignments made when handling a packet
that is received with status, aiming to improve understandability.

Use DIV_ROUND_CLOSEST() to get a better per-packet true size estimate.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:25 -08:00
Alex Elder aa56e3e5cd net: ipa: rename "tag status" symbols
There is a set of functions and symbols related to performing
"tag_process" immediate commands to clear the IPA pipeline.  The
name is related to one of the commands issued when doing this, but
it doesn't really convey the overall purpose of taking this action.

The purpose is to take some steps to "clear out" the hardware
pipeline, and to wait until that process completes, to ensure the
IPA hardware is in a well-defined state.

Rename these symbols to use "pipeline_clear" in their names instead.
Add some comments to explain a bit more about what's going on.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:23:25 -08:00
Jesper Dangaard Brouer 28af22c6c8 net: adjust net_device layout for cacheline usage
The current layout of net_device is not optimal for cacheline usage.

The member adj_list.lower linked list is split between cacheline 2 and 3.
The ifindex is placed together with stats (struct net_device_stats),
although most modern drivers don't update this stats member.

The members netdev_ops, mtu and hard_header_len are placed on three
different cachelines. These members are accessed for XDP redirect into
devmap, which were noticeably with perf tool. When not using the map
redirect variant (like TC-BPF does), then ifindex is also used, which is
placed on a separate fourth cacheline. These members are also accessed
during forwarding with regular network stack. The members priv_flags and
flags are on fast-path for network stack transmit path in __dev_queue_xmit
(currently located together with mtu cacheline).

This patch creates a read mostly cacheline, with the purpose of keeping the
above mentioned members on the same cacheline.

Some netdev_features_t members also becomes part of this cacheline, which is
on purpose, as function netif_skb_features() is on fast-path via
validate_xmit_skb().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/161168277983.410784.12401225493601624417.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:19:06 -08:00