As part of creating the tunnel headers while offloading TC encap rules,
we resolve the route and neighbour in order to get the source /
destination mac.
Since the way we offload multipath route is by having two HW rules,
one per uplink port, doing naive route lookup might get us a "wrong"
routing path which goes through the peer uplink and this will get us
eventually to create a wrong L2 header for the tunnel.
To avoid that, we use a device hint to get the correct route.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath when encap rules are duplicated to HW in the driver,
it's possible for one flow to be currently un-offloaded (e.g. lack of
next-hop route or neigh entry) while the other flow is offloaded. As
such, we move to query the counters of both flows at all times.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath it's possible for us to offload the flow only through
the e-switch for which proper route through the uplink exists.
When the port is up and the next-hop route is set again we want to
offload through it as well.
We generate SW event from the FIB event handler when multipath port
affinity changes. The tc offloads code gets this event, goes over the
flows which were marked as of having missing route and attempts to
offload them.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Under multipath offload scheme, as part of handling fib events, emit
mlx5 port affinity event on the enabled ports which will be handled by
the tc offloads code.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In a similar manner to uplink/VF LAG, under multipath we add encap peer
rule on the second port as well.
However, unlike the LAG case, we do want to allow failure for adding
one of the rules. This happens due to using a routing hint while doing
the route lookup when one path (next hop device) is down.
Introduce a new flag to indicate that route lookup failed for encap
flow. Note that a flow may still not be offloaded to hw due to missing
neighbour, in that case, the neigh update event will take care of it.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently the peer flow inherits the flags from the original flow
after we've set it. At this time the flags are set according to
the flow state, e.g marked as going to slow path and such.
Even if not getting us to real bugs now, this opens the door to
get us to troubles later. Future proof the code and avoid the
inheritance, use the peer flags as were set on input when we
started adding the original flow.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To support multipath offload we are going to track SW multipath route
and related nexthops. To do that we register to FIB notifier and handle
the route and next-hops events and reflect that as port affinity to HW.
When there is a new multipath route entry that all next-hops are the
ports of an HCA we will activate LAG in HW.
Egress wise, we use HW LAG as the means to emulate multipath on current
HW which doesn't support port selection based on xmit hash. In the
presence of multiple VFs which use multiple SQs (send queues) this
yields fairly good distribution.
HA wise, HW LAG buys us the ability for a given RQ (receive queue) to
receive traffic from both ports and for SQs to migrate xmitting over
the active port if their base port fails.
When the route entry is being updated to single path we will update
the HW port affinity to use that port only.
If a next-hop becomes dead we update the HW port affinity to the living
port.
When all next-hops are alive again we reset the affinity to default.
Due to FW/HW limitations, when a route is deleted we are not disabling
the HW LAG since doing so will not allow us to enable it again while
VFs are bounded. Typically this is just a temporary state when a
routing daemon removes dead routes and later adds them back as needed.
This patch only handles events for AF_INET.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In order to offload ecmp-on-host scheme where next-hop routes are used,
we will make use of HW LAG. Add accessor function to let upper layers
in the driver to realize if the lag acts in multi-path mode.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of using the system workqueue, allocate our own workqueue.
This workqueue will be used to handle more work in the next patch.
This patch doesn't change functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The change is a refactoring step towards a multipath use case.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This fix checkpatch check
CHECK: Avoid using bool structure members because of possible alignment
issues
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
EAGAIN is treated as a specific case when we consider the attachment
successful but wait for neigh event before offloading the flow.
This can result in unwanted behavior when sub calls on the offloading
path will return EAGAIN and we pass this error up.
Instead of attaching to a specific error code return a boolean value
from the attach encap operation saying if the encap is valid or not.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Remove the tunnel info argument which we can get from the other args.
Also reorder the args to have input args first and output args later.
This patch doesn't change functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Function mlx5e_tx_reporter_recover_from_ctx is only used within mlx5e tx
reporter, move it to be statically declared in en/reporter_tx.c.
Fixes: de8650a820 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
'ip' can switch network namespaces internally and then run a given
command relative to that namespace without the need to fork and exec
another ip instance. Update all references of the form:
ip netns exec "$testns" ip ...
to
ip -netns "$testns" ...
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann says:
====================
s390/qeth: updates 2019-02-28
please apply one more qeth patch series for net-next. This eliminates
some of the quirks in our reset code, and slims down the internal
state machine.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that qeth always uses dev_close() to shutdown the interface, we can
trust the locking and remove some custom state checks.
qeth_l?_stop_card() is no longer called for a card in UP state, so remove
the checks there too. This basically makes the UP state obsolete, so rip
out the whole thing (except for the sysfs-visible string).
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It makes no difference whether we
1. manually disarm the HW trap and call the offline code with
recovery_mode == 1, or
2. call the offline code with recovery_mode == 0, and let it disarm the
HW trap for us.
So consolidate the two code paths in the suspend callback.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The qeth-wide workqueue is now only used by a single caller to schedule
close_dev work. Just put it on a system queue instead.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recovery code already runs in a kthread, we don't have to defer the
offlining further.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smatch complains that __qeth_l3_set_offline() first accesses card->dev,
and then later checks whether the pointer is valid.
Since commit d3d1b205e8 ("s390/qeth: allocate netdevice early"), the
pointer is _always_ valid - that patch merely missed to remove this one
check.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When resetting an interface ("recovery"), qeth currently attempts to
elide the call to dev_close(). We initially only call .ndo_close to
quiesce the data path, and then offline & online the ccwgroup device.
If the reset succeeded, a call to .ndo_open then resumes the data path
along with some internal setup (dev_addr validation, RX modeset) that
dev_open() would have usually triggered.
dev_close() only gets called (via the close_dev worker) if the reset
action fails.
It's unclear whether this was initially done due to locking concerns, or
rather to execute the reset transparently. Either way, temporarily
closing the interface without dev_close() is fragile, and means we're
susceptible to various races and unexpected behaviour. For instance:
- Bypassing dev_deactivate_many() means that the qdiscs are not set to
__QDISC_STATE_DEACTIVATED. Consequently any intermittent TX completion
can wake up the txq, resulting in calls to .ndo_start_xmit while the
data path is down. We have custom state checking to detect this case
and drop such packets.
- Because the IFF_UP flag doesn't reflect the interface's actual state
during a reset, we have custom state checking in .ndo_open and
.ndo_close to guard against invalid calls.
- Considering that the reset might take a considerable amount of time
(in particular if an IO fails and we end up waiting for its timeout), we
_do_ want NETDEV_GOING_DOWN and NETDEV_DOWN events so that components
like bonding, team, bridge, macvlan, vlan, ... can take appropriate
action.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In its attempt to run only the minimal amount of tear down steps,
qeth_l2_stop_card() fails to reset the "is dev_addr registered?" flag
in some rare scenarios. But a future change to the tear down sequence
would cause us to _always_ hit this issue, so patch it up before that
code lands.
Fix it by unconditionally clearing the flag bit. This also allows us to
remove the additional cleanup step in qeth_dev_layer2_store().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting a L2 qeth device online, enable the HW trap as soon as the
control plane is available. This allows us to catch any error that
occurs during the very first commands.
In the same spirit, the offline code should disable the HW trap as the
very first step of its processing.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offline code uses a specific RECOVER state to indicate that the
interface should be brought up when a qeth device is set online again.
Rather than having a specific card-state for this, just put it in an
internal flag bit and set the state to DOWN. When working with the
card's state transitions, this reduces the complexity quite a bit.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without hardware pnetid support there must currently be a pnet
table configured to determine the IB device port to be used for SMC
RDMA traffic. This patch enables a setup without pnet table, if
the used handshake interface belongs already to a RoCE port.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As per RFC 8033, it is sufficient for the drop probability
decay factor to have a value of (1 - 1/64) instead of 98%.
This avoids the need to do slow division.
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we are not able to reach firmware, enter debugging mode that will
help us to get adapter logs.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
T6 adapters support outer UDP checksum offload for
encapsulated packets, hence enabling netdev feature flag
NETIF_F_GSO_UDP_TUNNEL_CSUM.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO is done by cxgb4/cxgb4vf. Hence set NETIF_F_GRO flag for
both cxgb4/cxgb4vf.
Cleaned up VLAN netdev features in cxgb4vf. Also fixed
NETIF_F_HIGHDMA being set unconditionally for vlan netdev
features.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The csum calculation is different for IPv4/6. For VLAN packets,
tc_skb_protocol returns the VLAN protocol rather than the packet's one
(e.g. IPv4/6), so csum is not calculated. Furthermore, VLAN may not be
stripped so csum is not calculated in this case too. Calculate the
csum for those cases.
Fixes: d8b9605d26 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
instance = devm_kzalloc(dev, sizeof(struct foo) + sizeof(struct boo) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = devm_kzalloc(dev, struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier says:
====================
net: phy: marvell10g: Clean .get_features by using C45 helpers
Recent work on C45 helpers by Heiner made the
genphy_c45_pma_read_abilities function generic enough to use as a
default .get_featutes implementation.
This series removes the remaining redundant code in
mv3310_get_features(), and makes the 2110 PHY use
genphy_c45_pma_read_abilities() directly, since it doesn't have the
issue with the wrong abilities being reported.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Contrary to the 3310, the 2110 PHY correctly reports it's 2.5G/5G
abilities. We can therefore use the genphy_c45_pma_read_abilities helper
to build the list of features.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The genphy_c45_pma_read_abilities helper now sets the Autoneg ability
in phydev->supported according to what the AN MMD reports.
We therefore don't need to manually do that in mv3310_get_features().
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for Generic Mux controls, when Mdio mux node is a consumer
of mux produced by some other device.
Signed-off-by: Pankaj Bansal <pankaj.bansal@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we use the bindings defined in Documentation/devicetree/bindings/mux
to define mdio mux in producer and consumer terms, it results in two
devices. one is mux producer and other is mux consumer.
Add the bindings needed for Mdio mux consumer devices.
Signed-off-by: Pankaj Bansal <pankaj.bansal@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current fib_multipath_hash_policy can make hash based on the L3 or
L4. But it only work on the outer IP. So a specific tunnel always
has the same hash value. But a specific tunnel may contain so many
inner connections.
This patch provide a generic multipath_hash in floi_common. It can
make a user-define hash which can mix with L3 or L4 hash.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
net: Remove switchdev_ops
This patch series completes the removal of the switchdev_ops by
converting switchdev_port_attr_set() to use either the blocking
(process) or non-blocking (atomic) notifier since we typically need to
deal with both depending on where in the bridge code we get called from.
This was tested with the forwarding selftests and DSA hardware.
Ido, hopefully this captures your comments done on v1, if not, can you
illustrate with some pseudo-code what you had in mind if that's okay?
Changes in v3:
- added Reviewed-by tags from Ido where relevant
- added missing notifier_to_errno() in net/bridge/br_switchdev.c when
calling the atomic notifier for PRE_BRIDGE_FLAGS
- kept mlxsw_sp_switchdev_init() in mlxsw/
Changes in v2:
- do not check for SWITCHDEV_F_DEFER when calling the blocking notifier
and instead directly call the atomic notifier from the single location
where this is required
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have converted all possible callers to using a switchdev
notifier for attributes we do not have a need for implementing
switchdev_ops anymore, and this can be removed from all drivers the
net_device structure.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop switchdev_ops.switchdev_port_attr_set. Drop the uses of this field
from all clients, which were migrated to use switchdev notification in
the previous patches.
Add a new function switchdev_port_attr_notify() that sends the switchdev
notifications SWITCHDEV_PORT_ATTR_SET and calls the blocking (process)
notifier chain.
We have one odd case within net/bridge/br_switchdev.c with the
SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS attribute identifier that
requires executing from atomic context, we deal with that one
specifically.
Drop __switchdev_port_attr_set() and update switchdev_port_attr_set()
likewise.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way we communicate setting a port's
attribute and use a blocking notifier to perform those tasks.
Prepare ethsw to support receiving notifier events targeting
SWITCHDEV_PORT_ATTR_SET and simply translate that into the existing
swdev_port_attr_set() call.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way we communicate setting a port's
attribute and use notifiers to perform those tasks.
Ocelot does not currently have an atomic notifier registered for
switchdev events, so we need to register one in order to deal with
atomic context SWITCHDEV_PORT_ATTR_SET events.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way we communicate setting a port's
attribute and use a notifier to perform those tasks.
Prepare mlxsw to support receiving notifier events targeting
SWITCHDEV_PORT_ATTR_SET and utilize the switchdev_handle_port_attr_set()
to handle stacking of devices.
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way we communicate setting a port's
attribute and use notifiers towards that goal.
Prepare DSA to support receiving notifier events targeting
SWITCHDEV_PORT_ATTR_SET from both atomic and process context and use a
small helper to translate the event notifier into something that
dsa_slave_port_attr_set() can process.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way we communicate setting a port's
attribute and use notifiers towards that goal.
Prepare rocker to support receiving notifier events targeting
SWITCHDEV_PORT_ATTR_SET from both atomic and process context and use a
small helper to translate the event notifier into something that
rocker_port_attr_set() can process.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing switchdev enabled drivers to veto specific
attribute settings from within the context of the caller, introduce a
new switchdev notifier type for port attributes.
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 31a9984876 ("net: sched: fw: don't set arg->stop in
fw_walk() when empty")
Cls API function tcf_proto_is_empty() was changed in commit
6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
to no longer depend on arg->stop to determine that classifier instance is
empty. Instead, it adds dedicated arg->nonempty field, which makes the fix
in fw classifier no longer necessary.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the .cmd member by using a designated struct
initializer. This fixes warning of missing field initializers,
and makes code a little easier to read.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>