844 lines
42 KiB
ReStructuredText
844 lines
42 KiB
ReStructuredText
============
|
|
Architecture
|
|
============
|
|
|
|
This document describes the **Distributed Switch Architecture (DSA)** subsystem
|
|
design principles, limitations, interactions with other subsystems, and how to
|
|
develop drivers for this subsystem as well as a TODO for developers interested
|
|
in joining the effort.
|
|
|
|
Design principles
|
|
=================
|
|
|
|
The Distributed Switch Architecture is a subsystem which was primarily designed
|
|
to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line)
|
|
using Linux, but has since evolved to support other vendors as well.
|
|
|
|
The original philosophy behind this design was to be able to use unmodified
|
|
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
|
|
they configured/queried a switch port network device or a regular network
|
|
device.
|
|
|
|
An Ethernet switch is typically comprised of multiple front-panel ports, and one
|
|
or more CPU or management port. The DSA subsystem currently relies on the
|
|
presence of a management port connected to an Ethernet controller capable of
|
|
receiving Ethernet frames from the switch. This is a very common setup for all
|
|
kinds of Ethernet switches found in Small Home and Office products: routers,
|
|
gateways, or even top-of-the rack switches. This host Ethernet controller will
|
|
be later referred to as "master" and "cpu" in DSA terminology and code.
|
|
|
|
The D in DSA stands for Distributed, because the subsystem has been designed
|
|
with the ability to configure and manage cascaded switches on top of each other
|
|
using upstream and downstream Ethernet links between switches. These specific
|
|
ports are referred to as "dsa" ports in DSA terminology and code. A collection
|
|
of multiple switches connected to each other is called a "switch tree".
|
|
|
|
For each front-panel port, DSA will create specialized network devices which are
|
|
used as controlling and data-flowing endpoints for use by the Linux networking
|
|
stack. These specialized network interfaces are referred to as "slave" network
|
|
interfaces in DSA terminology and code.
|
|
|
|
The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
|
|
which is a hardware feature making the switch insert a specific tag for each
|
|
Ethernet frames it received to/from specific ports to help the management
|
|
interface figure out:
|
|
|
|
- what port is this frame coming from
|
|
- what was the reason why this frame got forwarded
|
|
- how to send CPU originated traffic to specific ports
|
|
|
|
The subsystem does support switches not capable of inserting/stripping tags, but
|
|
the features might be slightly limited in that case (traffic separation relies
|
|
on Port-based VLAN IDs).
|
|
|
|
Note that DSA does not currently create network interfaces for the "cpu" and
|
|
"dsa" ports because:
|
|
|
|
- the "cpu" port is the Ethernet switch facing side of the management
|
|
controller, and as such, would create a duplication of feature, since you
|
|
would get two interfaces for the same conduit: master netdev, and "cpu" netdev
|
|
|
|
- the "dsa" port(s) are just conduits between two or more switches, and as such
|
|
cannot really be used as proper network interfaces either, only the
|
|
downstream, or the top-most upstream interface makes sense with that model
|
|
|
|
Switch tagging protocols
|
|
------------------------
|
|
|
|
DSA supports many vendor-specific tagging protocols, one software-defined
|
|
tagging protocol, and a tag-less mode as well (``DSA_TAG_PROTO_NONE``).
|
|
|
|
The exact format of the tag protocol is vendor specific, but in general, they
|
|
all contain something which:
|
|
|
|
- identifies which port the Ethernet frame came from/should be sent to
|
|
- provides a reason why this frame was forwarded to the management interface
|
|
|
|
All tagging protocols are in ``net/dsa/tag_*.c`` files and implement the
|
|
methods of the ``struct dsa_device_ops`` structure, which are detailed below.
|
|
|
|
Tagging protocols generally fall in one of three categories:
|
|
|
|
1. The switch-specific frame header is located before the Ethernet header,
|
|
shifting to the right (from the perspective of the DSA master's frame
|
|
parser) the MAC DA, MAC SA, EtherType and the entire L2 payload.
|
|
2. The switch-specific frame header is located before the EtherType, keeping
|
|
the MAC DA and MAC SA in place from the DSA master's perspective, but
|
|
shifting the 'real' EtherType and L2 payload to the right.
|
|
3. The switch-specific frame header is located at the tail of the packet,
|
|
keeping all frame headers in place and not altering the view of the packet
|
|
that the DSA master's frame parser has.
|
|
|
|
A tagging protocol may tag all packets with switch tags of the same length, or
|
|
the tag length might vary (for example packets with PTP timestamps might
|
|
require an extended switch tag, or there might be one tag length on TX and a
|
|
different one on RX). Either way, the tagging protocol driver must populate the
|
|
``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
|
|
with the length in octets of the longest switch frame header/trailer. The DSA
|
|
framework will automatically adjust the MTU of the master interface to
|
|
accommodate for this extra size in order for DSA user ports to support the
|
|
standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
|
|
``needed_tailroom`` properties are also used to request from the network stack,
|
|
on a best-effort basis, the allocation of packets with enough extra space such
|
|
that the act of pushing the switch tag on transmission of a packet does not
|
|
cause it to reallocate due to lack of memory.
|
|
|
|
Even though applications are not expected to parse DSA-specific frame headers,
|
|
the format on the wire of the tagging protocol represents an Application Binary
|
|
Interface exposed by the kernel towards user space, for decoders such as
|
|
``libpcap``. The tagging protocol driver must populate the ``proto`` member of
|
|
``struct dsa_device_ops`` with a value that uniquely describes the
|
|
characteristics of the interaction required between the switch hardware and the
|
|
data path driver: the offset of each bit field within the frame header and any
|
|
stateful processing required to deal with the frames (as may be required for
|
|
PTP timestamping).
|
|
|
|
From the perspective of the network stack, all switches within the same DSA
|
|
switch tree use the same tagging protocol. In case of a packet transiting a
|
|
fabric with more than one switch, the switch-specific frame header is inserted
|
|
by the first switch in the fabric that the packet was received on. This header
|
|
typically contains information regarding its type (whether it is a control
|
|
frame that must be trapped to the CPU, or a data frame to be forwarded).
|
|
Control frames should be decapsulated only by the software data path, whereas
|
|
data frames might also be autonomously forwarded towards other user ports of
|
|
other switches from the same fabric, and in this case, the outermost switch
|
|
ports must decapsulate the packet.
|
|
|
|
Note that in certain cases, it might be the case that the tagging format used
|
|
by a leaf switch (not connected directly to the CPU) to not be the same as what
|
|
the network stack sees. This can be seen with Marvell switch trees, where the
|
|
CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA)
|
|
format, but the DSA links are configured to use the shorter (without Ethertype)
|
|
DSA frame header, in order to reduce the autonomous packet forwarding overhead.
|
|
It still remains the case that, if the DSA switch tree is configured for the
|
|
EDSA tagging protocol, the operating system sees EDSA-tagged packets from the
|
|
leaf switches that tagged them with the shorter DSA header. This can be done
|
|
because the Marvell switch connected directly to the CPU is configured to
|
|
perform tag translation between DSA and EDSA (which is simply the operation of
|
|
adding or removing the ``ETH_P_EDSA`` EtherType and some padding octets).
|
|
|
|
It is possible to construct cascaded setups of DSA switches even if their
|
|
tagging protocols are not compatible with one another. In this case, there are
|
|
no DSA links in this fabric, and each switch constitutes a disjoint DSA switch
|
|
tree. The DSA links are viewed as simply a pair of a DSA master (the out-facing
|
|
port of the upstream DSA switch) and a CPU port (the in-facing port of the
|
|
downstream DSA switch).
|
|
|
|
The tagging protocol of the attached DSA switch tree can be viewed through the
|
|
``dsa/tagging`` sysfs attribute of the DSA master::
|
|
|
|
cat /sys/class/net/eth0/dsa/tagging
|
|
|
|
If the hardware and driver are capable, the tagging protocol of the DSA switch
|
|
tree can be changed at runtime. This is done by writing the new tagging
|
|
protocol name to the same sysfs device attribute as above (the DSA master and
|
|
all attached switch ports must be down while doing this).
|
|
|
|
It is desirable that all tagging protocols are testable with the ``dsa_loop``
|
|
mockup driver, which can be attached to any network interface. The goal is that
|
|
any network interface should be capable of transmitting the same packet in the
|
|
same way, and the tagger should decode the same received packet in the same way
|
|
regardless of the driver used for the switch control path, and the driver used
|
|
for the DSA master.
|
|
|
|
The transmission of a packet goes through the tagger's ``xmit`` function.
|
|
The passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
|
|
``skb_mac_header(skb)``, i.e. at the destination MAC address, and the passed
|
|
``struct net_device *dev`` represents the virtual DSA user network interface
|
|
whose hardware counterpart the packet must be steered to (i.e. ``swp0``).
|
|
The job of this method is to prepare the skb in a way that the switch will
|
|
understand what egress port the packet is for (and not deliver it towards other
|
|
ports). Typically this is fulfilled by pushing a frame header. Checking for
|
|
insufficient size in the skb headroom or tailroom is unnecessary provided that
|
|
the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
|
|
properly, because DSA ensures there is enough space before calling this method.
|
|
|
|
The reception of a packet goes through the tagger's ``rcv`` function. The
|
|
passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
|
|
``skb_mac_header(skb) + ETH_ALEN`` octets, i.e. to where the first octet after
|
|
the EtherType would have been, were this frame not tagged. The role of this
|
|
method is to consume the frame header, adjust ``skb->data`` to really point at
|
|
the first octet after the EtherType, and to change ``skb->dev`` to point to the
|
|
virtual DSA user network interface corresponding to the physical front-facing
|
|
switch port that the packet was received on.
|
|
|
|
Since tagging protocols in category 1 and 2 break software (and most often also
|
|
hardware) packet dissection on the DSA master, features such as RPS (Receive
|
|
Packet Steering) on the DSA master would be broken. The DSA framework deals
|
|
with this by hooking into the flow dissector and shifting the offset at which
|
|
the IP header is to be found in the tagged frame as seen by the DSA master.
|
|
This behavior is automatic based on the ``overhead`` value of the tagging
|
|
protocol. If not all packets are of equal size, the tagger can implement the
|
|
``flow_dissect`` method of the ``struct dsa_device_ops`` and override this
|
|
default behavior by specifying the correct offset incurred by each individual
|
|
RX packet. Tail taggers do not cause issues to the flow dissector.
|
|
|
|
Due to various reasons (most common being category 1 taggers being associated
|
|
with DSA-unaware masters, mangling what the master perceives as MAC DA), the
|
|
tagging protocol may require the DSA master to operate in promiscuous mode, to
|
|
receive all frames regardless of the value of the MAC DA. This can be done by
|
|
setting the ``promisc_on_master`` property of the ``struct dsa_device_ops``.
|
|
Note that this assumes a DSA-unaware master driver, which is the norm.
|
|
|
|
Master network devices
|
|
----------------------
|
|
|
|
Master network devices are regular, unmodified Linux network device drivers for
|
|
the CPU/management Ethernet interface. Such a driver might occasionally need to
|
|
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
|
|
but the DSA subsystem has been proven to work with industry standard drivers:
|
|
``e1000e,`` ``mv643xx_eth`` etc. without having to introduce modifications to these
|
|
drivers. Such network devices are also often referred to as conduit network
|
|
devices since they act as a pipe between the host processor and the hardware
|
|
Ethernet switch.
|
|
|
|
Networking stack hooks
|
|
----------------------
|
|
|
|
When a master netdev is used with DSA, a small hook is placed in the
|
|
networking stack is in order to have the DSA subsystem process the Ethernet
|
|
switch specific tagging protocol. DSA accomplishes this by registering a
|
|
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
|
|
networking stack, this is also known as a ``ptype`` or ``packet_type``. A typical
|
|
Ethernet Frame receive sequence looks like this:
|
|
|
|
Master network device (e.g.: e1000e):
|
|
|
|
1. Receive interrupt fires:
|
|
|
|
- receive function is invoked
|
|
- basic packet processing is done: getting length, status etc.
|
|
- packet is prepared to be processed by the Ethernet layer by calling
|
|
``eth_type_trans``
|
|
|
|
2. net/ethernet/eth.c::
|
|
|
|
eth_type_trans(skb, dev)
|
|
if (dev->dsa_ptr != NULL)
|
|
-> skb->protocol = ETH_P_XDSA
|
|
|
|
3. drivers/net/ethernet/\*::
|
|
|
|
netif_receive_skb(skb)
|
|
-> iterate over registered packet_type
|
|
-> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()
|
|
|
|
4. net/dsa/dsa.c::
|
|
|
|
-> dsa_switch_rcv()
|
|
-> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'
|
|
|
|
5. net/dsa/tag_*.c:
|
|
|
|
- inspect and strip switch tag protocol to determine originating port
|
|
- locate per-port network device
|
|
- invoke ``eth_type_trans()`` with the DSA slave network device
|
|
- invoked ``netif_receive_skb()``
|
|
|
|
Past this point, the DSA slave network devices get delivered regular Ethernet
|
|
frames that can be processed by the networking stack.
|
|
|
|
Slave network devices
|
|
---------------------
|
|
|
|
Slave network devices created by DSA are stacked on top of their master network
|
|
device, each of these network interfaces will be responsible for being a
|
|
controlling and data-flowing end-point for each front-panel port of the switch.
|
|
These interfaces are specialized in order to:
|
|
|
|
- insert/remove the switch tag protocol (if it exists) when sending traffic
|
|
to/from specific switch ports
|
|
- query the switch for ethtool operations: statistics, link state,
|
|
Wake-on-LAN, register dumps...
|
|
- external/internal PHY management: link, auto-negotiation etc.
|
|
|
|
These slave network devices have custom net_device_ops and ethtool_ops function
|
|
pointers which allow DSA to introduce a level of layering between the networking
|
|
stack/ethtool, and the switch driver implementation.
|
|
|
|
Upon frame transmission from these slave network devices, DSA will look up which
|
|
switch tagging protocol is currently registered with these network devices, and
|
|
invoke a specific transmit routine which takes care of adding the relevant
|
|
switch tag in the Ethernet frames.
|
|
|
|
These frames are then queued for transmission using the master network device
|
|
``ndo_start_xmit()`` function, since they contain the appropriate switch tag, the
|
|
Ethernet switch will be able to process these incoming frames from the
|
|
management interface and delivers these frames to the physical switch port.
|
|
|
|
Graphical representation
|
|
------------------------
|
|
|
|
Summarized, this is basically how DSA looks like from a network device
|
|
perspective::
|
|
|
|
Unaware application
|
|
opens and binds socket
|
|
| ^
|
|
| |
|
|
+-----------v--|--------------------+
|
|
|+------+ +------+ +------+ +------+|
|
|
|| swp0 | | swp1 | | swp2 | | swp3 ||
|
|
|+------+-+------+-+------+-+------+|
|
|
| DSA switch driver |
|
|
+-----------------------------------+
|
|
| ^
|
|
Tag added by | | Tag consumed by
|
|
switch driver | | switch driver
|
|
v |
|
|
+-----------------------------------+
|
|
| Unmodified host interface driver | Software
|
|
--------+-----------------------------------+------------
|
|
| Host interface (eth0) | Hardware
|
|
+-----------------------------------+
|
|
| ^
|
|
Tag consumed by | | Tag added by
|
|
switch hardware | | switch hardware
|
|
v |
|
|
+-----------------------------------+
|
|
| Switch |
|
|
|+------+ +------+ +------+ +------+|
|
|
|| swp0 | | swp1 | | swp2 | | swp3 ||
|
|
++------+-+------+-+------+-+------++
|
|
|
|
Slave MDIO bus
|
|
--------------
|
|
|
|
In order to be able to read to/from a switch PHY built into it, DSA creates a
|
|
slave MDIO bus which allows a specific switch driver to divert and intercept
|
|
MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
|
|
switches, these functions would utilize direct or indirect PHY addressing mode
|
|
to return standard MII registers from the switch builtin PHYs, allowing the PHY
|
|
library and/or to return link status, link partner pages, auto-negotiation
|
|
results etc..
|
|
|
|
For Ethernet switches which have both external and internal MDIO busses, the
|
|
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
|
|
internal or external MDIO devices this switch might be connected to: internal
|
|
PHYs, external PHYs, or even external switches.
|
|
|
|
Data structures
|
|
---------------
|
|
|
|
DSA data structures are defined in ``include/net/dsa.h`` as well as
|
|
``net/dsa/dsa_priv.h``:
|
|
|
|
- ``dsa_chip_data``: platform data configuration for a given switch device,
|
|
this structure describes a switch device's parent device, its address, as
|
|
well as various properties of its ports: names/labels, and finally a routing
|
|
table indication (when cascading switches)
|
|
|
|
- ``dsa_platform_data``: platform device configuration data which can reference
|
|
a collection of dsa_chip_data structure if multiples switches are cascaded,
|
|
the master network device this switch tree is attached to needs to be
|
|
referenced
|
|
|
|
- ``dsa_switch_tree``: structure assigned to the master network device under
|
|
``dsa_ptr``, this structure references a dsa_platform_data structure as well as
|
|
the tagging protocol supported by the switch tree, and which receive/transmit
|
|
function hooks should be invoked, information about the directly attached
|
|
switch is also provided: CPU port. Finally, a collection of dsa_switch are
|
|
referenced to address individual switches in the tree.
|
|
|
|
- ``dsa_switch``: structure describing a switch device in the tree, referencing
|
|
a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
|
|
device, and a reference to the backing``dsa_switch_ops``
|
|
|
|
- ``dsa_switch_ops``: structure referencing function pointers, see below for a
|
|
full description.
|
|
|
|
Design limitations
|
|
==================
|
|
|
|
Lack of CPU/DSA network devices
|
|
-------------------------------
|
|
|
|
DSA does not currently create slave network devices for the CPU or DSA ports, as
|
|
described before. This might be an issue in the following cases:
|
|
|
|
- inability to fetch switch CPU port statistics counters using ethtool, which
|
|
can make it harder to debug MDIO switch connected using xMII interfaces
|
|
|
|
- inability to configure the CPU port link parameters based on the Ethernet
|
|
controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/
|
|
|
|
- inability to configure specific VLAN IDs / trunking VLANs between switches
|
|
when using a cascaded setup
|
|
|
|
Common pitfalls using DSA setups
|
|
--------------------------------
|
|
|
|
Once a master network device is configured to use DSA (dev->dsa_ptr becomes
|
|
non-NULL), and the switch behind it expects a tagging protocol, this network
|
|
interface can only exclusively be used as a conduit interface. Sending packets
|
|
directly through this interface (e.g.: opening a socket using this interface)
|
|
will not make us go through the switch tagging protocol transmit function, so
|
|
the Ethernet switch on the other end, expecting a tag will typically drop this
|
|
frame.
|
|
|
|
Interactions with other subsystems
|
|
==================================
|
|
|
|
DSA currently leverages the following subsystems:
|
|
|
|
- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
|
|
- Switchdev:``net/switchdev/*``
|
|
- Device Tree for various of_* functions
|
|
- Devlink: ``net/core/devlink.c``
|
|
|
|
MDIO/PHY library
|
|
----------------
|
|
|
|
Slave network devices exposed by DSA may or may not be interfacing with PHY
|
|
devices (``struct phy_device`` as defined in ``include/linux/phy.h)``, but the DSA
|
|
subsystem deals with all possible combinations:
|
|
|
|
- internal PHY devices, built into the Ethernet switch hardware
|
|
- external PHY devices, connected via an internal or external MDIO bus
|
|
- internal PHY devices, connected via an internal MDIO bus
|
|
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
|
|
fixed PHYs
|
|
|
|
The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
|
|
logic basically looks like this:
|
|
|
|
- if Device Tree is used, the PHY device is looked up using the standard
|
|
"phy-handle" property, if found, this PHY device is created and registered
|
|
using ``of_phy_connect()``
|
|
|
|
- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
|
|
the definition of a non-MDIO managed PHY as defined in
|
|
``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
|
|
and connected transparently using the special fixed MDIO bus driver
|
|
|
|
- finally, if the PHY is built into the switch, as is very common with
|
|
standalone switch packages, the PHY is probed using the slave MII bus created
|
|
by DSA
|
|
|
|
|
|
SWITCHDEV
|
|
---------
|
|
|
|
DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
|
|
more specifically with its VLAN filtering portion when configuring VLANs on top
|
|
of per-port slave network devices. As of today, the only SWITCHDEV objects
|
|
supported by DSA are the FDB and VLAN objects.
|
|
|
|
Devlink
|
|
-------
|
|
|
|
DSA registers one devlink device per physical switch in the fabric.
|
|
For each devlink device, every physical port (i.e. user ports, CPU ports, DSA
|
|
links or unused ports) is exposed as a devlink port.
|
|
|
|
DSA drivers can make use of the following devlink features:
|
|
|
|
- Regions: debugging feature which allows user space to dump driver-defined
|
|
areas of hardware information in a low-level, binary format. Both global
|
|
regions as well as per-port regions are supported. It is possible to export
|
|
devlink regions even for pieces of data that are already exposed in some way
|
|
to the standard iproute2 user space programs (ip-link, bridge), like address
|
|
tables and VLAN tables. For example, this might be useful if the tables
|
|
contain additional hardware-specific details which are not visible through
|
|
the iproute2 abstraction, or it might be useful to inspect these tables on
|
|
the non-user ports too, which are invisible to iproute2 because no network
|
|
interface is registered for them.
|
|
- Params: a feature which enables user to configure certain low-level tunable
|
|
knobs pertaining to the device. Drivers may implement applicable generic
|
|
devlink params, or may add new device-specific devlink params.
|
|
- Resources: a monitoring feature which enables users to see the degree of
|
|
utilization of certain hardware tables in the device, such as FDB, VLAN, etc.
|
|
- Shared buffers: a QoS feature for adjusting and partitioning memory and frame
|
|
reservations per port and per traffic class, in the ingress and egress
|
|
directions, such that low-priority bulk traffic does not impede the
|
|
processing of high-priority critical traffic.
|
|
|
|
For more details, consult ``Documentation/networking/devlink/``.
|
|
|
|
Device Tree
|
|
-----------
|
|
|
|
DSA features a standardized binding which is documented in
|
|
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
|
|
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
|
|
per-port PHY specific details: interface connection, MDIO bus location etc..
|
|
|
|
Driver development
|
|
==================
|
|
|
|
DSA switch drivers need to implement a dsa_switch_ops structure which will
|
|
contain the various members described below.
|
|
|
|
``register_switch_driver()`` registers this dsa_switch_ops in its internal list
|
|
of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.
|
|
|
|
Unless requested differently by setting the priv_size member accordingly, DSA
|
|
does not allocate any driver private context space.
|
|
|
|
Switch configuration
|
|
--------------------
|
|
|
|
- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
|
|
should be a valid value from the ``dsa_tag_protocol`` enum
|
|
|
|
- ``probe``: probe routine which will be invoked by the DSA platform device upon
|
|
registration to test for the presence/absence of a switch device. For MDIO
|
|
devices, it is recommended to issue a read towards internal registers using
|
|
the switch pseudo-PHY and return whether this is a supported device. For other
|
|
buses, return a non-NULL string
|
|
|
|
- ``setup``: setup function for the switch, this function is responsible for setting
|
|
up the ``dsa_switch_ops`` private structure with all it needs: register maps,
|
|
interrupts, mutexes, locks etc.. This function is also expected to properly
|
|
configure the switch to separate all network interfaces from each other, that
|
|
is, they should be isolated by the switch hardware itself, typically by creating
|
|
a Port-based VLAN ID for each port and allowing only the CPU port and the
|
|
specific port to be in the forwarding vector. Ports that are unused by the
|
|
platform should be disabled. Past this function, the switch is expected to be
|
|
fully configured and ready to serve any kind of request. It is recommended
|
|
to issue a software reset of the switch during this setup function in order to
|
|
avoid relying on what a previous software agent such as a bootloader/firmware
|
|
may have previously configured.
|
|
|
|
PHY devices and link management
|
|
-------------------------------
|
|
|
|
- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
|
|
if the PHY library PHY driver needs to know about information it cannot obtain
|
|
on its own (e.g.: coming from switch memory mapped registers), this function
|
|
should return a 32-bits bitmask of "flags", that is private between the switch
|
|
driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.
|
|
|
|
- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
|
|
the switch port MDIO registers. If unavailable, return 0xffff for each read.
|
|
For builtin switch Ethernet PHYs, this function should allow reading the link
|
|
status, auto-negotiation results, link partner pages etc..
|
|
|
|
- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
|
|
to the switch port MDIO registers. If unavailable return a negative error
|
|
code.
|
|
|
|
- ``adjust_link``: Function invoked by the PHY library when a slave network device
|
|
is attached to a PHY device. This function is responsible for appropriately
|
|
configuring the switch port link parameters: speed, duplex, pause based on
|
|
what the ``phy_device`` is providing.
|
|
|
|
- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
|
|
the fixed PHY driver asking the switch driver for link parameters that could
|
|
not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
|
|
This is particularly useful for specific kinds of hardware such as QSGMII,
|
|
MoCA or other kinds of non-MDIO managed PHYs where out of band link
|
|
information is obtained
|
|
|
|
Ethtool operations
|
|
------------------
|
|
|
|
- ``get_strings``: ethtool function used to query the driver's strings, will
|
|
typically return statistics strings, private flags strings etc.
|
|
|
|
- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
|
|
return their values. DSA overlays slave network devices general statistics:
|
|
RX/TX counters from the network device, with switch driver specific statistics
|
|
per port
|
|
|
|
- ``get_sset_count``: ethtool function used to query the number of statistics items
|
|
|
|
- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
|
|
function may, for certain implementations also query the master network device
|
|
Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
|
|
|
|
- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
|
|
direct counterpart to set_wol with similar restrictions
|
|
|
|
- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
|
|
Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
|
|
PHY level if relevant. This function should enable EEE at the switch port MAC
|
|
controller and data-processing logic
|
|
|
|
- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
|
|
this function should return the EEE state of the switch port MAC controller
|
|
and data-processing logic as well as query the PHY for its currently configured
|
|
EEE settings
|
|
|
|
- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
|
|
length/size in bytes
|
|
|
|
- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents
|
|
|
|
- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM
|
|
|
|
- ``get_regs_len``: ethtool function returning the register length for a given
|
|
switch
|
|
|
|
- ``get_regs``: ethtool function returning the Ethernet switch internal register
|
|
contents. This function might require user-land code in ethtool to
|
|
pretty-print register values and registers
|
|
|
|
Power management
|
|
----------------
|
|
|
|
- ``suspend``: function invoked by the DSA platform device when the system goes to
|
|
suspend, should quiesce all Ethernet switch activities, but keep ports
|
|
participating in Wake-on-LAN active as well as additional wake-up logic if
|
|
supported
|
|
|
|
- ``resume``: function invoked by the DSA platform device when the system resumes,
|
|
should resume all Ethernet switch activities and re-configure the switch to be
|
|
in a fully active state
|
|
|
|
- ``port_enable``: function invoked by the DSA slave network device ndo_open
|
|
function when a port is administratively brought up, this function should be
|
|
fully enabling a given switch port. DSA takes care of marking the port with
|
|
``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
|
|
was not, and propagating these changes down to the hardware
|
|
|
|
- ``port_disable``: function invoked by the DSA slave network device ndo_close
|
|
function when a port is administratively brought down, this function should be
|
|
fully disabling a given switch port. DSA takes care of marking the port with
|
|
``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
|
|
disabled while being a bridge member
|
|
|
|
Bridge layer
|
|
------------
|
|
|
|
- ``port_bridge_join``: bridge layer function invoked when a given switch port is
|
|
added to a bridge, this function should be doing the necessary at the switch
|
|
level to permit the joining port from being added to the relevant logical
|
|
domain for it to ingress/egress traffic with other members of the bridge.
|
|
|
|
- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
|
|
removed from a bridge, this function should be doing the necessary at the
|
|
switch level to deny the leaving port from ingress/egress traffic from the
|
|
remaining bridge members. When the port leaves the bridge, it should be aged
|
|
out at the switch hardware for the switch to (re) learn MAC addresses behind
|
|
this port.
|
|
|
|
- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
|
|
state is computed by the bridge layer and should be propagated to switch
|
|
hardware to forward/block/learn traffic. The switch driver is responsible for
|
|
computing a STP state change based on current and asked parameters and perform
|
|
the relevant ageing based on the intersection results
|
|
|
|
- ``port_bridge_flags``: bridge layer function invoked when a port must
|
|
configure its settings for e.g. flooding of unknown traffic or source address
|
|
learning. The switch driver is responsible for initial setup of the
|
|
standalone ports with address learning disabled and egress flooding of all
|
|
types of traffic, then the DSA core notifies of any change to the bridge port
|
|
flags when the port joins and leaves a bridge. DSA does not currently manage
|
|
the bridge port flags for the CPU port. The assumption is that address
|
|
learning should be statically enabled (if supported by the hardware) on the
|
|
CPU port, and flooding towards the CPU port should also be enabled, due to a
|
|
lack of an explicit address filtering mechanism in the DSA core.
|
|
|
|
- ``port_bridge_tx_fwd_offload``: bridge layer function invoked after
|
|
``port_bridge_join`` when a driver sets ``ds->num_fwd_offloading_bridges`` to
|
|
a non-zero value. Returning success in this function activates the TX
|
|
forwarding offload bridge feature for this port, which enables the tagging
|
|
protocol driver to inject data plane packets towards the bridging domain that
|
|
the port is a part of. Data plane packets are subject to FDB lookup, hardware
|
|
learning on the CPU port, and do not override the port STP state.
|
|
Additionally, replication of data plane packets (multicast, flooding) is
|
|
handled in hardware and the bridge driver will transmit a single skb for each
|
|
packet that needs replication. The method is provided as a configuration
|
|
point for drivers that need to configure the hardware for enabling this
|
|
feature.
|
|
|
|
- ``port_bridge_tx_fwd_unoffload``: bridge layer function invoken when a driver
|
|
leaves a bridge port which had the TX forwarding offload feature enabled.
|
|
|
|
Bridge VLAN filtering
|
|
---------------------
|
|
|
|
- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
|
|
configured for turning on or off VLAN filtering. If nothing specific needs to
|
|
be done at the hardware level, this callback does not need to be implemented.
|
|
When VLAN filtering is turned on, the hardware must be programmed with
|
|
rejecting 802.1Q frames which have VLAN IDs outside of the programmed allowed
|
|
VLAN ID map/rules. If there is no PVID programmed into the switch port,
|
|
untagged frames must be rejected as well. When turned off the switch must
|
|
accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
|
|
allowed.
|
|
|
|
- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
|
|
(tagged or untagged) for the given switch port. If the operation is not
|
|
supported by the hardware, this function should return ``-EOPNOTSUPP`` to
|
|
inform the bridge code to fallback to a software implementation.
|
|
|
|
- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
|
|
given switch port
|
|
|
|
- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
|
|
function that the driver has to call for each VLAN the given port is a member
|
|
of. A switchdev object is used to carry the VID and bridge flags.
|
|
|
|
- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
|
|
Forwarding Database entry, the switch hardware should be programmed with the
|
|
specified address in the specified VLAN Id in the forwarding database
|
|
associated with this VLAN ID. If the operation is not supported, this
|
|
function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to
|
|
a software implementation.
|
|
|
|
.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
|
|
of DSA, would be its port-based VLAN, used by the associated bridge device.
|
|
|
|
- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
|
|
Forwarding Database entry, the switch hardware should be programmed to delete
|
|
the specified MAC address from the specified VLAN ID if it was mapped into
|
|
this port forwarding database
|
|
|
|
- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
|
|
function that the driver has to call for each MAC address known to be behind
|
|
the given port. A switchdev object is used to carry the VID and FDB info.
|
|
|
|
- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
|
|
a multicast database entry. If the operation is not supported, this function
|
|
should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to a
|
|
software implementation. The switch hardware should be programmed with the
|
|
specified address in the specified VLAN ID in the forwarding database
|
|
associated with this VLAN ID.
|
|
|
|
.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
|
|
of DSA, would be its port-based VLAN, used by the associated bridge device.
|
|
|
|
- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
|
|
multicast database entry, the switch hardware should be programmed to delete
|
|
the specified MAC address from the specified VLAN ID if it was mapped into
|
|
this port forwarding database.
|
|
|
|
- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
|
|
function that the driver has to call for each MAC address known to be behind
|
|
the given port. A switchdev object is used to carry the VID and MDB info.
|
|
|
|
Link aggregation
|
|
----------------
|
|
|
|
Link aggregation is implemented in the Linux networking stack by the bonding
|
|
and team drivers, which are modeled as virtual, stackable network interfaces.
|
|
DSA is capable of offloading a link aggregation group (LAG) to hardware that
|
|
supports the feature, and supports bridging between physical ports and LAGs,
|
|
as well as between LAGs. A bonding/team interface which holds multiple physical
|
|
ports constitutes a logical port, although DSA has no explicit concept of a
|
|
logical port at the moment. Due to this, events where a LAG joins/leaves a
|
|
bridge are treated as if all individual physical ports that are members of that
|
|
LAG join/leave the bridge. Switchdev port attributes (VLAN filtering, STP
|
|
state, etc) and objects (VLANs, MDB entries) offloaded to a LAG as bridge port
|
|
are treated similarly: DSA offloads the same switchdev object / port attribute
|
|
on all members of the LAG. Static bridge FDB entries on a LAG are not yet
|
|
supported, since the DSA driver API does not have the concept of a logical port
|
|
ID.
|
|
|
|
- ``port_lag_join``: function invoked when a given switch port is added to a
|
|
LAG. The driver may return ``-EOPNOTSUPP``, and in this case, DSA will fall
|
|
back to a software implementation where all traffic from this port is sent to
|
|
the CPU.
|
|
- ``port_lag_leave``: function invoked when a given switch port leaves a LAG
|
|
and returns to operation as a standalone port.
|
|
- ``port_lag_change``: function invoked when the link state of any member of
|
|
the LAG changes, and the hashing function needs rebalancing to only make use
|
|
of the subset of physical LAG member ports that are up.
|
|
|
|
Drivers that benefit from having an ID associated with each offloaded LAG
|
|
can optionally populate ``ds->num_lag_ids`` from the ``dsa_switch_ops::setup``
|
|
method. The LAG ID associated with a bonding/team interface can then be
|
|
retrieved by a DSA switch driver using the ``dsa_lag_id`` function.
|
|
|
|
IEC 62439-2 (MRP)
|
|
-----------------
|
|
|
|
The Media Redundancy Protocol is a topology management protocol optimized for
|
|
fast fault recovery time for ring networks, which has some components
|
|
implemented as a function of the bridge driver. MRP uses management PDUs
|
|
(Test, Topology, LinkDown/Up, Option) sent at a multicast destination MAC
|
|
address range of 01:15:4e:00:00:0x and with an EtherType of 0x88e3.
|
|
Depending on the node's role in the ring (MRM: Media Redundancy Manager,
|
|
MRC: Media Redundancy Client, MRA: Media Redundancy Automanager), certain MRP
|
|
PDUs might need to be terminated locally and others might need to be forwarded.
|
|
An MRM might also benefit from offloading to hardware the creation and
|
|
transmission of certain MRP PDUs (Test).
|
|
|
|
Normally an MRP instance can be created on top of any network interface,
|
|
however in the case of a device with an offloaded data path such as DSA, it is
|
|
necessary for the hardware, even if it is not MRP-aware, to be able to extract
|
|
the MRP PDUs from the fabric before the driver can proceed with the software
|
|
implementation. DSA today has no driver which is MRP-aware, therefore it only
|
|
listens for the bare minimum switchdev objects required for the software assist
|
|
to work properly. The operations are detailed below.
|
|
|
|
- ``port_mrp_add`` and ``port_mrp_del``: notifies driver when an MRP instance
|
|
with a certain ring ID, priority, primary port and secondary port is
|
|
created/deleted.
|
|
- ``port_mrp_add_ring_role`` and ``port_mrp_del_ring_role``: function invoked
|
|
when an MRP instance changes ring roles between MRM or MRC. This affects
|
|
which MRP PDUs should be trapped to software and which should be autonomously
|
|
forwarded.
|
|
|
|
IEC 62439-3 (HSR/PRP)
|
|
---------------------
|
|
|
|
The Parallel Redundancy Protocol (PRP) is a network redundancy protocol which
|
|
works by duplicating and sequence numbering packets through two independent L2
|
|
networks (which are unaware of the PRP tail tags carried in the packets), and
|
|
eliminating the duplicates at the receiver. The High-availability Seamless
|
|
Redundancy (HSR) protocol is similar in concept, except all nodes that carry
|
|
the redundant traffic are aware of the fact that it is HSR-tagged (because HSR
|
|
uses a header with an EtherType of 0x892f) and are physically connected in a
|
|
ring topology. Both HSR and PRP use supervision frames for monitoring the
|
|
health of the network and for discovery of other nodes.
|
|
|
|
In Linux, both HSR and PRP are implemented in the hsr driver, which
|
|
instantiates a virtual, stackable network interface with two member ports.
|
|
The driver only implements the basic roles of DANH (Doubly Attached Node
|
|
implementing HSR) and DANP (Doubly Attached Node implementing PRP); the roles
|
|
of RedBox and QuadBox are not implemented (therefore, bridging a hsr network
|
|
interface with a physical switch port does not produce the expected result).
|
|
|
|
A driver which is able of offloading certain functions of a DANP or DANH should
|
|
declare the corresponding netdev features as indicated by the documentation at
|
|
``Documentation/networking/netdev-features.rst``. Additionally, the following
|
|
methods must be implemented:
|
|
|
|
- ``port_hsr_join``: function invoked when a given switch port is added to a
|
|
DANP/DANH. The driver may return ``-EOPNOTSUPP`` and in this case, DSA will
|
|
fall back to a software implementation where all traffic from this port is
|
|
sent to the CPU.
|
|
- ``port_hsr_leave``: function invoked when a given switch port leaves a
|
|
DANP/DANH and returns to normal operation as a standalone port.
|
|
|
|
TODO
|
|
====
|
|
|
|
Making SWITCHDEV and DSA converge towards an unified codebase
|
|
-------------------------------------------------------------
|
|
|
|
SWITCHDEV properly takes care of abstracting the networking stack with offload
|
|
capable hardware, but does not enforce a strict switch device driver model. On
|
|
the other DSA enforces a fairly strict device driver model, and deals with most
|
|
of the switch specific. At some point we should envision a merger between these
|
|
two subsystems and get the best of both worlds.
|
|
|
|
Other hanging fruits
|
|
--------------------
|
|
|
|
- allowing more than one CPU/management interface:
|
|
http://comments.gmane.org/gmane.linux.network/365657
|