summaryrefslogtreecommitdiffstats
path: root/kernel/Documentation/networking/switchdev.txt
diff options
context:
space:
mode:
Diffstat (limited to 'kernel/Documentation/networking/switchdev.txt')
-rw-r--r--kernel/Documentation/networking/switchdev.txt453
1 files changed, 394 insertions, 59 deletions
diff --git a/kernel/Documentation/networking/switchdev.txt b/kernel/Documentation/networking/switchdev.txt
index f981a9295..91994134e 100644
--- a/kernel/Documentation/networking/switchdev.txt
+++ b/kernel/Documentation/networking/switchdev.txt
@@ -1,59 +1,394 @@
-Switch (and switch-ish) device drivers HOWTO
-===========================
-
-Please note that the word "switch" is here used in very generic meaning.
-This include devices supporting L2/L3 but also various flow offloading chips,
-including switches embedded into SR-IOV NICs.
-
-Lets describe a topology a bit. Imagine the following example:
-
- +----------------------------+ +---------------+
- | SOME switch chip | | CPU |
- +----------------------------+ +---------------+
- port1 port2 port3 port4 MNGMNT | PCI-E |
- | | | | | +---------------+
- PHY PHY | | | | NIC0 NIC1
- | | | | | |
- | | +- PCI-E -+ | |
- | +------- MII -------+ |
- +------------- MII ------------+
-
-In this example, there are two independent lines between the switch silicon
-and CPU. NIC0 and NIC1 drivers are not aware of a switch presence. They are
-separate from the switch driver. SOME switch chip is by managed by a driver
-via PCI-E device MNGMNT. Note that MNGMNT device, NIC0 and NIC1 may be
-connected to some other type of bus.
-
-Now, for the previous example show the representation in kernel:
-
- +----------------------------+ +---------------+
- | SOME switch chip | | CPU |
- +----------------------------+ +---------------+
- sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT | PCI-E |
- | | | | | +---------------+
- PHY PHY | | | | eth0 eth1
- | | | | | |
- | | +- PCI-E -+ | |
- | +------- MII -------+ |
- +------------- MII ------------+
-
-Lets call the example switch driver for SOME switch chip "SOMEswitch". This
-driver takes care of PCI-E device MNGMNT. There is a netdevice instance sw0pX
-created for each port of a switch. These netdevices are instances
-of "SOMEswitch" driver. sw0pX netdevices serve as a "representation"
-of the switch chip. eth0 and eth1 are instances of some other existing driver.
-
-The only difference of the switch-port netdevice from the ordinary netdevice
-is that is implements couple more NDOs:
-
- ndo_switch_parent_id_get - This returns the same ID for two port netdevices
- of the same physical switch chip. This is
- mandatory to be implemented by all switch drivers
- and serves the caller for recognition of a port
- netdevice.
- ndo_switch_parent_* - Functions that serve for a manipulation of the switch
- chip itself (it can be though of as a "parent" of the
- port, therefore the name). They are not port-specific.
- Caller might use arbitrary port netdevice of the same
- switch and it will make no difference.
- ndo_switch_port_* - Functions that serve for a port-specific manipulation.
+Ethernet switch device driver model (switchdev)
+===============================================
+Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>
+
+
+The Ethernet switch device driver model (switchdev) is an in-kernel driver
+model for switch devices which offload the forwarding (data) plane from the
+kernel.
+
+Figure 1 is a block diagram showing the components of the switchdev model for
+an example setup using a data-center-class switch ASIC chip. Other setups
+with SR-IOV or soft switches, such as OVS, are possible.
+
+
+                             User-space tools                                 
+                                                                              
+       user space                   |                                         
+      +-------------------------------------------------------------------+   
+       kernel                       | Netlink                                 
+                                    |                                         
+                     +--------------+-------------------------------+         
+                     |         Network stack                        |         
+                     |           (Linux)                            |         
+                     |                                              |         
+                     +----------------------------------------------+         
+                                                                              
+ sw1p2 sw1p4 sw1p6
+                      sw1p1  + sw1p3 +  sw1p5 +         eth1             
+                        +    |    +    |    +    |            +               
+                        |    |    |    |    |    |            |               
+                     +--+----+----+----+-+--+----+---+  +-----+-----+         
+                     |         Switch driver         |  |    mgmt   |         
+                     |        (this document)        |  |   driver  |         
+                     |                               |  |           |         
+                     +--------------+----------------+  +-----------+         
+                                    |                                         
+       kernel                       | HW bus (eg PCI)                         
+      +-------------------------------------------------------------------+   
+       hardware                     |                                         
+                     +--------------+---+------------+                        
+                     |         Switch device (sw1)   |                        
+                     |  +----+                       +--------+               
+                     |  |    v offloaded data path   | mgmt port              
+                     |  |    |                       |                        
+                     +--|----|----+----+----+----+---+                        
+                        |    |    |    |    |    |                            
+                        +    +    +    +    +    +                            
+                       p1   p2   p3   p4   p5   p6
+                                       
+                             front-panel ports                                
+                                                                              
+
+ Fig 1.
+
+
+Include Files
+-------------
+
+#include <linux/netdevice.h>
+#include <net/switchdev.h>
+
+
+Configuration
+-------------
+
+Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model
+support is built for driver.
+
+
+Switch Ports
+------------
+
+On switchdev driver initialization, the driver will allocate and register a
+struct net_device (using register_netdev()) for each enumerated physical switch
+port, called the port netdev. A port netdev is the software representation of
+the physical port and provides a conduit for control traffic to/from the
+controller (the kernel) and the network, as well as an anchor point for higher
+level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
+standard netdev tools (iproute2, ethtool, etc), the port netdev can also
+provide to the user access to the physical properties of the switch port such
+as PHY link state and I/O statistics.
+
+There is (currently) no higher-level kernel object for the switch beyond the
+port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.
+
+A switch management port is outside the scope of the switchdev driver model.
+Typically, the management port is not participating in offloaded data plane and
+is loaded with a different driver, such as a NIC driver, on the management port
+device.
+
+Port Netdev Naming
+^^^^^^^^^^^^^^^^^^
+
+Udev rules should be used for port netdev naming, using some unique attribute
+of the port as a key, for example the port MAC address or the port PHYS name.
+Hard-coding of kernel netdev names within the driver is discouraged; let the
+kernel pick the default netdev name, and let udev set the final name based on a
+port attribute.
+
+Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
+useful for dynamically-named ports where the device names its ports based on
+external configuration. For example, if a physical 40G port is split logically
+into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
+name for each port using port PHYS name. The udev rule would be:
+
+SUBSYSTEM=="net", ACTION=="add", DRIVER="<driver>", ATTR{phys_port_name}!="", \
+ NAME="$attr{phys_port_name}"
+
+Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
+is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
+would be sub-port 0 on port 1 on switch 1.
+
+Switch ID
+^^^^^^^^^
+
+The switchdev driver must implement the switchdev op switchdev_port_attr_get
+for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same
+physical ID for each port of a switch. The ID must be unique between switches
+on the same system. The ID does not need to be unique between switches on
+different systems.
+
+The switch ID is used to locate ports on a switch and to know if aggregated
+ports belong to the same switch.
+
+Port Features
+^^^^^^^^^^^^^
+
+NETIF_F_NETNS_LOCAL
+
+If the switchdev driver (and device) only supports offloading of the default
+network namespace (netns), the driver should set this feature flag to prevent
+the port netdev from being moved out of the default netns. A netns-aware
+driver/device would not set this flag and be responsible for partitioning
+hardware to preserve netns containment. This means hardware cannot forward
+traffic from a port in one namespace to another port in another namespace.
+
+Port Topology
+^^^^^^^^^^^^^
+
+The port netdevs representing the physical switch ports can be organized into
+higher-level switching constructs. The default construct is a standalone
+router port, used to offload L3 forwarding. Two or more ports can be bonded
+together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
+L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
+tunnels can be built on ports. These constructs are built using standard Linux
+tools such as the bridge driver, the bonding/team drivers, and netlink-based
+tools such as iproute2.
+
+The switchdev driver can know a particular port's position in the topology by
+monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
+bond will see it's upper master change. If that bond is moved into a bridge,
+the bond's upper master will change. And so on. The driver will track such
+movements to know what position a port is in in the overall topology by
+registering for netdevice events and acting on NETDEV_CHANGEUPPER.
+
+L2 Forwarding Offload
+---------------------
+
+The idea is to offload the L2 data forwarding (switching) path from the kernel
+to the switchdev device by mirroring bridge FDB entries down to the device. An
+FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
+
+To offloading L2 bridging, the switchdev driver/device should support:
+
+ - Static FDB entries installed on a bridge port
+ - Notification of learned/forgotten src mac/vlans from device
+ - STP state changes on the port
+ - VLAN flooding of multicast/broadcast and unknown unicast packets
+
+Static FDB Entries
+^^^^^^^^^^^^^^^^^^
+
+The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
+to support static FDB entries installed to the device. Static bridge FDB
+entries are installed, for example, using iproute2 bridge cmd:
+
+ bridge fdb add ADDR dev DEV [vlan VID] [self]
+
+The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
+ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
+switchdev_port_obj_xxx ops.
+
+XXX: what should be done if offloading this rule to hardware fails (for
+example, due to full capacity in hardware tables) ?
+
+Note: by default, the bridge does not filter on VLAN and only bridges untagged
+traffic. To enable VLAN support, turn on VLAN filtering:
+
+ echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
+
+Notification of Learned/Forgotten Source MAC/VLANs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The switch device will learn/forget source MAC address/VLAN on ingress packets
+and notify the switch driver of the mac/vlan/port tuples. The switch driver,
+in turn, will notify the bridge driver using the switchdev notifier call:
+
+ err = call_switchdev_notifiers(val, dev, info);
+
+Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
+forgetting, and info points to a struct switchdev_notifier_fdb_info. On
+SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
+bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
+command will label these entries "offload":
+
+ $ bridge fdb
+ 52:54:00:12:35:01 dev sw1p1 master br0 permanent
+ 00:02:00:00:02:00 dev sw1p1 master br0 offload
+ 00:02:00:00:02:00 dev sw1p1 self
+ 52:54:00:12:35:02 dev sw1p2 master br0 permanent
+ 00:02:00:00:03:00 dev sw1p2 master br0 offload
+ 00:02:00:00:03:00 dev sw1p2 self
+ 33:33:00:00:00:01 dev eth0 self permanent
+ 01:00:5e:00:00:01 dev eth0 self permanent
+ 33:33:ff:00:00:00 dev eth0 self permanent
+ 01:80:c2:00:00:0e dev eth0 self permanent
+ 33:33:00:00:00:01 dev br0 self permanent
+ 01:00:5e:00:00:01 dev br0 self permanent
+ 33:33:ff:12:35:01 dev br0 self permanent
+
+Learning on the port should be disabled on the bridge using the bridge command:
+
+ bridge link set dev DEV learning off
+
+Learning on the device port should be enabled, as well as learning_sync:
+
+ bridge link set dev DEV learning on self
+ bridge link set dev DEV learning_sync on self
+
+Learning_sync attribute enables syncing of the learned/forgotton FDB entry to
+the bridge's FDB. It's possible, but not optimal, to enable learning on the
+device port and on the bridge port, and disable learning_sync.
+
+To support learning and learning_sync port attributes, the driver implements
+switchdev op switchdev_port_attr_get/set for
+SWITCHDEV_ATTR_PORT_ID_BRIDGE_FLAGS. The driver should initialize the attributes
+to the hardware defaults.
+
+FDB Ageing
+^^^^^^^^^^
+
+The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
+the responsibility of the port driver/device to age out these entries. If the
+port device supports ageing, when the FDB entry expires, it will notify the
+driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
+device does not support ageing, the driver can simulate ageing using a
+garbage collection timer to monitor FBD entries. Expired entries will be
+notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for
+example of driver running ageing timer.
+
+To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
+entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
+notification will reset the FDB entry's last-used time to now. The driver
+should rate limit refresh notifications, for example, no more than once a
+second. (The last-used time is visible using the bridge -s fdb option).
+
+STP State Change on Port
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Internally or with a third-party STP protocol implementation (e.g. mstpd), the
+bridge driver maintains the STP state for ports, and will notify the switch
+driver of STP state change on a port using the switchdev op
+switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE.
+
+State is one of BR_STATE_*. The switch driver can use STP state updates to
+update ingress packet filter list for the port. For example, if port is
+DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs
+and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.
+
+Note that STP BDPUs are untagged and STP state applies to all VLANs on the port
+so packet filters should be applied consistently across untagged and tagged
+VLANs on the port.
+
+Flooding L2 domain
+^^^^^^^^^^^^^^^^^^
+
+For a given L2 VLAN domain, the switch device should flood multicast/broadcast
+and unknown unicast packets to all ports in domain, if allowed by port's
+current STP state. The switch driver, knowing which ports are within which
+vlan L2 domain, can program the switch device for flooding. The packet may
+be sent to the port netdev for processing by the bridge driver. The
+bridge should not reflood the packet to the same ports the device flooded,
+otherwise there will be duplicate packets on the wire.
+
+To avoid duplicate packets, the device/driver should mark a packet as already
+forwarded using skb->offload_fwd_mark. The same mark is set on the device
+ports in the domain using dev->offload_fwd_mark. If the skb->offload_fwd_mark
+is non-zero and matches the forwarding egress port's dev->skb_mark, the kernel
+will drop the skb right before transmit on the egress port, with the
+understanding that the device already forwarded the packet on same egress port.
+The driver can use switchdev_port_fwd_mark_set() to set a globally unique mark
+for port's dev->offload_fwd_mark, based on the port's parent ID (switch ID) and
+a group ifindex.
+
+It is possible for the switch device to not handle flooding and push the
+packets up to the bridge driver for flooding. This is not ideal as the number
+of ports scale in the L2 domain as the device is much more efficient at
+flooding packets that software.
+
+If supported by the device, flood control can be offloaded to it, preventing
+certain netdevs from flooding unicast traffic for which there is no FDB entry.
+
+IGMP Snooping
+^^^^^^^^^^^^^
+
+XXX: complete this section
+
+
+L3 Routing Offload
+------------------
+
+Offloading L3 routing requires that device be programmed with FIB entries from
+the kernel, with the device doing the FIB lookup and forwarding. The device
+does a longest prefix match (LPM) on FIB entries matching route prefix and
+forwards the packet to the matching FIB entry's nexthop(s) egress ports.
+
+To program the device, the driver implements support for
+SWITCHDEV_OBJ_IPV[4|6]_FIB object using switchdev_port_obj_xxx ops.
+switchdev_port_obj_add is used for both adding a new FIB entry to the device,
+or modifying an existing entry on the device.
+
+XXX: Currently, only SWITCHDEV_OBJ_ID_IPV4_FIB objects are supported.
+
+SWITCHDEV_OBJ_ID_IPV4_FIB object passes:
+
+ struct switchdev_obj_ipv4_fib { /* IPV4_FIB */
+ u32 dst;
+ int dst_len;
+ struct fib_info *fi;
+ u8 tos;
+ u8 type;
+ u32 nlflags;
+ u32 tb_id;
+ } ipv4_fib;
+
+to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi
+structure holds details on the route and route's nexthops. *dev is one of the
+port netdevs mentioned in the routes next hop list. If the output port netdevs
+referenced in the route's nexthop list don't all have the same switch ID, the
+driver is not called to add/modify/delete the FIB entry.
+
+Routes offloaded to the device are labeled with "offload" in the ip route
+listing:
+
+ $ ip route show
+ default via 192.168.0.2 dev eth0
+ 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload
+ 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
+ 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload
+ 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
+ 12.0.0.2 proto zebra metric 30 offload
+ nexthop via 11.0.0.1 dev sw1p1 weight 1
+ nexthop via 11.0.0.9 dev sw1p2 weight 1
+ 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
+ 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
+ 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15
+
+XXX: add/mod/del IPv6 FIB API
+
+Nexthop Resolution
+^^^^^^^^^^^^^^^^^^
+
+The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
+the switch device to forward the packet with the correct dst mac address, the
+nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac
+address discovery comes via the ARP (or ND) process and is available via the
+arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver
+should trigger the kernel's neighbor resolution process. See the rocker
+driver's rocker_port_ipv4_resolve() for an example.
+
+The driver can monitor for updates to arp_tbl using the netevent notifier
+NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
+for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
+to know when arp_tbl neighbor entries are purged from the port.
+
+Transaction item queue
+^^^^^^^^^^^^^^^^^^^^^^
+
+For switchdev ops attr_set and obj_add, there is a 2 phase transaction model
+used. First phase is to "prepare" anything needed, including various checks,
+memory allocation, etc. The goal is to handle the stuff that is not unlikely
+to fail here. The second phase is to "commit" the actual changes.
+
+Switchdev provides an inftrastructure for sharing items (for example memory
+allocations) between the two phases.
+
+The object created by a driver in "prepare" phase and it is queued up by:
+switchdev_trans_item_enqueue()
+During the "commit" phase, the driver gets the object by:
+switchdev_trans_item_dequeue()
+
+If a transaction is aborted during "prepare" phase, switchdev code will handle
+cleanup of the queued-up objects.