author: Yunhong Jiang <yunhong.jiang@intel.com> 2015-08-04 12:17:53 -0700
committer: Yunhong Jiang <yunhong.jiang@intel.com> 2015-08-04 15:44:42 -0700
commit: 9ca8dbcc65cfc63d6f5ef3312a33184e1d726e00
tree: 1c9cafbcd35f783a87880a10f85d1a060db1a563 /kernel/Documentation/infiniband
parent: 98260f3884f4a202f9ca5eabed40b1354c489b29
Add the rt linux 4.1.3-rt3 as base

Import the rt linux 4.1.3-rt3 as OPNFV kvm base.

It's from git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-4.1.y-rt
and the base is:

commit 0917f823c59692d751951bf5ea699a2d1e2f26a2
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Sat Jul 25 12:13:34 2015 +0200

    Prepare v4.1.3-rt3

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

We lose all the git history this way and it's not good. We
should apply another opnfv project repo in future.

Change-Id: I87543d81c9df70d99c5001fbdf646b202c19f423
Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Diffstat (limited to 'kernel/Documentation/infiniband')
-rw-r--r--  kernel/Documentation/infiniband/core_locking.txt  114
-rw-r--r--  kernel/Documentation/infiniband/ipoib.txt         105
-rw-r--r--  kernel/Documentation/infiniband/sysfs.txt          66
-rw-r--r--  kernel/Documentation/infiniband/user_mad.txt      153
-rw-r--r--  kernel/Documentation/infiniband/user_verbs.txt     69
5 files changed, 507 insertions, 0 deletions
diff --git a/kernel/Documentation/infiniband/core_locking.txt b/kernel/Documentation/infiniband/core_locking.txt
new file mode 100644
index 000000000..e16785422
--- /dev/null
+++ b/kernel/Documentation/infiniband/core_locking.txt
@@ -0,0 +1,114 @@
+INFINIBAND MIDLAYER LOCKING
+
+ This guide is an attempt to make explicit the locking assumptions
+ made by the InfiniBand midlayer. It describes the requirements on
+ both low-level drivers that sit below the midlayer and upper level
+ protocols that use the midlayer.
+
+Sleeping and interrupt context
+
+ With the following exceptions, a low-level driver implementation of
+ all of the methods in struct ib_device may sleep. The exceptions
+ are any methods from the list:
+
+ create_ah
+ modify_ah
+ query_ah
+ destroy_ah
+ bind_mw
+ post_send
+ post_recv
+ poll_cq
+ req_notify_cq
+ map_phys_fmr
+
+ which may not sleep and must be callable from any context.
+
+ The corresponding functions exported to upper level protocol
+ consumers:
+
+ ib_create_ah
+ ib_modify_ah
+ ib_query_ah
+ ib_destroy_ah
+ ib_bind_mw
+ ib_post_send
+ ib_post_recv
+ ib_req_notify_cq
+ ib_map_phys_fmr
+
+ are therefore safe to call from any context.
+
+ In addition, the function
+
+ ib_dispatch_event
+
+ used by low-level drivers to dispatch asynchronous events through
+ the midlayer is also safe to call from any context.
+
+Reentrancy
+
+ All of the methods in struct ib_device exported by a low-level
+ driver must be fully reentrant. The low-level driver is required to
+ perform all synchronization necessary to maintain consistency, even
+ if multiple function calls using the same object are run
+ simultaneously.
+
+ The IB midlayer does not perform any serialization of function calls.
+
+ Because low-level drivers are reentrant, upper level protocol
+ consumers are not required to perform any serialization. However,
+ some serialization may be required to get sensible results. For
+ example, a consumer may safely call ib_poll_cq() on multiple CPUs
+ simultaneously. However, the ordering of the work completion
+ information between different calls of ib_poll_cq() is not defined.
+
+Callbacks
+
+ A low-level driver must not perform a callback directly from the
+ same callchain as an ib_device method call. For example, it is not
+ allowed for a low-level driver to call a consumer's completion event
+ handler directly from its post_send method. Instead, the low-level
+ driver should defer this callback by, for example, scheduling a
+ tasklet to perform the callback.
+
+ The low-level driver is responsible for ensuring that multiple
+ completion event handlers for the same CQ are not called
+ simultaneously. The driver must guarantee that only one CQ event
+ handler for a given CQ is running at a time. In other words, the
+ following situation is not allowed:
+
+        CPU1                                    CPU2
+
+  low-level driver ->
+    consumer CQ event callback:
+      /* ... */
+      ib_req_notify_cq(cq, ...);
+                                          low-level driver ->
+      /* ... */                             consumer CQ event callback:
+                                              /* ... */
+      return from CQ event handler
+
+ The context in which completion event and asynchronous event
+ callbacks run is not defined. Depending on the low-level driver, it
+ may be process context, softirq context, or interrupt context.
+ Upper level protocol consumers may not sleep in a callback.
+
+Hot-plug
+
+ A low-level driver announces that a device is ready for use by
+ consumers when it calls ib_register_device(); all initialization
+ must be complete before this call. The device must remain usable
+ until the driver's call to ib_unregister_device() has returned.
+
+ A low-level driver must call ib_register_device() and
+ ib_unregister_device() from process context. It must not hold any
+ semaphores that could cause deadlock if a consumer calls back into
+ the driver across these calls.
+
+ An upper level protocol consumer may begin using an IB device as
+ soon as the add method of its struct ib_client is called for that
+ device. A consumer must finish all cleanup and free all resources
+ relating to a device before returning from the remove method.
+
+ A consumer is permitted to sleep in its add and remove methods.
diff --git a/kernel/Documentation/infiniband/ipoib.txt b/kernel/Documentation/infiniband/ipoib.txt
new file mode 100644
index 000000000..f2cfe265e
--- /dev/null
+++ b/kernel/Documentation/infiniband/ipoib.txt
@@ -0,0 +1,105 @@
+IP OVER INFINIBAND
+
+ The ib_ipoib driver is an implementation of the IP over InfiniBand
+ protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
+ working group. It is a "native" implementation in the sense of
+ setting the interface type to ARPHRD_INFINIBAND and the hardware
+ address length to 20 (earlier proprietary implementations
+ masqueraded to the kernel as ethernet interfaces).
+
+Partitions and P_Keys
+
+ When the IPoIB driver is loaded, it creates one interface for each
+ port using the P_Key at index 0. To create an interface with a
+ different P_Key, write the desired P_Key into the main interface's
+ /sys/class/net/<intf name>/create_child file. For example:
+
+ echo 0x8001 > /sys/class/net/ib0/create_child
+
+ This will create an interface named ib0.8001 with P_Key 0x8001. To
+ remove a subinterface, use the "delete_child" file:
+
+ echo 0x8001 > /sys/class/net/ib0/delete_child
+
+ The P_Key for any interface is given by the "pkey" file, and the
+ main interface for a subinterface is in "parent."
+
+ Child interfaces can also be created and deleted using IPoIB's
+ rtnl_link_ops; children created either way behave the same.
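As a minimal sketch of the naming shown above, the helper below builds a child interface name from a parent name and a P_Key. The function name ipoib_child_name is hypothetical, and the "<parent>.<4 hex digits>" pattern is an assumption generalized from the single "ib0.8001" example in the text.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the driver): build the name of an
 * IPoIB child interface from its parent name and P_Key, following the
 * "ib0.8001" example. The "<parent>.<4 hex digits>" pattern is an
 * assumption based on that one example.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ipoib_child_name(char *buf, size_t len,
			    const char *parent, unsigned int pkey)
{
	int n = snprintf(buf, len, "%s.%04x", parent, pkey & 0xffffu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller could use the resulting name to look up /sys/class/net/<name>/pkey after writing the P_Key to create_child.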
+
+Datagram vs Connected modes
+
+ The IPoIB driver supports two modes of operation: datagram and
+ connected. The mode is set and read through an interface's
+ /sys/class/net/<intf name>/mode file.
+
+ In datagram mode, the IB UD (Unreliable Datagram) transport is used
+ and so the interface MTU is equal to the IB L2 MTU minus the
+ IPoIB encapsulation header (4 bytes). For example, in a typical IB
+ fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
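The datagram-mode arithmetic above can be written out as a small worked example. The macro and function names here are illustrative only; the 4-byte header size is the value given in the text.

```c
/* IPoIB encapsulation header size in bytes, as stated in the text. */
#define IPOIB_ENCAP_BYTES 4u

/* Hypothetical helper: datagram-mode IPoIB MTU for a given IB L2 MTU. */
static unsigned int ipoib_datagram_mtu(unsigned int ib_l2_mtu)
{
	return ib_l2_mtu - IPOIB_ENCAP_BYTES;
}
```

For the 2K fabric in the example, ipoib_datagram_mtu(2048) yields 2044.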
+
+ In connected mode, the IB RC (Reliable Connected) transport is used.
+ Connected mode takes advantage of the connected nature of the IB
+ transport and allows an MTU up to the maximal IP packet size of 64K,
+ which reduces the number of IP packets needed for handling large UDP
+ datagrams, TCP segments, etc., and increases the performance for large
+ messages.
+
+ In connected mode, the interface's UD QP is still used for multicast
+ and communication with peers that don't support connected mode. In
+ this case, RX emulation of ICMP PMTU packets is used to cause the
+ networking stack to use the smaller UD MTU for these neighbours.
+
+Stateless offloads
+
+ If the IB HW supports IPoIB stateless offloads, IPoIB advertises
+ TCP/IP checksum and/or Large Send (LSO) offloading capability to the
+ network stack.
+
+ Large Receive (LRO) offloading is also implemented and may be turned
+ on/off using ethtool calls. Currently LRO is supported only for
+ checksum offload capable devices.
+
+ Stateless offloads are supported only in datagram mode.
+
+Interrupt moderation
+
+ If the underlying IB device supports CQ event moderation, one can
+ use ethtool to set interrupt mitigation parameters and thus reduce
+ the overhead incurred by handling interrupts. The main code path of
+ IPoIB doesn't use events for TX completion signaling so only RX
+ moderation is supported.
+
+Debugging Information
+
+ By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
+ to 'y', tracing messages are compiled into the driver. They are
+ turned on by setting the module parameters debug_level and
+ mcast_debug_level to 1. These parameters can be controlled at
+ runtime through files in /sys/module/ib_ipoib/.
+
+ CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
+ virtual filesystem. By mounting this filesystem, for example with
+
+ mount -t debugfs none /sys/kernel/debug
+
+ it is possible to get statistics about multicast groups from the
+ files /sys/kernel/debug/ipoib/ib0_mcg and so on.
+
+ The performance impact of this option is negligible, so it
+ is safe to enable this option with debug_level set to 0 for normal
+ operation.
+
+ CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
+ the data path when data_debug_level is set to 1. However, even with
+ the output disabled, enabling this configuration option will affect
+ performance, because it adds tests to the fast path.
+
+References
+
+ Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
+ http://ietf.org/rfc/rfc4391.txt
+ IP over InfiniBand (IPoIB) Architecture (RFC 4392)
+ http://ietf.org/rfc/rfc4392.txt
+ IP over InfiniBand: Connected Mode (RFC 4755)
+ http://ietf.org/rfc/rfc4755.txt
diff --git a/kernel/Documentation/infiniband/sysfs.txt b/kernel/Documentation/infiniband/sysfs.txt
new file mode 100644
index 000000000..ddd519b72
--- /dev/null
+++ b/kernel/Documentation/infiniband/sysfs.txt
@@ -0,0 +1,66 @@
+SYSFS FILES
+
+ For each InfiniBand device, the InfiniBand drivers create the
+ following files under /sys/class/infiniband/<device name>:
+
+ node_type - Node type (CA, switch or router)
+ node_guid - Node GUID
+ sys_image_guid - System image GUID
+
+ In addition, there is a "ports" subdirectory, with one subdirectory
+ for each port. For example, if mthca0 is a 2-port HCA, there will
+ be two directories:
+
+ /sys/class/infiniband/mthca0/ports/1
+ /sys/class/infiniband/mthca0/ports/2
+
+ (A switch will only have a single "0" subdirectory for switch port
+ 0; no subdirectory is created for normal switch ports)
+
+ In each port subdirectory, the following files are created:
+
+ cap_mask - Port capability mask
+ lid - Port LID
+ lid_mask_count - Port LID mask count
+ rate - Port data rate (active width * active speed)
+ sm_lid - Subnet manager LID for port's subnet
+ sm_sl - Subnet manager SL for port's subnet
+ state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER)
+ phys_state - Port physical state (Sleep, Polling, LinkUp, etc)
+
+ There is also a "counters" subdirectory, with files
+
+ VL15_dropped
+ excessive_buffer_overrun_errors
+ link_downed
+ link_error_recovery
+ local_link_integrity_errors
+ port_rcv_constraint_errors
+ port_rcv_data
+ port_rcv_errors
+ port_rcv_packets
+ port_rcv_remote_physical_errors
+ port_rcv_switch_relay_errors
+ port_xmit_constraint_errors
+ port_xmit_data
+ port_xmit_discards
+ port_xmit_packets
+ symbol_error
+
+ Each of these files contains the corresponding value from the port's
+ Performance Management PortCounters attribute, as described in
+ section 16.1.3.5 of the InfiniBand Architecture Specification.
+
+ The "pkeys" and "gids" subdirectories contain one file for each
+ entry in the port's P_Key or GID table respectively. For example,
+ ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key
+ table.
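One way to consume the per-port files described above from userspace is sketched below. Both function names are hypothetical; only the directory layout comes from the text, and the device name passed in (for example "mthca0") is whatever the system actually exposes.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the sysfs path of a per-port attribute,
 * e.g. ("mthca0", 1, "state") or ("mthca0", 1, "pkeys/10").
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ib_port_attr_path(char *buf, size_t len, const char *dev,
			     unsigned int port, const char *attr)
{
	int n = snprintf(buf, len, "/sys/class/infiniband/%s/ports/%u/%s",
			 dev, port, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of a sysfs file into val, without the trailing
 * newline. Returns 0 on success, -1 if the file cannot be read.
 */
static int ib_read_attr(const char *path, char *val, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(val, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	val[strcspn(val, "\n")] = '\0';
	return 0;
}
```

The same path builder covers the "counters", "pkeys" and "gids" subdirectories by passing, for instance, "counters/symbol_error" as the attribute.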
+
+MTHCA
+
+ The Mellanox HCA driver also creates the files:
+
+ hw_rev - Hardware revision number
+ fw_ver - Firmware version
+ hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)",
+ or "MT25208"
diff --git a/kernel/Documentation/infiniband/user_mad.txt b/kernel/Documentation/infiniband/user_mad.txt
new file mode 100644
index 000000000..7aca13a54
--- /dev/null
+++ b/kernel/Documentation/infiniband/user_mad.txt
@@ -0,0 +1,153 @@
+USERSPACE MAD ACCESS
+
+Device files
+
+ Each port of each InfiniBand device has a "umad" device and an
+ "issm" device attached. For example, a two-port HCA will have two
+ umad devices and two issm devices, while a switch will have one
+ device of each type (for switch port 0).
+
+Creating MAD agents
+
+ A MAD agent can be created by filling in a struct ib_user_mad_reg_req
+ and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file
+ descriptor for the appropriate device file. If the registration
+ request succeeds, a 32-bit id will be returned in the structure.
+ For example:
+
+	struct ib_user_mad_reg_req req = { /* ... */ };
+	ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
+	if (!ret)
+		my_agent = req.id;
+	else
+		perror("agent register");
+
+ Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT
+ ioctl. Also, all agents registered through a file descriptor will
+ be unregistered when the descriptor is closed.
+
+ Since 2014, a new registration ioctl, IB_USER_MAD_REGISTER_AGENT2,
+ allows additional fields to be provided during registration.
+ Users of this registration call implicitly enable the use of
+ pkey_index (see below).
+
+Receiving MADs
+
+ MADs are received using read(). The receive side now supports
+ RMPP. The buffer passed to read() must be at least one
+ struct ib_user_mad + 256 bytes.
+
+ If the buffer passed is not large enough to hold the received
+ MAD (RMPP), the errno is set to ENOSPC and the length of the
+ buffer needed is set in mad.length.
+
+ Example for normal MAD (non RMPP) reads:
+	struct ib_user_mad *mad;
+	mad = malloc(sizeof *mad + 256);
+	ret = read(fd, mad, sizeof *mad + 256);
+	if (ret != sizeof *mad + 256) {
+		perror("read");
+		free(mad);
+	}
+
+ Example for RMPP reads:
+	struct ib_user_mad *mad;
+	mad = malloc(sizeof *mad + 256);
+	ret = read(fd, mad, sizeof *mad + 256);
+	if (ret < 0 && errno == ENOSPC) {
+		length = mad->hdr.length;	/* length of the buffer needed */
+		free(mad);
+		mad = malloc(sizeof *mad + length);
+		ret = read(fd, mad, sizeof *mad + length);
+	}
+	if (ret < 0) {
+		perror("read");
+		free(mad);
+	}
+
+ In addition to the actual MAD contents, the other struct ib_user_mad
+ fields will be filled in with information on the received MAD. For
+ example, the remote LID will be in mad.lid.
+
+ If a send times out, a receive will be generated with mad.status set
+ to ETIMEDOUT. Otherwise when a MAD has been successfully received,
+ mad.status will be 0.
+
+ poll()/select() may be used to wait until a MAD can be read.
+
+Sending MADs
+
+ MADs are sent using write(). The agent ID for sending should be
+ filled into the id field of the MAD, the destination LID should be
+ filled into the lid field, and so on. The send side supports RMPP,
+ so MADs of arbitrary length can be sent. For example:
+
+	struct ib_user_mad *mad;
+
+	mad = malloc(sizeof *mad + mad_length);
+
+	/* fill in mad->data */
+
+	mad->hdr.id = my_agent;	/* req.id from agent registration */
+	mad->hdr.lid = my_dest;	/* in network byte order... */
+	/* etc. */
+
+	ret = write(fd, mad, sizeof *mad + mad_length);	/* pass the buffer, not &mad */
+	if (ret != sizeof *mad + mad_length)
+		perror("write");
+
+Transaction IDs
+
+ Users of the umad devices can use the lower 32 bits of the
+ transaction ID field (that is, the least significant half of the
+ field in network byte order) in MADs being sent to match
+ request/response pairs. The upper 32 bits are reserved for use by
+ the kernel and will be overwritten before a MAD is sent.
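To make the split of the transaction ID concrete, the sketch below treats the 64-bit field as eight bytes in network byte order and touches only the last four, which are the user-owned half. The function names are hypothetical and the helpers do not depend on any umad header.

```c
#include <stdint.h>

/*
 * Hypothetical helper: store a caller-chosen 32-bit value in the low
 * half of an 8-byte transaction ID held in network byte order. The
 * upper half is zeroed here; the kernel overwrites it before sending.
 */
static void mad_tid_set_user(uint8_t tid[8], uint32_t user)
{
	tid[0] = tid[1] = tid[2] = tid[3] = 0;
	tid[4] = (uint8_t)(user >> 24);
	tid[5] = (uint8_t)(user >> 16);
	tid[6] = (uint8_t)(user >> 8);
	tid[7] = (uint8_t)user;
}

/* Recover the user-owned low 32 bits, ignoring the kernel-owned half. */
static uint32_t mad_tid_get_user(const uint8_t tid[8])
{
	return ((uint32_t)tid[4] << 24) | ((uint32_t)tid[5] << 16) |
	       ((uint32_t)tid[6] << 8) | (uint32_t)tid[7];
}
```

A consumer matching responses to requests would compare only the value returned by mad_tid_get_user(), since the upper four bytes of a response reflect what the kernel wrote.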
+
+P_Key Index Handling
+
+ The old ib_umad interface did not allow setting the P_Key index for
+ MADs that are sent and did not provide a way for obtaining the P_Key
+ index of received MADs. A new layout for struct ib_user_mad_hdr
+ with a pkey_index member has been defined; however, to preserve binary
+ compatibility with older applications, this new layout will not be used
+ unless either the IB_USER_MAD_ENABLE_PKEY or the IB_USER_MAD_REGISTER_AGENT2
+ ioctl is called before a file descriptor is used for anything else.
+
+ In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented
+ to 6, the new layout of struct ib_user_mad_hdr will be used by
+ default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed.
+
+Setting IsSM Capability Bit
+
+ To set the IsSM capability bit for a port, simply open the
+ corresponding issm device file. If the IsSM bit is already set,
+ then the open call will block until the bit is cleared (or return
+ immediately with errno set to EAGAIN if the O_NONBLOCK flag is
+ passed to open()). The IsSM bit will be cleared when the issm file
+ is closed. No read, write or other operations can be performed on
+ the issm file.
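The non-blocking variant of this open can be sketched as follows. The function name is hypothetical, and the path passed in (for example /dev/infiniband/issm0) is an assumption that depends on how device nodes are created on the system.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: try to claim the IsSM bit without blocking.
 * Returns an open file descriptor on success (closing it clears the
 * bit), or a negative errno value; -EAGAIN means the bit is already
 * set by another opener.
 */
static int issm_try_claim(const char *issm_path)
{
	int fd = open(issm_path, O_RDONLY | O_NONBLOCK);

	return fd < 0 ? -errno : fd;
}
```

The descriptor should be kept open for as long as the bit must stay set, and no read or write should be attempted on it.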
+
+/dev files
+
+ To create the appropriate character device files automatically with
+ udev, a rule like
+
+ KERNEL=="umad*", NAME="infiniband/%k"
+ KERNEL=="issm*", NAME="infiniband/%k"
+
+ can be used. This will create device nodes named
+
+ /dev/infiniband/umad0
+ /dev/infiniband/issm0
+
+ for the first port, and so on. The InfiniBand device and port
+ associated with these devices can be determined from the files
+
+ /sys/class/infiniband_mad/umad0/ibdev
+ /sys/class/infiniband_mad/umad0/port
+
+ and
+
+ /sys/class/infiniband_mad/issm0/ibdev
+ /sys/class/infiniband_mad/issm0/port
diff --git a/kernel/Documentation/infiniband/user_verbs.txt b/kernel/Documentation/infiniband/user_verbs.txt
new file mode 100644
index 000000000..e5092d696
--- /dev/null
+++ b/kernel/Documentation/infiniband/user_verbs.txt
@@ -0,0 +1,69 @@
+USERSPACE VERBS ACCESS
+
+ The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS,
+ enables direct userspace access to IB hardware via "verbs," as
+ described in chapter 11 of the InfiniBand Architecture Specification.
+
+ To use the verbs, the libibverbs library, available from
+ http://www.openfabrics.org/, is required. libibverbs contains a
+ device-independent API for using the ib_uverbs interface.
+ libibverbs also requires the appropriate device-dependent kernel and
+ userspace drivers for your InfiniBand hardware. For example, to use
+ a Mellanox HCA, you will need the ib_mthca kernel module and the
+ libmthca userspace driver to be installed.
+
+User-kernel communication
+
+ Userspace communicates with the kernel for slow path, resource
+ management operations via the /dev/infiniband/uverbsN character
+ devices. Fast path operations are typically performed by writing
+ directly to hardware registers mmap()ed into userspace, with no
+ system call or context switch into the kernel.
+
+ Commands are sent to the kernel via write()s on these device files.
+ The ABI is defined in drivers/infiniband/include/ib_user_verbs.h.
+ The structs for commands that require a response from the kernel
+ contain a 64-bit field used to pass a pointer to an output buffer.
+ Status is returned to userspace as the return value of the write()
+ system call.
+
+Resource management
+
+ Since creation and destruction of all IB resources is done by
+ commands passed through a file descriptor, the kernel can keep track
+ of which resources are attached to a given userspace context. The
+ ib_uverbs module maintains idr tables that are used to translate
+ between kernel pointers and opaque userspace handles, so that kernel
+ pointers are never exposed to userspace and userspace cannot trick
+ the kernel into following a bogus pointer.
+
+ This also allows the kernel to clean up when a process exits and
+ prevent one process from touching another process's resources.
+
+Memory pinning
+
+ Direct userspace I/O requires that memory regions that are potential
+ I/O targets be kept resident at the same physical address. The
+ ib_uverbs module manages pinning and unpinning memory regions via
+ get_user_pages() and put_page() calls. It also accounts for the
+ amount of memory pinned in the process's locked_vm, and checks that
+ unprivileged processes do not exceed their RLIMIT_MEMLOCK limit.
+
+ Pages that are pinned multiple times are counted each time they are
+ pinned, so the value of locked_vm may be an overestimate of the
+ number of pages pinned by a process.
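An unprivileged application may want to check its limit before registering memory. The sketch below is a userspace pre-flight check only; the function name is hypothetical, and the caller's own running total stands in for the kernel's locked_vm accounting, which is not consulted. Because pages pinned more than once are counted each time, that total should include repeats.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical pre-flight check: would pinning `bytes` more memory
 * stay within this process's RLIMIT_MEMLOCK soft limit, given that
 * `already_pinned` bytes are accounted for by the caller?
 * Returns 1 if it fits, 0 if not, -1 if the limit cannot be read.
 */
static int memlock_fits(size_t already_pinned, size_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	if (already_pinned > rl.rlim_cur)
		return 0;
	return bytes <= rl.rlim_cur - already_pinned;
}
```

A result of 1 is only advisory: the kernel's accounting is authoritative, and a registration can still fail.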
+
+/dev files
+
+ To create the appropriate character device files automatically with
+ udev, a rule like
+
+ KERNEL=="uverbs*", NAME="infiniband/%k"
+
+ can be used. This will create device nodes named
+
+ /dev/infiniband/uverbs0
+
+ and so on. Since the InfiniBand userspace verbs should be safe for
+ use by non-privileged processes, it may be useful to add an
+ appropriate MODE or GROUP to the udev rule.