author: RajithaY <rajithax.yerrumsetty@intel.com> 2017-04-25 03:31:15 -0700
committer: Rajitha Yerrumchetty <rajithax.yerrumsetty@intel.com> 2017-05-22 06:48:08 +0000
commit: bb756eebdac6fd24e8919e2c43f7d2c8c4091f59
tree: ca11e03542edf2d8f631efeca5e1626d211107e3 /qemu/docs/rdma.txt
parent: a14b48d18a9ed03ec191cf16b162206998a895ce
Adding qemu as a submodule of KVMFORNFV
This Patch includes the changes to add qemu as a submodule to kvmfornfv repo
and make use of the updated latest qemu for the execution of all testcase

Change-Id: I1280af507a857675c7f81d30c95255635667bdd7
Signed-off-by: RajithaY <rajithax.yerrumsetty@intel.com>
Diffstat (limited to 'qemu/docs/rdma.txt')
-rw-r--r-- qemu/docs/rdma.txt 420
1 files changed, 0 insertions, 420 deletions
diff --git a/qemu/docs/rdma.txt b/qemu/docs/rdma.txt
deleted file mode 100644
index 2bdd0a5be..000000000
--- a/qemu/docs/rdma.txt
+++ /dev/null
@@ -1,420 +0,0 @@
-(RDMA: Remote Direct Memory Access)
-RDMA Live Migration Specification, Version # 1
-==============================================
-Wiki: http://wiki.qemu-project.org/Features/RDMALiveMigration
-Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
-
-Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
-
-An *exhaustive* paper (2010) with additional performance details
-is linked on the QEMU wiki above.
-
-Contents:
-=========
-* Introduction
-* Before running
-* Running
-* Performance
-* RDMA Migration Protocol Description
-* Versioning and Capabilities
-* QEMUFileRDMA Interface
-* Migration of VM's ram
-* Error handling
-* TODO
-
-Introduction:
-=============
-
-RDMA helps make your migration more deterministic under heavy load because
-of its significantly lower latency and higher throughput compared to TCP/IP.
-This is because the RDMA I/O architecture reduces the number of interrupts and
-data copies by bypassing the host networking stack. In particular, a TCP-based
-migration, under certain types of memory-bound workloads, may take an
-unpredictable amount of time to complete if the amount of
-memory tracked during each live migration iteration round cannot keep pace
-with the rate of dirty memory produced by the workload.
-
-RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
-over Converged Ethernet) and Infiniband-based. This implementation of
-migration using RDMA is capable of using both technologies because of
-the use of the OpenFabrics OFED software stack that abstracts out the
-programming model irrespective of the underlying hardware.
-
-Refer to openfabrics.org or your respective RDMA hardware vendor for
-instructions on how to verify that you have the OFED software stack
-installed in your environment. You should be able to successfully link
-against the "librdmacm" and "libibverbs" libraries and development headers
-for a working build of QEMU to run successfully using RDMA Migration.
-
-BEFORE RUNNING:
-===============
-
-Use of RDMA during migration requires pinning and registering memory
-with the hardware. This means that memory must be physically resident
-before the hardware can transmit that memory to another machine.
-If this is not acceptable for your application or product, then the use
-of RDMA migration may in fact be harmful to co-located VMs or other
-software on the machine if there is not sufficient memory available to
-relocate the entire footprint of the virtual machine. If so, then the
-use of RDMA is discouraged and it is recommended to use standard TCP migration.
-
-Experimental: Next, decide whether you want dynamic page registration
-or whether all memory should be pinned up front (rdma-pin-all).
-For example, if you have an 8GB RAM virtual machine, but only 1GB
-is in active use, then enabling rdma-pin-all will cause all 8GB to
-be pinned and resident in memory. This feature mostly affects the
-bulk-phase round of the migration and can be enabled for extremely
-high-performance RDMA hardware using the following command:
-
-QEMU Monitor Command:
-$ migrate_set_capability rdma-pin-all on # disabled by default
-
-Performing this action will cause all 8GB to be pinned, so if that's
-not what you want, then please ignore this step altogether.
-
-On the other hand, this will also significantly speed up the bulk round
-of the migration, which can greatly reduce the "total" time of your migration.
-Example performance of this using an idle VM in the previous example
-can be found in the "Performance" section.
-
-Note: for very large virtual machines (hundreds of GBs), pinning
-*all* of the memory of your virtual machine in the kernel is very expensive
-and may extend the initial bulk iteration time by many seconds,
-thus extending the total migration time. However, this will not
-affect the determinism or predictability of your migration: you will
-still gain from the benefits of advanced pinning with RDMA.
-
-RUNNING:
-========
-
-First, set the migration speed to match your hardware's capabilities:
-
-QEMU Monitor Command:
-$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
-
-Next, on the destination machine, add the following to the QEMU command line:
-
-qemu ..... -incoming rdma:host:port
-
-Finally, perform the actual migration on the source machine:
-
-QEMU Monitor Command:
-$ migrate -d rdma:host:port
-
-PERFORMANCE
-===========
-
-Here is a brief summary of total migration time and downtime using RDMA,
-measured over a 40 Gbps infiniband link while performing a worst-case
-stress test on an 8GB RAM virtual machine:
-
-Using the following command:
-$ apt-get install stress
-$ stress --vm-bytes 7500M --vm 1 --vm-keep
-
-1. Migration throughput: 26 gigabits/second.
-2. Downtime (stop time) varies between 15 and 100 milliseconds.
-
-EFFECTS of memory registration on bulk phase round:
-
-For example, in the same 8GB RAM example, with all 8GB of memory in
-active use but the VM itself completely idle, using the same 40 Gbps
-infiniband link:
-
-1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
-
-These numbers would of course scale up to whatever size virtual machine
-you have to migrate using RDMA.
-
-Enabling this feature does *not* have any measurable effect on
-migration *downtime*. This is because, without this feature, all of the
-memory will have already been registered in advance during
-the bulk round and does not need to be re-registered during the successive
-iteration rounds.
-
-RDMA Protocol Description:
-==========================
-
-Migration with RDMA is separated into two parts:
-
-1. The transmission of the pages using RDMA
-2. Everything else (a control channel is introduced)
-
-"Everything else" is transmitted using a formal
-protocol now, consisting of infiniband SEND messages.
-
-An infiniband SEND message is the standard ibverbs
-message used by applications of infiniband hardware.
-The only difference between a SEND message and an RDMA
-message is that SEND messages cause notifications
-to be posted to the completion queue (CQ) on the
-infiniband receiver side, whereas RDMA messages (used
-for VM's ram) do not (to behave like an actual DMA).
-
-Messages in infiniband require two things:
-
-1. registration of the memory that will be transmitted
-2. (SEND only) work requests to be posted on both
- sides of the network before the actual transmission
- can occur.
-
-RDMA messages are much easier to deal with. Once the memory
-on the receiver side is registered and pinned, we're
-basically done. All that is required is for the sender
-side to start dumping bytes onto the link.
-
-(Memory is not released from pinning until the migration
-completes, given that RDMA migrations are very fast.)
-
-SEND messages require more coordination because the
-receiver must have reserved space (using a receive
-work request) on the receive queue (RQ) before QEMUFileRDMA
-can start using them to carry all the bytes as
-a control transport for migration of device state.
-
-To begin the migration, the initial connection setup is
-as follows (migration-rdma.c):
-
-1. Receiver and Sender are started (command line or libvirt)
-2. Both sides post two RQ work requests
-3. Receiver does listen()
-4. Sender does connect()
-5. Receiver does accept()
-6. Check versioning and capabilities (described later)
-
-At this point, we define a control channel on top of SEND messages
-which is described by a formal protocol. Each SEND message has a
-header portion and a data portion (but together are transmitted
-as a single SEND message).
-
-Header:
- * Length (of the data portion, uint32, network byte order)
- * Type (what command to perform, uint32, network byte order)
- * Repeat (Number of commands in data portion, same type only)
-
-The 'Repeat' field is here to support future multiple page registrations
-in a single message without any need to change the protocol itself,
-so that the protocol stays compatible across multiple versions of QEMU.
-Version #1 requires that all server implementations of the protocol must
-check this field and register all requests found in the array of commands located
-in the data portion and return an equal number of results in the response.
-The maximum number of repeats is hard-coded to 4096. This is a conservative
-limit based on the maximum size of a SEND message along with empirical
-observations on the maximum future benefit of simultaneous page registrations.
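The header layout described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the QEMU implementation: the text specifies uint32 in network byte order for Length and Type only, so treating Repeat as a third uint32 (and assuming no padding) is a guess, and the function names are hypothetical.

```python
import struct

# Assumed layout: three unsigned 32-bit fields in network byte order.
# Only Length and Type are documented as uint32; Repeat's width is an assumption.
CONTROL_HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit stated in the text


def pack_control_message(msg_type: int, data: bytes, repeat: int = 1) -> bytes:
    """Build one SEND payload: the header followed by the data portion."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return CONTROL_HEADER.pack(len(data), msg_type, repeat) + data


def unpack_control_message(payload: bytes):
    """Split a SEND payload into (type, repeat, data) and validate it."""
    length, msg_type, repeat = CONTROL_HEADER.unpack_from(payload)
    data = payload[CONTROL_HEADER.size:]
    if len(data) != length:
        raise ValueError("length field does not match data portion")
    if repeat > MAX_REPEAT:
        raise ValueError("too many repeated commands")
    return msg_type, repeat, data
```

Because header and data travel in a single SEND message, the receiver can validate the Length field against the bytes it actually received before acting on the command.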
-
-The 'type' field has 12 different command values:
- 1. Unused
- 2. Error (sent to the source during bad things)
- 3. Ready (control-channel is available)
- 4. QEMU File (for sending non-live device state)
- 5. RAM Blocks request (used right after connection setup)
- 6. RAM Blocks result (used right after connection setup)
- 7. Compress page (zap zero page and skip registration)
- 8. Register request (dynamic chunk registration)
- 9. Register result ('rkey' to be used by sender)
- 10. Register finished (registration for current iteration finished)
- 11. Unregister request (unpin previously registered memory)
- 12. Unregister finished (confirmation that unpin completed)
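For reference, the list above can be written down as a Python enumeration. This is a sketch only: the numbers simply mirror the positions in the list, and the values actually used on the wire are defined by the implementation and may differ. The identifier names and the request/reply pairing table are assumptions derived from the descriptions in the list.

```python
from enum import IntEnum


class ControlCommand(IntEnum):
    # Values follow the list numbering in this document (an assumption
    # about the wire encoding, not a statement of it).
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


# Requests that, going by the descriptions above, are answered by a reply.
EXPECTED_REPLY = {
    ControlCommand.RAM_BLOCKS_REQUEST: ControlCommand.RAM_BLOCKS_RESULT,
    ControlCommand.REGISTER_REQUEST: ControlCommand.REGISTER_RESULT,
    ControlCommand.UNREGISTER_REQUEST: ControlCommand.UNREGISTER_FINISHED,
}
```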
-
-A single control message, as hinted above, can contain within the data
-portion an array of many commands of the same type. If there is more than
-one command, then the 'repeat' field will be greater than 1.
-
-After connection setup, messages 5 and 6 are used to exchange ram block
-information and optionally pin all the memory if requested by the user.
-
-After ram block exchange is completed, we have two protocol-level
-functions, responsible for communicating control-channel commands
-using the above list of values:
-
-Logically:
-
-qemu_rdma_exchange_recv(header, expected command type)
-
-1. We transmit a READY command to let the sender know that
- we are *ready* to receive some data bytes on the control channel.
-2. Before attempting to receive the expected command, we post another
- RQ work request to replace the one we just used up.
-3. Block on a CQ event channel and wait for the SEND to arrive.
-4. When the send arrives, librdmacm will unblock us.
-5. Verify that the command-type and version received match the ones we expected.
-
-qemu_rdma_exchange_send(header, data, optional response header & data):
-
-1. Block on the CQ event channel waiting for a READY command
- from the receiver to tell us that the receiver
- is *ready* for us to transmit some new bytes.
-2. Optionally: if we are expecting a response from the command
- (that we have not yet transmitted), let's post an RQ
- work request to receive that data a few moments later.
-3. When the READY arrives, librdmacm will
- unblock us and we immediately post a RQ work request
- to replace the one we just used up.
-4. Now, we can actually post the work request to SEND
- the requested command type of the header we were asked for.
-5. Optionally, if we are expecting a response (as before),
- we block again and wait for that response using the additional
- work request we previously posted. (This is used to carry
- 'Register result' commands #9 back to the sender which
- hold the rkey needed to perform RDMA. Note that the virtual address
- corresponding to this rkey was already exchanged at the beginning
- of the connection (described below).)
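The READY handshake between the two functions above can be modelled with a toy Python simulation. This is a sketch under heavy assumptions: two in-memory queues stand in for the control channel, a blocking get() stands in for waiting on the CQ event channel, and the posting of RQ work requests is not modelled at all. The function names and string command types are illustrative, not the real signatures.

```python
import queue
import threading

READY = "ready"


def exchange_recv(to_sender, from_sender, expected_type):
    """Receiver side: announce readiness, then wait for the expected command."""
    to_sender.put((READY, b""))           # tell the sender we are ready
    msg_type, data = from_sender.get()    # block until the SEND arrives
    if msg_type != expected_type:         # verify the command type
        raise RuntimeError("unexpected command: %r" % (msg_type,))
    return data


def exchange_send(to_receiver, from_receiver, msg_type, data):
    """Sender side: wait for READY, then transmit the command."""
    ready_type, _ = from_receiver.get()   # block until READY arrives
    if ready_type != READY:
        raise RuntimeError("expected READY, got %r" % (ready_type,))
    to_receiver.put((msg_type, data))     # now the SEND may be posted


def demo():
    """Run one send/receive exchange across two threads."""
    sender_to_receiver = queue.Queue()
    receiver_to_sender = queue.Queue()
    received = []
    receiver = threading.Thread(
        target=lambda: received.append(
            exchange_recv(receiver_to_sender, sender_to_receiver, "qemu_file")
        )
    )
    receiver.start()
    exchange_send(sender_to_receiver, receiver_to_sender, "qemu_file", b"device state")
    receiver.join()
    return received[0]
```

The point of the handshake is flow control: the sender never transmits until the receiver has said it has space reserved to receive.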
-
-All of the remaining command types (not including 'ready')
-described above use the aforementioned two functions to do the hard work:
-
-1. After connection setup, RAMBlock information is exchanged using
- this protocol before the actual migration begins. This information includes
- a description of each RAMBlock on the server side as well as the virtual addresses
- and lengths of each RAMBlock. This is used by the client to determine the
- start and stop locations of chunks and how to register them dynamically
- before performing the RDMA operations.
-2. During runtime, once a 'chunk' becomes full of pages ready to
- be sent with RDMA, the registration commands are used to ask the
- other side to register the memory for this chunk and respond
- with the result (rkey) of the registration.
-3. The QEMUFile interfaces also call these functions (described below)
- when transmitting non-live state, such as devices, or to send
- their own protocol information during the migration process.
-4. Finally, zero pages are only checked if a page has not yet been registered
- using chunk registration (or not checked at all and unconditionally
- written if chunk registration is disabled). This is accomplished using
- the "Compress" command listed above. If the page *has* been registered
- then we check the entire chunk for zero. Only if the entire chunk is
- zero do we send a compress command to zap the page on the other side.
-
-Versioning and Capabilities
-===========================
-Current version of the protocol is version #1.
-
-The same version applies to both protocol traffic and capabilities
-negotiation (i.e. there is only one version number that is referred to
-by all communication).
-
-librdmacm provides the user with a 'private data' area to be exchanged
-at connection-setup time before any infiniband traffic is generated.
-
-Header:
- * Version (protocol version validated before send/recv occurs),
- uint32, network byte order
- * Flags (bitwise OR of each capability),
- uint32, network byte order
-
-There is no data portion of this header right now, so there is
-no length field. The maximum size of the 'private data' section
-is only 192 bytes per the Infiniband specification, so it's not
-very useful for data anyway. This structure needs to remain small.
-
-This private data area is a convenient place to check for protocol
-versioning because the user does not need to register memory to
-transmit a few bytes of version information.
-
-This is also a convenient place to negotiate capabilities
-(like dynamic page registration).
-
-If the version is invalid, we throw an error.
-
-If the version is new, we only negotiate the capabilities that the
-requested version is able to perform and ignore the rest.
-
-Currently there is only one capability in Version #1: dynamic page registration.
-
-Finally: Negotiation happens with the Flags field: If the primary-VM
-sets a flag, but the destination does not support this capability, it
-will return a zero-bit for that flag and the primary-VM will understand
-that as not being an available capability and will thus disable that
-capability on the primary-VM side.
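The flag negotiation described above amounts to a bitwise AND on each side. The following Python sketch illustrates it under assumptions: the two-field layout follows the header description, but the flag bit value (CAP_EXAMPLE), the function names, and the exact rule for what counts as an invalid version are hypothetical.

```python
import struct

# Version and Flags, both uint32 in network byte order, per the header above.
PRIVATE_DATA = struct.Struct("!II")
PROTOCOL_VERSION = 1
CAP_EXAMPLE = 0x01  # hypothetical bit for the single Version #1 capability


def pack_private_data(version: int, flags: int) -> bytes:
    """Source side: build the private data sent at connection setup."""
    return PRIVATE_DATA.pack(version, flags)


def destination_reply(request: bytes, supported_flags: int) -> bytes:
    """Destination side: validate the version, clear unsupported flag bits."""
    version, flags = PRIVATE_DATA.unpack(request)
    # Assumed validity rule: versions 1..PROTOCOL_VERSION are accepted.
    if version < 1 or version > PROTOCOL_VERSION:
        raise ValueError("unsupported protocol version %d" % version)
    return PRIVATE_DATA.pack(version, flags & supported_flags)


def source_apply_reply(requested_flags: int, reply: bytes) -> int:
    """Source side: keep only the capabilities the destination granted."""
    _, granted = PRIVATE_DATA.unpack(reply)
    return requested_flags & granted
```

For example, if the source requests CAP_EXAMPLE and the destination supports no capabilities, the reply carries a zero bit and the source ends up with the capability disabled.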
-
-QEMUFileRDMA Interface:
-=======================
-
-QEMUFileRDMA introduces a couple of new functions:
-
-1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
-
-These two functions are very short and simply use the protocol
-described above to deliver bytes without changing the upper-level
-users of QEMUFile that depend on a bytestream abstraction.
-
-Finally, how do we hand off the actual bytes to get_buffer()?
-
-Again, because we're trying to "fake" a bytestream abstraction
-using an analogy not unlike individual UDP frames, we have
-to hold on to the bytes received from control-channel's SEND
-messages in memory.
-
-Each time we receive a complete "QEMU File" control-channel
-message, the bytes from SEND are copied into a small local holding area.
-
-Then, we return the number of bytes requested by get_buffer()
-and leave the remaining bytes in the holding area until get_buffer()
-comes around for another pass.
-
-If the buffer is empty, then we follow the same steps
-listed above and issue another "QEMU File" protocol command,
-asking for a new SEND message to re-fill the buffer.
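The holding-area behaviour described above can be sketched in a few lines of Python. This is an illustration only: the class name and the fetch_message callable (which stands in for issuing another "QEMU File" command and receiving the resulting SEND) are hypothetical, and the sketch returns a short read when fewer bytes are held than were requested.

```python
class HoldingArea:
    """Presents whole control-channel messages as a byte stream."""

    def __init__(self, fetch_message):
        # fetch_message: callable returning the bytes of the next
        # "QEMU File" message (hypothetical stand-in for the protocol).
        self._fetch = fetch_message
        self._buf = b""

    def get_buffer(self, size: int) -> bytes:
        # Refill only when the holding area is empty.
        if not self._buf:
            self._buf = self._fetch()
        # Hand back up to `size` bytes and keep the remainder for next time.
        out, self._buf = self._buf[:size], self._buf[size:]
        return out
```

With messages b"abcdef" and b"gh" and a request size of 4, successive calls yield b"abcd", then the leftover b"ef", then b"gh" from the next message.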
-
-Migration of VM's ram:
-====================
-
-At the beginning of the migration (migration-rdma.c),
-the sender and the receiver populate the list of RAMBlocks
-to be registered with each other into a structure.
-Then, using the aforementioned protocol, they exchange a
-description of these blocks with each other, to be used later
-during the iteration of main memory. This description includes
-a list of all the RAMBlocks, their offsets and lengths, virtual
-addresses and, if dynamic page registration was disabled on the
-server-side, pre-registered RDMA keys.
-
-Main memory is not migrated with the aforementioned protocol,
-but is instead migrated with normal RDMA Write operations.
-
-Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
-Chunk size is not dynamic, but it could be in a future implementation.
-There's nothing to indicate that this is useful right now.
-
-When a chunk is full (or a flush() occurs), the memory backed by
-the chunk is registered with librdmacm and pinned in memory on
-both sides using the aforementioned protocol.
-After pinning, an RDMA Write is generated and transmitted
-for the entire chunk.
-
-Chunks are also transmitted in batches: This means that we
-do not request that the hardware signal the completion queue
-for the completion of *every* chunk. The current batch size
-is about 64 chunks (corresponding to 64 MB of memory).
-Only the last chunk in a batch must be signaled.
-This helps keep everything as asynchronous as possible
-and helps keep the hardware busy performing RDMA operations.
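The chunking and batching arithmetic above can be made concrete with a small Python sketch. It rests on assumptions: chunks are taken to be aligned to the start of each RAMBlock, the batch size is taken to be exactly 64 (the text says "about 64"), and the final chunk of a transfer is assumed to be signaled even when its batch is incomplete. All function names are hypothetical.

```python
CHUNK_SIZE = 1 << 20   # 1 Megabyte, hard-coded per the text
BATCH_SIZE = 64        # "about 64 chunks"; exact value is an assumption


def chunk_index(block_start: int, addr: int) -> int:
    """Index of the chunk within a RAMBlock that contains addr."""
    return (addr - block_start) // CHUNK_SIZE


def chunk_range(block_start: int, index: int, block_length: int):
    """Start and end address of a chunk, clipped to the end of the block."""
    start = block_start + index * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, block_start + block_length)
    return start, end


def signaled_flags(num_chunks: int):
    """True for each chunk whose RDMA Write requests a completion."""
    return [
        (i + 1) % BATCH_SIZE == 0 or i == num_chunks - 1
        for i in range(num_chunks)
    ]
```

For 130 chunks, only chunks 63, 127 and 129 would request a completion under these assumptions, so the completion queue is touched three times rather than 130.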
-
-Error-handling:
-===============
-
-Infiniband has what is called a "Reliable, Connected"
-link (one of 4 choices). This is the mode
-we use for RDMA migration.
-
-If a *single* message fails,
-the decision is to abort the migration entirely,
-clean up all the RDMA descriptors and unregister all
-the memory.
-
-After cleanup, the Virtual Machine is returned to normal
-operation the same way that would happen if the TCP
-socket were broken during a non-RDMA based migration.
-
-TODO:
-=====
-1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
- are not compatible with infiniband memory pinning and will result in
- an aborted migration (but with the source VM left unaffected).
-2. Use of the recent /proc/<pid>/pagemap would likely speed up
- the use of KSM and ballooning while using RDMA.
-3. Some form of balloon-device usage tracking would also
- help alleviate some issues.
-4. Use LRU to provide more fine-grained direction of UNREGISTER
- requests for unpinning memory in an overcommitted environment.
-5. Expose UNREGISTER support to the user by way of workload-specific
- hints about application behavior.