author | RajithaY <rajithax.yerrumsetty@intel.com> | 2017-04-25 03:31:15 -0700 |
---|---|---|
committer | Rajitha Yerrumchetty <rajithax.yerrumsetty@intel.com> | 2017-05-22 06:48:08 +0000 |
commit | bb756eebdac6fd24e8919e2c43f7d2c8c4091f59 (patch) | |
tree | ca11e03542edf2d8f631efeca5e1626d211107e3 /qemu/docs/rdma.txt | |
parent | a14b48d18a9ed03ec191cf16b162206998a895ce (diff) |
Adding qemu as a submodule of KVMFORNFV
This patch adds qemu as a submodule of the kvmfornfv repo and
uses the latest updated qemu for the execution of all
test cases.
Change-Id: I1280af507a857675c7f81d30c95255635667bdd7
Signed-off-by: RajithaY <rajithax.yerrumsetty@intel.com>
Diffstat (limited to 'qemu/docs/rdma.txt')
-rw-r--r-- | qemu/docs/rdma.txt | 420 |
1 files changed, 0 insertions, 420 deletions
diff --git a/qemu/docs/rdma.txt b/qemu/docs/rdma.txt
deleted file mode 100644
index 2bdd0a5be..000000000
--- a/qemu/docs/rdma.txt
+++ /dev/null
@@ -1,420 +0,0 @@
-(RDMA: Remote Direct Memory Access)
-RDMA Live Migration Specification, Version # 1
-==============================================
-Wiki: http://wiki.qemu-project.org/Features/RDMALiveMigration
-Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
-
-Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
-
-An *exhaustive* paper (2010) shows additional performance details
-linked on the QEMU wiki above.
-
-Contents:
-=========
-* Introduction
-* Before running
-* Running
-* Performance
-* RDMA Migration Protocol Description
-* Versioning and Capabilities
-* QEMUFileRDMA Interface
-* Migration of VM's ram
-* Error handling
-* TODO
-
-Introduction:
-=============
-
-RDMA helps make your migration more deterministic under heavy load because
-of the significantly lower latency and higher throughput over TCP/IP. This is
-because the RDMA I/O architecture reduces the number of interrupts and
-data copies by bypassing the host networking stack. In particular, a TCP-based
-migration, under certain types of memory-bound workloads, may take a more
-unpredictable amount of time to complete the migration if the amount of
-memory tracked during each live migration iteration round cannot keep pace
-with the rate of dirty memory produced by the workload.
-
-RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Converged Ethernet) as well as Infiniband-based. This implementation of
-migration using RDMA is capable of using both technologies because of
-the use of the OpenFabrics OFED software stack that abstracts out the
-programming model irrespective of the underlying hardware.
-
-Refer to openfabrics.org or your respective RDMA hardware vendor for
-an understanding of how to verify that you have the OFED software stack
-installed in your environment.
-You should be able to successfully link
-against the "librdmacm" and "libibverbs" libraries and development headers
-for a working build of QEMU to run successfully using RDMA Migration.
-
-BEFORE RUNNING:
-===============
-
-Use of RDMA during migration requires pinning and registering memory
-with the hardware. This means that memory must be physically resident
-before the hardware can transmit that memory to another machine.
-If this is not acceptable for your application or product, then the use
-of RDMA migration may in fact be harmful to co-located VMs or other
-software on the machine if there is not sufficient memory available to
-relocate the entire footprint of the virtual machine. If so, then the
-use of RDMA is discouraged and it is recommended to use standard TCP migration.
-
-Experimental: Next, decide if you want dynamic page registration.
-For example, if you have an 8GB RAM virtual machine, but only 1GB
-is in active use, then enabling this feature will cause all 8GB to
-be pinned and resident in memory. This feature mostly affects the
-bulk-phase round of the migration and can be enabled for extremely
-high-performance RDMA hardware using the following command:
-
-QEMU Monitor Command:
-$ migrate_set_capability rdma-pin-all on # disabled by default
-
-Performing this action will cause all 8GB to be pinned, so if that's
-not what you want, then please ignore this step altogether.
-
-On the other hand, this will also significantly speed up the bulk round
-of the migration, which can greatly reduce the "total" time of your migration.
-Example performance of this using an idle VM in the previous example
-can be found in the "Performance" section.
-
-Note: for very large virtual machines (hundreds of GBs), pinning
-*all* of the memory of your virtual machine in the kernel is very expensive
-and may extend the initial bulk iteration time by many seconds,
-thus extending the total migration time.
-However, this will not
-affect the determinism or predictability of your migration: you will
-still gain from the benefits of advanced pinning with RDMA.
-
-RUNNING:
-========
-
-First, set the migration speed to match your hardware's capabilities:
-
-QEMU Monitor Command:
-$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
-
-Next, on the destination machine, add the following to the QEMU command line:
-
-qemu ..... -incoming rdma:host:port
-
-Finally, perform the actual migration on the source machine:
-
-QEMU Monitor Command:
-$ migrate -d rdma:host:port
-
-PERFORMANCE
-===========
-
-Here is a brief summary of total migration time and downtime using RDMA:
-Using a 40gbps infiniband link performing a worst-case stress test,
-using an 8GB RAM virtual machine:
-
-Using the following command:
-$ apt-get install stress
-$ stress --vm-bytes 7500M --vm 1 --vm-keep
-
-1. Migration throughput: 26 gigabits/second.
-2. Downtime (stop time) varies between 15 and 100 milliseconds.
-
-EFFECTS of memory registration on bulk phase round:
-
-For example, in the same 8GB RAM example with all 8GB of memory in
-active use and the VM itself is completely idle using the same 40 gbps
-infiniband link:
-
-1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
-
-These numbers would of course scale up to whatever size virtual machine
-you have to migrate using RDMA.
-
-Enabling this feature does *not* have any measurable effect on
-migration *downtime*. This is because, without this feature, all of the
-memory will already have been registered in advance during
-the bulk round and does not need to be re-registered during the successive
-iteration rounds.
-
-RDMA Protocol Description:
-==========================
-
-Migration with RDMA is separated into two parts:
-
-1. The transmission of the pages using RDMA
-2. Everything else (a control channel is introduced)
-
-"Everything else" is transmitted using a formal
-protocol now, consisting of infiniband SEND messages.
-
-An infiniband SEND message is the standard ibverbs
-message used by applications of infiniband hardware.
-The only difference between a SEND message and an RDMA
-message is that SEND messages cause notifications
-to be posted to the completion queue (CQ) on the
-infiniband receiver side, whereas RDMA messages (used
-for VM's ram) do not (to behave like an actual DMA).
-
-Messages in infiniband require two things:
-
-1. registration of the memory that will be transmitted
-2. (SEND only) work requests to be posted on both
-   sides of the network before the actual transmission
-   can occur.
-
-RDMA messages are much easier to deal with. Once the memory
-on the receiver side is registered and pinned, we're
-basically done. All that is required is for the sender
-side to start dumping bytes onto the link.
-
-(Memory is not released from pinning until the migration
-completes, given that RDMA migrations are very fast.)
-
-SEND messages require more coordination because the
-receiver must have reserved space (using a receive
-work request) on the receive queue (RQ) before QEMUFileRDMA
-can start using them to carry all the bytes as
-a control transport for migration of device state.
-
-To begin the migration, the initial connection setup is
-as follows (migration-rdma.c):
-
-1. Receiver and Sender are started (command line or libvirt):
-2. Both sides post two RQ work requests
-3. Receiver does listen()
-4. Sender does connect()
-5. Receiver accept()
-6. Check versioning and capabilities (described later)
-
-At this point, we define a control channel on top of SEND messages
-which is described by a formal protocol. Each SEND message has a
-header portion and a data portion (but together are transmitted
-as a single SEND message).
-
-Header:
-    * Length (of the data portion, uint32, network byte order)
-    * Type (what command to perform, uint32, network byte order)
-    * Repeat (Number of commands in data portion, same type only)
-
-The 'Repeat' field is here to support future multiple page registrations
-in a single message without any need to change the protocol itself
-so that the protocol is compatible across multiple versions of QEMU.
-Version #1 requires that all server implementations of the protocol must
-check this field and register all requests found in the array of commands located
-in the data portion and return an equal number of results in the response.
-The maximum number of repeats is hard-coded to 4096. This is a conservative
-limit based on the maximum size of a SEND message along with empirical
-observations on the maximum future benefit of simultaneous page registrations.
-
-The 'type' field has 12 different command values:
-    1. Unused
-    2. Error (sent to the source during bad things)
-    3. Ready (control-channel is available)
-    4. QEMU File (for sending non-live device state)
-    5. RAM Blocks request (used right after connection setup)
-    6. RAM Blocks result (used right after connection setup)
-    7. Compress page (zap zero page and skip registration)
-    8. Register request (dynamic chunk registration)
-    9. Register result ('rkey' to be used by sender)
-    10. Register finished (registration for current iteration finished)
-    11. Unregister request (unpin previously registered memory)
-    12. Unregister finished (confirmation that unpin completed)
-
-A single control message, as hinted above, can contain within the data
-portion an array of many commands of the same type. If there is more than
-one command, then the 'repeat' field will be greater than 1.
-
-After connection setup, messages 5 & 6 are used to exchange ram block
-information and optionally pin all the memory if requested by the user.
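Editorial sketch (not part of the deleted file): the control-message header described above could be encoded roughly as follows. This is a minimal illustration in Python, not QEMU's implementation. It assumes that 'Repeat' is also a uint32 in network byte order (the text gives a width only for Length and Type), and it numbers the commands by their position in the list above, which may not match the values actually sent on the wire. The names `Command`, `pack_message` and `unpack_message` are hypothetical.

```python
import struct
from enum import IntEnum

# Length, Type, Repeat. Assumption: Repeat is a uint32 in network byte
# order like the other two fields; the text does not state its width.
HEADER = struct.Struct("!III")

# Hard-coded repeat limit stated in the specification text.
MAX_REPEAT = 4096


class Command(IntEnum):
    # Numbered by position in the list above, for illustration only;
    # the real wire values may differ.
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


def pack_message(cmd, payload=b"", repeat=1):
    """Build one control message: header followed by the data portion."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return HEADER.pack(len(payload), cmd, repeat) + payload


def unpack_message(message):
    """Split a control message into (command, repeat, payload)."""
    length, cmd, repeat = HEADER.unpack_from(message)
    if repeat > MAX_REPEAT:
        raise ValueError("repeat exceeds the hard-coded maximum")
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated data portion")
    return Command(cmd), repeat, payload
```

A receiver that follows the Version #1 rule would loop `repeat` times over equally sized entries in `payload` and return the same number of results.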
-
-After ram block exchange is completed, we have two protocol-level
-functions, responsible for communicating control-channel commands
-using the above list of values:
-
-Logically:
-
-qemu_rdma_exchange_recv(header, expected command type)
-
-1. We transmit a READY command to let the sender know that
-   we are *ready* to receive some data bytes on the control channel.
-2. Before attempting to receive the expected command, we post another
-   RQ work request to replace the one we just used up.
-3. Block on a CQ event channel and wait for the SEND to arrive.
-4. When the send arrives, librdmacm will unblock us.
-5. Verify that the command-type and version received matches the one we expected.
-
-qemu_rdma_exchange_send(header, data, optional response header & data):
-
-1. Block on the CQ event channel waiting for a READY command
-   from the receiver to tell us that the receiver
-   is *ready* for us to transmit some new bytes.
-2. Optionally: if we are expecting a response from the command
-   (that we have not yet transmitted), let's post an RQ
-   work request to receive that data a few moments later.
-3. When the READY arrives, librdmacm will
-   unblock us and we immediately post a RQ work request
-   to replace the one we just used up.
-4. Now, we can actually post the work request to SEND
-   the requested command type of the header we were asked for.
-5. Optionally, if we are expecting a response (as before),
-   we block again and wait for that response using the additional
-   work request we previously posted. (This is used to carry
-   'Register result' commands #9 back to the sender which
-   hold the rkey needed to perform RDMA. Note that the virtual address
-   corresponding to this rkey was already exchanged at the beginning
-   of the connection (described below).)
-
-All of the remaining command types (not including 'ready')
-described above all use the aforementioned two functions to do the hard work:
-
-1. After connection setup, RAMBlock information is exchanged using
-   this protocol before the actual migration begins. This information includes
-   a description of each RAMBlock on the server side as well as the virtual addresses
-   and lengths of each RAMBlock. This is used by the client to determine the
-   start and stop locations of chunks and how to register them dynamically
-   before performing the RDMA operations.
-2. During runtime, once a 'chunk' becomes full of pages ready to
-   be sent with RDMA, the registration commands are used to ask the
-   other side to register the memory for this chunk and respond
-   with the result (rkey) of the registration.
-3. Also, the QEMUFile interfaces call these functions (described below)
-   when transmitting non-live state, such as devices or to send
-   its own protocol information during the migration process.
-4. Finally, zero pages are only checked if a page has not yet been registered
-   using chunk registration (or not checked at all and unconditionally
-   written if chunk registration is disabled). This is accomplished using
-   the "Compress" command listed above. If the page *has* been registered
-   then we check the entire chunk for zero. Only if the entire chunk is
-   zero, then we send a compress command to zap the page on the other side.
-
-Versioning and Capabilities
-===========================
-Current version of the protocol is version #1.
-
-The same version applies to both protocol traffic and capabilities
-negotiation. (i.e. There is only one version number that is referred to
-by all communication).
-
-librdmacm provides the user with a 'private data' area to be exchanged
-at connection-setup time before any infiniband traffic is generated.
-
-Header:
-    * Version (protocol version validated before send/recv occurs),
-      uint32, network byte order
-    * Flags (bitwise OR of each capability),
-      uint32, network byte order
-
-There is no data portion of this header right now, so there is
-no length field.
-The maximum size of the 'private data' section
-is only 192 bytes per the Infiniband specification, so it's not
-very useful for data anyway. This structure needs to remain small.
-
-This private data area is a convenient place to check for protocol
-versioning because the user does not need to register memory to
-transmit a few bytes of version information.
-
-This is also a convenient place to negotiate capabilities
-(like dynamic page registration).
-
-If the version is invalid, we throw an error.
-
-If the version is new, we only negotiate the capabilities that the
-requested version is able to perform and ignore the rest.
-
-Currently there is only one capability in Version #1: dynamic page registration
-
-Finally: Negotiation happens with the Flags field: If the primary-VM
-sets a flag, but the destination does not support this capability, it
-will return a zero-bit for that flag and the primary-VM will understand
-that as not being an available capability and will thus disable that
-capability on the primary-VM side.
-
-QEMUFileRDMA Interface:
-=======================
-
-QEMUFileRDMA introduces a couple of new functions:
-
-1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
-
-These two functions are very short and simply use the protocol
-described above to deliver bytes without changing the upper-level
-users of QEMUFile that depend on a bytestream abstraction.
-
-Finally, how do we hand off the actual bytes to get_buffer()?
-
-Again, because we're trying to "fake" a bytestream abstraction
-using an analogy not unlike individual UDP frames, we have
-to hold on to the bytes received from control-channel's SEND
-messages in memory.
-
-Each time we receive a complete "QEMU File" control-channel
-message, the bytes from SEND are copied into a small local holding area.
-
-Then, we return the number of bytes requested by get_buffer()
-and leave the remaining bytes in the holding area until get_buffer()
-comes around for another pass.
-
-If the buffer is empty, then we follow the same steps
-listed above and issue another "QEMU File" protocol command,
-asking for a new SEND message to re-fill the buffer.
-
-Migration of VM's ram:
-====================
-
-At the beginning of the migration, (migration-rdma.c),
-the sender and the receiver populate the list of RAMBlocks
-to be registered with each other into a structure.
-Then, using the aforementioned protocol, they exchange a
-description of these blocks with each other, to be used later
-during the iteration of main memory. This description includes
-a list of all the RAMBlocks, their offsets and lengths, virtual
-addresses and possibly includes pre-registered RDMA keys in case dynamic
-page registration was disabled on the server-side, otherwise not.
-
-Main memory is not migrated with the aforementioned protocol,
-but is instead migrated with normal RDMA Write operations.
-
-Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
-Chunk size is not dynamic, but it could be in a future implementation.
-There's nothing to indicate that this is useful right now.
-
-When a chunk is full (or a flush() occurs), the memory backed by
-the chunk is registered with librdmacm and pinned in memory on
-both sides using the aforementioned protocol.
-After pinning, an RDMA Write is generated and transmitted
-for the entire chunk.
-
-Chunks are also transmitted in batches: This means that we
-do not request that the hardware signal the completion queue
-for the completion of *every* chunk. The current batch size
-is about 64 chunks (corresponding to 64 MB of memory).
-Only the last chunk in a batch must be signaled.
-This helps keep everything as asynchronous as possible
-and helps keep the hardware busy performing RDMA operations.
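Editorial sketch (not part of the deleted file): the chunking and batching rules above can be illustrated with a little arithmetic. This Python sketch is a hedged illustration, not QEMU's code. It treats the batch size as exactly 64 chunks (the text says "about 64"), and it assumes that the final chunk of a transfer is also signaled so that a partial batch can be completed, which the text does not state. The names `chunk_index`, `chunk_range` and `plan_writes` are hypothetical.

```python
CHUNK_SIZE = 1 << 20   # chunks are hard-coded to 1 megabyte per the text
BATCH_CHUNKS = 64      # assumption: exactly 64 ("about 64" in the text)


def chunk_index(block_offset):
    """Index of the chunk containing a byte offset within a RAMBlock."""
    return block_offset // CHUNK_SIZE


def chunk_range(index, block_length):
    """Byte range [start, end) of a chunk, clamped to the block length."""
    start = index * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, block_length)
    return start, end


def plan_writes(num_chunks):
    """Yield (chunk, signaled) pairs for a run of RDMA Writes.

    Only the last chunk of each batch requests a completion. Assumption:
    the very last chunk is signaled too, so a partial batch still ends
    with a completion event.
    """
    for i in range(num_chunks):
        last_in_batch = (i + 1) % BATCH_CHUNKS == 0
        last_overall = i == num_chunks - 1
        yield i, last_in_batch or last_overall
```

With these assumptions, 130 chunks would produce only three signaled writes (chunks 63, 127 and 129), so the sender waits on the completion queue three times rather than 130 times.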
-
-Error-handling:
-===============
-
-Infiniband has what is called a "Reliable, Connected"
-link (one of 4 choices). This is the mode
-we use for RDMA migration.
-
-If a *single* message fails,
-the decision is to abort the migration entirely and
-clean up all the RDMA descriptors and unregister all
-the memory.
-
-After cleanup, the Virtual Machine is returned to normal
-operation the same way that would happen if the TCP
-socket is broken during a non-RDMA based migration.
-
-TODO:
-=====
-1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
-   are not compatible with infiniband memory pinning and will result in
-   an aborted migration (but with the source VM left unaffected).
-2. Use of the recent /proc/<pid>/pagemap would likely speed up
-   the use of KSM and ballooning while using RDMA.
-3. Also, some form of balloon-device usage tracking would
-   help alleviate some issues.
-4. Use LRU to provide more fine-grained direction of UNREGISTER
-   requests for unpinning memory in an overcommitted environment.
-5. Expose UNREGISTER support to the user by way of workload-specific
-   hints about application behavior.