summaryrefslogtreecommitdiffstats
path: root/qemu/docs/specs
diff options
context:
space:
mode:
Diffstat (limited to 'qemu/docs/specs')
-rw-r--r--qemu/docs/specs/fw_cfg.txt158
-rw-r--r--qemu/docs/specs/ivshmem-spec.txt254
-rw-r--r--qemu/docs/specs/ivshmem_device_spec.txt96
-rw-r--r--qemu/docs/specs/parallels.txt228
-rw-r--r--qemu/docs/specs/pci-ids.txt24
-rw-r--r--qemu/docs/specs/ppc-spapr-hcalls.txt4
-rw-r--r--qemu/docs/specs/ppc-spapr-hotplug.txt48
-rw-r--r--qemu/docs/specs/qcow2.txt223
-rw-r--r--qemu/docs/specs/rocker.txt2
-rw-r--r--qemu/docs/specs/vhost-user.txt210
10 files changed, 1060 insertions, 187 deletions
diff --git a/qemu/docs/specs/fw_cfg.txt b/qemu/docs/specs/fw_cfg.txt
index 74351dd18..7a5f8c782 100644
--- a/qemu/docs/specs/fw_cfg.txt
+++ b/qemu/docs/specs/fw_cfg.txt
@@ -76,6 +76,22 @@ increasing address order, similar to memcpy().
Selector Register IOport: 0x510
Data Register IOport: 0x511
+DMA Address IOport: 0x514
+
+=== ARM Register Locations ===
+
+Selector Register address: Base + 8 (2 bytes)
+Data Register address: Base + 0 (8 bytes)
+DMA Address address: Base + 16 (8 bytes)
+
+== ACPI Interface ==
+
+The fw_cfg device is defined with ACPI ID "QEMU0002". Since we expect
+ACPI tables to be passed into the guest through the fw_cfg device itself,
+the guest-side firmware can not use ACPI to find fw_cfg. However, once the
+firmware is finished setting up ACPI tables and hands control over to the
+guest kernel, the latter can use the fw_cfg ACPI node for a more accurate
+inventory of in-use IOport or MMIO regions.
== Firmware Configuration Items ==
@@ -86,11 +102,15 @@ by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE),
and reading four bytes from the data register. If the fw_cfg device is
present, the four bytes read will contain the characters "QEMU".
-=== Revision (Key 0x0001, FW_CFG_ID) ===
+If the DMA interface is available, then reading the DMA Address
+Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format).
+
+=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) ===
-A 32-bit little-endian unsigned int, this item is used as an interface
-revision number, and is currently set to 1 by QEMU when fw_cfg is
-initialized.
+A 32-bit little-endian unsigned int, this item is used to check for enabled
+features.
+ - Bit 0: traditional interface. Always set.
+ - Bit 1: DMA interface.
=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) ===
@@ -132,79 +152,56 @@ Selector Reg. Range Usage
In practice, the number of allowed firmware configuration items is given
by the value of FW_CFG_MAX_ENTRY (see fw_cfg.h).
-= Host-side API =
-
-The following functions are available to the QEMU programmer for adding
-data to a fw_cfg device during guest initialization (see fw_cfg.h for
-each function's complete prototype):
-
-== fw_cfg_add_bytes() ==
-
-Given a selector key value, starting pointer, and size, create an item
-as a raw "blob" of the given size, available by selecting the given key.
-The data referenced by the starting pointer is only linked, NOT copied,
-into the data structure of the fw_cfg device.
-
-== fw_cfg_add_string() ==
+= Guest-side DMA Interface =
-Instead of a starting pointer and size, this function accepts a pointer
-to a NUL-terminated ascii string, and inserts a newly allocated copy of
-the string (including the NUL terminator) into the fw_cfg device data
-structure.
+If bit 1 of the feature bitmap is set, the DMA interface is present. This does
+not replace the existing fw_cfg interface, it is an add-on. This interface
+can be used through the 64-bit wide address register.
-== fw_cfg_add_iXX() ==
+The address register is in big-endian format. The value for the register is 0
+at startup and after an operation. A write to the least significant half (at
+offset 4) triggers an operation. This means that operations with 32-bit
+addresses can be triggered with just one write, whereas operations with
+64-bit addresses can be triggered with one 64-bit write or two 32-bit writes,
+starting with the most significant half (at offset 0).
-Insert an XX-bit item, where XX may be 16, 32, or 64. These functions
-will convert a 16-, 32-, or 64-bit integer to little-endian, then add
-a dynamically allocated copy of the appropriately sized item to fw_cfg
-under the given selector key value.
+In this register, the physical address of a FWCfgDmaAccess structure in RAM
+should be written. This is the format of the FWCfgDmaAccess structure:
-== fw_cfg_add_file() ==
+typedef struct FWCfgDmaAccess {
+ uint32_t control;
+ uint32_t length;
+ uint64_t address;
+} FWCfgDmaAccess;
-Given a filename (i.e., fw_cfg item name), starting pointer, and size,
-create an item as a raw "blob" of the given size. Unlike fw_cfg_add_bytes()
-above, the next available selector key (above 0x0020, FW_CFG_FILE_FIRST)
-will be used, and a new entry will be added to the file directory structure
-(at key 0x0019), containing the item name, blob size, and automatically
-assigned selector key value. The data referenced by the starting pointer
-is only linked, NOT copied, into the fw_cfg data structure.
+The fields of the structure are in big endian mode, and the field at the lowest
+address is the "control" field.
-== fw_cfg_add_file_callback() ==
+The "control" field has the following bits:
+ - Bit 0: Error
+ - Bit 1: Read
+ - Bit 2: Skip
+ - Bit 3: Select. The upper 16 bits are the selected index.
-Like fw_cfg_add_file(), but additionally sets pointers to a callback
-function (and opaque argument), which will be executed host-side by
-QEMU each time a byte is read by the guest from this particular item.
+When an operation is triggered, if the "control" field has bit 3 set, the
+upper 16 bits are interpreted as an index of a firmware configuration item.
+This has the same effect as writing the selector register.
-NOTE: The callback function is given the opaque argument set by
-fw_cfg_add_file_callback(), but also the current data offset,
-allowing it the option of only acting upon specific offset values
-(e.g., 0, before the first data byte of the selected item is
-returned to the guest).
+If the "control" field has bit 1 set, a read operation will be performed.
+"length" bytes for the current selector and offset will be copied into the
+physical RAM address specified by the "address" field.
-== fw_cfg_modify_file() ==
+If the "control" field has bit 2 set (and not bit 1), a skip operation will be
+performed. The offset for the current selector will be advanced "length" bytes.
-Given a filename (i.e., fw_cfg item name), starting pointer, and size,
-completely replace the configuration item referenced by the given item
-name with the new given blob. If an existing blob is found, its
-callback information is removed, and a pointer to the old data is
-returned to allow the caller to free it, helping avoid memory leaks.
-If a configuration item does not already exist under the given item
-name, a new item will be created as with fw_cfg_add_file(), and NULL
-is returned to the caller. In any case, the data referenced by the
-starting pointer is only linked, NOT copied, into the fw_cfg data
-structure.
+To check the result, read the "control" field:
+ error bit set -> something went wrong.
+ all bits cleared -> transfer finished successfully.
+ otherwise -> transfer still in progress (doesn't happen
+ today due to implementation not being async,
+ but may in the future).
-== fw_cfg_add_callback() ==
-
-Like fw_cfg_add_bytes(), but additionally sets pointers to a callback
-function (and opaque argument), which will be executed host-side by
-QEMU each time a guest-side write operation to this particular item
-completes fully overwriting the item's data.
-
-NOTE: This function is deprecated, and will be completely removed
-starting with QEMU v2.4.
-
-== Externally Provided Items ==
+= Externally Provided Items =
As of v2.4, "file" fw_cfg items (i.e., items with selector keys above
FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file
@@ -213,14 +210,27 @@ the following syntax:
-fw_cfg [name=]<item_name>,file=<path>
-where <item_name> is the fw_cfg item name, and <path> is the location
-on the host file system of a file containing the data to be inserted.
+Or
+
+ -fw_cfg [name=]<item_name>,string=<string>
+
+See QEMU man page for more documentation.
+
+Using item_name with plain ASCII characters only is recommended.
+
+Item names beginning with "opt/" are reserved for users. QEMU will
+never create entries with such names unless explicitly ordered by the
+user.
+
+To avoid clashes among different users, it is strongly recommended
+that you use names beginning with opt/RFQDN/, where RFQDN is a reverse
+fully qualified domain name you control. For instance, if SeaBIOS
+wanted to define additional names, the prefix "opt/org.seabios/" would
+be appropriate.
-NOTE: Users *SHOULD* choose item names beginning with the prefix "opt/"
-when using the "-fw_cfg" command line option, to avoid conflicting with
-item names used internally by QEMU. For instance:
+For historical reasons, "opt/ovmf/" is reserved for OVMF firmware.
- -fw_cfg name=opt/my_item_name,file=./my_blob.bin
+Prefix "opt/org.qemu/" is reserved for QEMU itself.
-Similarly, QEMU developers *SHOULD NOT* use item names prefixed with
-"opt/" when inserting items programmatically, e.g. via fw_cfg_add_file().
+Use of names not beginning with "opt/" is potentially dangerous and
+entirely unsupported. QEMU will warn if you try.
diff --git a/qemu/docs/specs/ivshmem-spec.txt b/qemu/docs/specs/ivshmem-spec.txt
new file mode 100644
index 000000000..a1f549979
--- /dev/null
+++ b/qemu/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,254 @@
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host. In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory: -device ivshmem-plain,memdev=HMB,...
+
+ This uses host memory backend HMB. It should have option "share"
+ set.
+
+- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
+
+ An ivshmem server must already be running on the host. The device
+ connects to the server's UNIX domain socket via character device
+ CHR.
+
+ Each peer gets assigned a unique ID by the server. IDs must be
+ between 0 and 65535.
+
+ Interrupts are message-signaled (MSI-X). vectors=N configures the
+ number of vectors to use.
+
+For more details on ivshmem device properties, see The QEMU Emulator
+User Documentation (qemu-doc.*).
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 1. Before
+QEMU 2.6.0, it had revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices. This way,
+ you have access to the shared memory in the guest and can use it as
+ you see fit. Memnic, for example, uses ivshmem this way from guest
+ user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+ other, you need BAR0 and BAR1. You will most likely want to write a
+ kernel driver to handle interrupts. Requires the device to be
+ configured for interrupts, obviously.
+
+Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
+configured for interrupts. It becomes safely accessible only after
+the ivshmem server provided the shared memory. These devices have PCI
+revision 0 rather than 1. Guest software should wait for the
+IVPosition register (described below) to become non-negative before
+accessing BAR2.
+
+Revision 0 of the device is not capable to tell guest software whether
+it is configured for interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+ Offset Size Access On reset Function
+ 0 4 read/write 0 Interrupt Mask
+ bit 0: peer interrupt (rev 0)
+ reserved (rev 1)
+ bit 1..31: reserved
+ 4 4 read/write 0 Interrupt Status
+ bit 0: peer interrupt (rev 0)
+ reserved (rev 1)
+ bit 1..31: reserved
+ 8 4 read-only 0 or ID IVPosition
+ 12 4 write-only N/A Doorbell
+ bit 0..15: vector
+ bit 16..31: peer ID
+ 16 240 none N/A reserved
+
+Software should only access the registers as specified in column
+"Access". Reserved bits should be ignored on read, and preserved on
+write.
+
+In revision 0 of the device, Interrupt Status and Mask Register
+together control the legacy INTx interrupt when the device has no
+MSI-X capability: INTx is asserted when the bit-wise AND of Status and
+Mask is non-zero and the device has no MSI-X capability. Interrupt
+Status Register bit 0 becomes 1 when an interrupt request from a peer
+is received. Reading the register clears it.
+
+IVPosition Register: if the device is not configured for interrupts,
+this is zero. Else, it is the device's ID (between 0 and 65535).
+
+Before QEMU 2.6.0, the register may read -1 for a short while after
+reset. These devices have PCI revision 0 rather than 1.
+
+There is no good way for software to find out whether the device is
+configured for interrupts. A positive IVPosition means interrupts,
+but zero could be either.
+
+Doorbell Register: writing this register requests to interrupt a peer.
+The written value's high 16 bits are the ID of the peer to interrupt,
+and its low 16 bits select an interrupt vector.
+
+If the device is not configured for interrupts, the write is ignored.
+
+If the interrupt hasn't completed setup, the write is ignored. The
+device is not capable to tell guest software whether setup is
+complete. Interrupts can regress to this state on migration.
+
+If the peer with the requested ID isn't connected, or it has fewer
+interrupt vectors connected, the write is ignored. The device is not
+capable to tell guest software what peers are connected, or how many
+interrupt vectors are connected.
+
+The peer's interrupt for this vector then becomes pending. There is
+no way for software to clear the pending bit, and a polling mode of
+operation is therefore impossible.
+
+If the peer is a revision 0 device without MSI-X capability, its
+Interrupt Status register is set to 1. This asserts INTx unless
+masked by the Interrupt Mask register. The device is not capable to
+communicate the interrupt vector to guest software then.
+
+With multiple MSI-X vectors, different vectors can be used to indicate
+different events have occurred. The semantics of interrupt vectors
+are left to the application.
+
+
+== Interrupt infrastructure ==
+
+When configured for interrupts, the peers share eventfd objects in
+addition to shared memory. The shared resources are managed by an
+ivshmem server.
+
+=== The ivshmem server ===
+
+The server listens on a UNIX domain socket.
+
+For each new client that connects to the server, the server
+- picks an ID,
+- creates eventfd file descriptors for the interrupt vectors,
+- sends the ID and the file descriptor for the shared memory to the
+ new client,
+- sends connect notifications for the new client to the other clients
+ (these contain file descriptors for sending interrupts),
+- sends connect notifications for the other clients to the new client,
+ and
+- sends interrupt setup messages to the new client (these contain file
+ descriptors for receiving interrupts).
+
+The first client to connect to the server receives ID zero.
+
+When a client disconnects from the server, the server sends disconnect
+notifications to the other clients.
+
+The next section describes the protocol in detail.
+
+If the server terminates without sending disconnect notifications for
+its connected clients, the clients can elect to continue. They can
+communicate with each other normally, but won't receive disconnect
+notification on disconnect, and no new clients can connect. There is
+no way for the clients to connect to a restarted server. The device
+is not capable to tell guest software whether the server is still up.
+
+Example server code is in contrib/ivshmem-server/. Not to be used in
+production. It assumes all clients use the same number of interrupt
+vectors.
+
+A standalone client is in contrib/ivshmem-client/. It can be useful
+for debugging.
+
+=== The ivshmem Client-Server Protocol ===
+
+An ivshmem device configured for interrupts connects to an ivshmem
+server. This section details the protocol between the two.
+
+The connection is one-way: the server sends messages to the client.
+Each message consists of a single 8 byte little-endian signed number,
+and may be accompanied by a file descriptor via SCM_RIGHTS. Both
+client and server close the connection on error.
+
+Note: QEMU currently doesn't close the connection right on error, but
+only when the character device is destroyed.
+
+On connect, the server sends the following messages in order:
+
+1. The protocol version number, currently zero. The client should
+ close the connection on receipt of versions it can't handle.
+
+2. The client's ID. This is unique among all clients of this server.
+ IDs must be between 0 and 65535, because the Doorbell register
+ provides only 16 bits for them.
+
+3. The number -1, accompanied by the file descriptor for the shared
+ memory.
+
+4. Connect notifications for existing other clients, if any. This is
+ a peer ID (number between 0 and 65535 other than the client's ID),
+ repeated N times. Each repetition is accompanied by one file
+ descriptor. These are for interrupting the peer with that ID using
+ vector 0,..,N-1, in order. If the client is configured for fewer
+ vectors, it closes the extra file descriptors. If it is configured
+ for more, the extra vectors remain unconnected.
+
+5. Interrupt setup. This is the client's own ID, repeated N times.
+ Each repetition is accompanied by one file descriptor. These are
+ for receiving interrupts from peers using vector 0,..,N-1, in
+ order. If the client is configured for fewer vectors, it closes
+ the extra file descriptors. If it is configured for more, the
+ extra vectors remain unconnected.
+
+From then on, the server sends these kinds of messages:
+
+6. Connection / disconnection notification. This is a peer ID.
+
+ - If the number comes with a file descriptor, it's a connection
+ notification, exactly like in step 4.
+
+ - Else, it's a disconnection notification for the peer with that ID.
+
+Known bugs:
+
+* The protocol changed incompatibly in QEMU 2.5. Before, messages
+ were native endian long, and there was no version number.
+
+* The protocol is poorly designed.
+
+=== The ivshmem Client-Client Protocol ===
+
+An ivshmem device configured for interrupts receives eventfd file
+descriptors for interrupting peers and getting interrupted by peers
+from the server, as explained in the previous section.
+
+To interrupt a peer, the device writes the 8-byte integer 1 in native
+byte order to the respective file descriptor.
+
+To receive an interrupt, the device reads and discards as many 8-byte
+integers as it can.
diff --git a/qemu/docs/specs/ivshmem_device_spec.txt b/qemu/docs/specs/ivshmem_device_spec.txt
deleted file mode 100644
index 667a8628f..000000000
--- a/qemu/docs/specs/ivshmem_device_spec.txt
+++ /dev/null
@@ -1,96 +0,0 @@
-
-Device Specification for Inter-VM shared memory device
-------------------------------------------------------
-
-The Inter-VM shared memory device is designed to share a region of memory to
-userspace in multiple virtual guests. The memory region does not belong to any
-guest, but is a POSIX memory object on the host. Optionally, the device may
-support sending interrupts to other guests sharing the same memory region.
-
-
-The Inter-VM PCI device
------------------------
-
-*BARs*
-
-The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
-registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
-used to map the shared memory object from the host. The size of BAR2 is
-specified when the guest is started and must be a power of 2 in size.
-
-*Registers*
-
-The device currently supports 4 registers of 32-bits each. Registers
-are used for synchronization between guests sharing the same memory object when
-interrupts are supported (this requires using the shared memory server).
-
-The server assigns each VM an ID number and sends this ID number to the QEMU
-process when the guest starts.
-
-enum ivshmem_registers {
- IntrMask = 0,
- IntrStatus = 4,
- IVPosition = 8,
- Doorbell = 12
-};
-
-The first two registers are the interrupt mask and status registers. Mask and
-status are only used with pin-based interrupts. They are unused with MSI
-interrupts.
-
-Status Register: The status register is set to 1 when an interrupt occurs.
-
-Mask Register: The mask register is bitwise ANDed with the interrupt status
-and the result will raise an interrupt if it is non-zero. However, since 1 is
-the only value the status will be set to, it is only the first bit of the mask
-that has any effect. Therefore interrupts can be masked by setting the first
-bit to 0 and unmasked by setting the first bit to 1.
-
-IVPosition Register: The IVPosition register is read-only and reports the
-guest's ID number. The guest IDs are non-negative integers. When using the
-server, since the server is a separate process, the VM ID will only be set when
-the device is ready (shared memory is received from the server and accessible via
-the device). If the device is not ready, the IVPosition will return -1.
-Applications should ensure that they have a valid VM ID before accessing the
-shared memory.
-
-Doorbell Register: To interrupt another guest, a guest must write to the
-Doorbell register. The doorbell register is 32-bits, logically divided into
-two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
-16-bits are the interrupt vector to trigger. The semantics of the value
-written to the doorbell depends on whether the device is using MSI or a regular
-pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
-status register.
-
-Regular Interrupts
-
-If regular interrupts are used (due to either a guest not supporting MSI or the
-user specifying not to use them on startup) then the value written to the lower
-16-bits of the Doorbell register results is arbitrary and will trigger an
-interrupt in the destination guest.
-
-Message Signalled Interrupts
-
-A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
-written to the Doorbell register must be between 0 and the maximum number of
-vectors the guest supports. The lower 16 bits written to the doorbell is the
-MSI vector that will be raised in the destination guest. The number of MSI
-vectors is configurable but it is set when the VM is started.
-
-The important thing to remember with MSI is that it is only a signal, no status
-is set (since MSI interrupts are not shared). All information other than the
-interrupt itself should be communicated via the shared memory region. Devices
-supporting multiple MSI vectors can use different vectors to indicate different
-events have occurred. The semantics of interrupt vectors are left to the
-user's discretion.
-
-
-Usage in the Guest
-------------------
-
-The shared memory device is intended to be used with the provided UIO driver.
-Very little configuration is needed. The guest should map BAR0 to access the
-registers (an array of 32-bit ints allows simple writing) and map BAR2 to
-access the shared memory region itself. The size of the shared memory region
-is specified when the guest (or shared memory server) is started. A guest may
-map the whole shared memory region or only part of it.
diff --git a/qemu/docs/specs/parallels.txt b/qemu/docs/specs/parallels.txt
new file mode 100644
index 000000000..b4fe2295f
--- /dev/null
+++ b/qemu/docs/specs/parallels.txt
@@ -0,0 +1,228 @@
+= License =
+
+Copyright (c) 2015 Denis Lunev
+Copyright (c) 2015 Vladimir Sementsov-Ogievskiy
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+= Parallels Expandable Image File Format =
+
+A Parallels expandable image file consists of three consecutive parts:
+ * header
+ * BAT
+ * data area
+
+All numbers in a Parallels expandable image are stored in little-endian byte
+order.
+
+
+== Definitions ==
+
+ Sector A 512-byte data chunk.
+
+ Cluster A data chunk of the size specified in the image header.
+ Currently, the default size is 1MiB (2048 sectors). In previous
+ versions, cluster sizes of 63 sectors, 256 and 252 kilobytes were
+ used.
+
+ BAT Block Allocation Table, an entity that contains information for
+ guest-to-host I/O data address translation.
+
+
+== Header ==
+
+The header is placed at the start of an image and contains the following
+fields:
+
+Bytes:
+ 0 - 15: magic
+ Must contain "WithoutFreeSpace" or "WithouFreSpacExt".
+
+ 16 - 19: version
+ Must be 2.
+
+ 20 - 23: heads
+ Disk geometry parameter for guest.
+
+ 24 - 27: cylinders
+ Disk geometry parameter for guest.
+
+ 28 - 31: tracks
+ Cluster size, in sectors.
+
+ 32 - 35: nb_bat_entries
+ Disk size, in clusters (BAT size).
+
+ 36 - 43: nb_sectors
+ Disk size, in sectors.
+
+ For "WithoutFreeSpace" images:
+ Only the lowest 4 bytes are used. The highest 4 bytes must be
+ cleared in this case.
+
+ For "WithouFreSpacExt" images, there are no such
+ restrictions.
+
+ 44 - 47: in_use
+ Set to 0x746F6E59 when the image is opened by software in R/W
+ mode; set to 0x312e3276 when the image is closed.
+
+ A zero in this field means that the image was opened by an old
+ version of the software that doesn't support Format Extension
+ (see below).
+
+ Other values are not allowed.
+
+ 48 - 51: data_off
+ An offset, in sectors, from the start of the file to the start of
+ the data area.
+
+ For "WithoutFreeSpace" images:
+ - If data_off is zero, the offset is calculated as the end of BAT
+ table plus some padding to ensure sector size alignment.
+ - If data_off is non-zero, the offset should be aligned to sector
+ size. However it is recommended to align it to cluster size for
+ newly created images.
+
+ For "WithouFreSpacExt" images:
+ data_off must be non-zero and aligned to cluster size.
+
+ 52 - 55: flags
+ Miscellaneous flags.
+
+ Bit 0: Empty Image bit. If set, the image should be
+ considered clear.
+
+ Bits 2-31: Unused.
+
+ 56 - 63: ext_off
+ Format Extension offset, an offset, in sectors, from the start of
+ the file to the start of the Format Extension Cluster.
+
+ ext_off must meet the same requirements as cluster offsets
+ defined by BAT entries (see below).
+
+
+== BAT ==
+
+BAT is placed immediately after the image header. In the file, BAT is a
+contiguous array of 32-bit unsigned little-endian integers with
+(bat_entries * 4) bytes size.
+
+Each BAT entry contains an offset from the start of the file to the
+corresponding cluster. The offset set in clusters for "WithouFreSpacExt" images
+and in sectors for "WithoutFreeSpace" images.
+
+If a BAT entry is zero, the corresponding cluster is not allocated and should
+be considered as filled with zeroes.
+
+Cluster offsets specified by BAT entries must meet the following requirements:
+ - the value must not be lower than data offset (provided by header.data_off
+ or calculated as specified above),
+ - the value must be lower than the desired file size,
+ - the value must be unique among all BAT entries,
+ - the result of (cluster offset - data offset) must be aligned to cluster
+ size.
+
+
+== Data Area ==
+
+The data area is an area from the data offset (provided by header.data_off or
+calculated as specified above) to the end of the file. It represents a
+contiguous array of clusters. Most of them are allocated by the BAT, some may
+be allocated by the ext_off field in the header while other may be allocated by
+extensions. All clusters allocated by ext_off and extensions should meet the
+same requirements as clusters specified by BAT entries.
+
+
+== Format Extension ==
+
+The Format Extension is an area 1 cluster in size that provides additional
+format features. This cluster is addressed by the ext_off field in the header.
+The format of the Format Extension area is the following:
+
+ 0 - 7: magic
+ Must be 0xAB234CEF23DCEA87
+
+ 8 - 23: m_CheckSum
+ The MD5 checksum of the entire Header Extension cluster except
+ the first 24 bytes.
+
+ The above are followed by feature sections or "extensions". The last
+ extension must be "End of features" (see below).
+
+Each feature section has the following format:
+
+ 0 - 7: magic
+ The identifier of the feature:
+ 0x0000000000000000 - End of features
+ 0x20385FAE252CB34A - Dirty bitmap
+
+ 8 - 15: flags
+ External flags for extension:
+
+ Bit 0: NECESSARY
+ If the software cannot load the extension (due to an
+ unknown magic number or error), the file should not be
+ changed. If this flag is unset and there is an error on
+ loading the extension, said extension should be dropped.
+
+ Bit 1: TRANSIT
+ If there is an unknown extension with this flag set,
+ said extension should be left as is.
+
+ If neither NECESSARY nor TRANSIT are set, the extension should be
+ dropped.
+
+ 16 - 19: data_size
+ The size of the following feature data, in bytes.
+
+ 20 - 23: unused32
+ Align header to 8 bytes boundary.
+
+ variable: data (data_size bytes)
+
+ The above is followed by padding to the next 8 bytes boundary, then the
+ next extension starts.
+
+ The last extension must be "End of features" with all the fields set to 0.
+
+
+=== Dirty bitmaps feature ===
+
+This feature provides a way of storing dirty bitmaps in the image. The fields
+of its data area are:
+
+ 0 - 7: size
+ The bitmap size, should be equal to disk size in sectors.
+
+ 8 - 23: id
+ An identifier for backup consistency checking.
+
+ 24 - 27: granularity
+ Bitmap granularity, in sectors. I.e., the number of sectors
+ corresponding to one bit of the bitmap. Granularity must be
+ a power of 2.
+
+ 28 - 31: l1_size
+ The number of entries in the L1 table of the bitmap.
+
+ variable: l1 (64 * l1_size bytes)
+ L1 offset table (in bytes)
+
+A dirty bitmap is stored using a one-level structure for the mapping to host
+clusters - an L1 table.
+
+Given an offset in bytes into the bitmap data, the offset in bytes into the
+image file can be obtained as follows:
+
+ offset = l1_table[offset / cluster_size] + (offset % cluster_size)
+
+If an L1 table entry is 0, the corresponding cluster of the bitmap is assumed
+to be zero.
+
+If an L1 table entry is 1, the corresponding cluster of the bitmap is assumed
+to have all bits set.
+
+If an L1 table entry is not 0 or 1, it allocates a cluster from the data area.
diff --git a/qemu/docs/specs/pci-ids.txt b/qemu/docs/specs/pci-ids.txt
index 0adcb89aa..fd27c677d 100644
--- a/qemu/docs/specs/pci-ids.txt
+++ b/qemu/docs/specs/pci-ids.txt
@@ -15,13 +15,23 @@ The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
Note that this allocation separate from the virtio device IDs, which are
maintained as part of the virtio specification.
-1af4:1000 network device
-1af4:1001 block device
-1af4:1002 balloon device
-1af4:1003 console device
-1af4:1004 SCSI host bus adapter device
-1af4:1005 entropy generator device
-1af4:1009 9p filesystem device
+1af4:1000 network device (legacy)
+1af4:1001 block device (legacy)
+1af4:1002 balloon device (legacy)
+1af4:1003 console device (legacy)
+1af4:1004 SCSI host bus adapter device (legacy)
+1af4:1005 entropy generator device (legacy)
+1af4:1009 9p filesystem device (legacy)
+
+1af4:1041 network device (modern)
+1af4:1042 block device (modern)
+1af4:1043 console device (modern)
+1af4:1044 entropy generator device (modern)
+1af4:1045 balloon device (modern)
+1af4:1048 SCSI host bus adapter device (modern)
+1af4:1049 9p filesystem device (modern)
+1af4:1050 virtio gpu device (modern)
+1af4:1052 virtio input device (modern)
1af4:10f0 Available for experimental usage without registration. Must get
to official ID when the code leaves the test lab (i.e. when seeking
diff --git a/qemu/docs/specs/ppc-spapr-hcalls.txt b/qemu/docs/specs/ppc-spapr-hcalls.txt
index 667b3fa00..5bd8eab78 100644
--- a/qemu/docs/specs/ppc-spapr-hcalls.txt
+++ b/qemu/docs/specs/ppc-spapr-hcalls.txt
@@ -41,8 +41,8 @@ When the guest runs in "real mode" (in powerpc lingua this means
with MMU disabled, ie guest effective == guest physical), it only
has access to a subset of memory and no IOs.
-PAPR provides a set of hypervisor calls to perform cachable or
-non-cachable accesses to any guest physical addresses that the
+PAPR provides a set of hypervisor calls to perform cacheable or
+non-cacheable accesses to any guest physical addresses that the
guest can use in order to access IO devices while in real mode.
This is typically used by the firmware running in the guest.
diff --git a/qemu/docs/specs/ppc-spapr-hotplug.txt b/qemu/docs/specs/ppc-spapr-hotplug.txt
index 46e07196b..631b0cada 100644
--- a/qemu/docs/specs/ppc-spapr-hotplug.txt
+++ b/qemu/docs/specs/ppc-spapr-hotplug.txt
@@ -302,4 +302,52 @@ consisting of <phys>, <size> and <maxcpus>.
pseries guests use this property to note the maximum allowed CPUs for the
guest.
+== ibm,dynamic-reconfiguration-memory ==
+
+ibm,dynamic-reconfiguration-memory is a device tree node that represents
+dynamically reconfigurable logical memory blocks (LMB). This node
+is generated only when the guest advertises the support for it via
+ibm,client-architecture-support call. Memory that is not dynamically
+reconfigurable is represented by /memory nodes. The properties of this
+node that are of interest to the sPAPR memory hotplug implementation
+in QEMU are described here.
+
+ibm,lmb-size
+
+This 64bit integer defines the size of each dynamically reconfigurable LMB.
+
+ibm,associativity-lookup-arrays
+
+This property defines a lookup array in which the NUMA associativity
+information for each LMB can be found. It is a property encoded array
+that begins with an integer M, the number of associativity lists followed
+by an integer N, the number of entries per associativity list and terminated
+by M associativity lists each of length N integers.
+
+This property provides the same information as given by ibm,associativity
+property in a /memory node. Each assigned LMB has an index value between
+0 and M-1 which is used as an index into this table to select which
+associativity list to use for the LMB. This index value for each LMB
+is defined in ibm,dynamic-memory property.
+
+ibm,dynamic-memory
+
+This property describes the dynamically reconfigurable memory. It is a
+property encoded array that has an integer N, the number of LMBs followed
+by N LMB list entires.
+
+Each LMB list entry consists of the following elements:
+
+- Logical address of the start of the LMB encoded as a 64bit integer. This
+ corresponds to reg property in /memory node.
+- DRC index of the LMB that corresponds to ibm,my-drc-index property
+ in a /memory node.
+- Four bytes reserved for expansion.
+- Associativity list index for the LMB that is used as an index into
+ ibm,associativity-lookup-arrays property described earlier. This
+ is used to retrieve the right associativity list to be used for this
+ LMB.
+- A 32bit flags word. The bit at bit position 0x00000008 defines whether
+ the LMB is assigned to the the partition as of boot time.
+
[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
diff --git a/qemu/docs/specs/qcow2.txt b/qemu/docs/specs/qcow2.txt
index 121dfc8cc..80cdfd0e9 100644
--- a/qemu/docs/specs/qcow2.txt
+++ b/qemu/docs/specs/qcow2.txt
@@ -103,7 +103,18 @@ in the description of a field.
write to an image with unknown auto-clear features if it
clears the respective bits from this field first.
- Bits 0-63: Reserved (set to 0)
+ Bit 0: Bitmaps extension bit
+ This bit indicates consistency for the bitmaps
+ extension data.
+
+ It is an error if this bit is set without the
+ bitmaps extension present.
+
+ If the bitmaps extension is present but this
+ bit is unset, the bitmaps extension data must be
+ considered inconsistent.
+
+ Bits 1-63: Reserved (set to 0)
96 - 99: refcount_order
Describes the width of a reference count block entry (width
@@ -123,6 +134,7 @@ be stored. Each extension has a structure like the following:
0x00000000 - End of the header extension area
0xE2792ACA - Backing file format name
0x6803f857 - Feature name table
+ 0x23852875 - Bitmaps extension
other - Unknown header extension, can be safely
ignored
@@ -166,6 +178,36 @@ the header extension data. Each entry look like this:
terminated if it has full length)
+== Bitmaps extension ==
+
+The bitmaps extension is an optional header extension. It provides the ability
+to store bitmaps related to a virtual disk. For now, there is only one bitmap
+type: the dirty tracking bitmap, which tracks virtual disk changes from some
+point in time.
+
+The data of the extension should be considered consistent only if the
+corresponding auto-clear feature bit is set, see autoclear_features above.
+
+The fields of the bitmaps extension are:
+
+ Byte 0 - 3: nb_bitmaps
+ The number of bitmaps contained in the image. Must be
+ greater than or equal to 1.
+
+ Note: Qemu currently only supports up to 65535 bitmaps per
+ image.
+
+ 4 - 7: Reserved, must be zero.
+
+ 8 - 15: bitmap_directory_size
+ Size of the bitmap directory in bytes. It is the cumulative
+ size of all (nb_bitmaps) bitmap headers.
+
+ 16 - 23: bitmap_directory_offset
+ Offset into the image file at which the bitmap directory
+ starts. Must be aligned to a cluster boundary.
+
+
== Host cluster management ==
qcow2 manages the allocation of host clusters by maintaining a reference count
@@ -257,7 +299,7 @@ L2 table entry:
63: 0 for a cluster that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
- in L2 tables that are reachable from the the active L1
+ in L2 tables that are reachable from the active L1
table.
Standard Cluster Descriptor:
@@ -360,3 +402,180 @@ Snapshot table entry:
variable: Padding to round up the snapshot table entry size to the
next multiple of 8.
+
+
+== Bitmaps ==
+
+As mentioned above, the bitmaps extension provides the ability to store bitmaps
+related to a virtual disk. This section describes how these bitmaps are stored.
+
+All stored bitmaps are related to the virtual disk stored in the same image, so
+each bitmap size is equal to the virtual disk size.
+
+Each bit of the bitmap is responsible for strictly defined range of the virtual
+disk. For bit number bit_nr the corresponding range (in bytes) will be:
+
+ [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1]
+
+Granularity is a property of the concrete bitmap, see below.
+
+
+=== Bitmap directory ===
+
+Each bitmap saved in the image is described in a bitmap directory entry. The
+bitmap directory is a contiguous area in the image file, whose starting offset
+and length are given by the header extension fields bitmap_directory_offset and
+bitmap_directory_size. The entries of the bitmap directory have variable
+length, depending on the lengths of the bitmap name and extra data. These
+entries are also called bitmap headers.
+
+Structure of a bitmap directory entry:
+
+ Byte 0 - 7: bitmap_table_offset
+ Offset into the image file at which the bitmap table
+ (described below) for the bitmap starts. Must be aligned to
+ a cluster boundary.
+
+ 8 - 11: bitmap_table_size
+ Number of entries in the bitmap table of the bitmap.
+
+ 12 - 15: flags
+ Bit
+ 0: in_use
+ The bitmap was not saved correctly and may be
+ inconsistent.
+
+ 1: auto
+ The bitmap must reflect all changes of the virtual
+ disk by any application that would write to this qcow2
+ file (including writes, snapshot switching, etc.). The
+ type of this bitmap must be 'dirty tracking bitmap'.
+
+ 2: extra_data_compatible
+ This flags is meaningful when the extra data is
+ unknown to the software (currently any extra data is
+ unknown to Qemu).
+ If it is set, the bitmap may be used as expected, extra
+ data must be left as is.
+ If it is not set, the bitmap must not be used, but
+ both it and its extra data be left as is.
+
+ Bits 3 - 31 are reserved and must be 0.
+
+ 16: type
+ This field describes the sort of the bitmap.
+ Values:
+ 1: Dirty tracking bitmap
+
+ Values 0, 2 - 255 are reserved.
+
+ 17: granularity_bits
+ Granularity bits. Valid values: 0 - 63.
+
+ Note: Qemu currently doesn't support granularity_bits
+ greater than 31.
+
+ Granularity is calculated as
+ granularity = 1 << granularity_bits
+
+ A bitmap's granularity is how many bytes of the image
+ accounts for one bit of the bitmap.
+
+ 18 - 19: name_size
+ Size of the bitmap name. Must be non-zero.
+
+ Note: Qemu currently doesn't support values greater than
+ 1023.
+
+ 20 - 23: extra_data_size
+ Size of type-specific extra data.
+
+ For now, as no extra data is defined, extra_data_size is
+ reserved and should be zero. If it is non-zero the
+ behavior is defined by extra_data_compatible flag.
+
+ variable: extra_data
+ Extra data for the bitmap, occupying extra_data_size bytes.
+ Extra data must never contain references to clusters or in
+ some other way allocate additional clusters.
+
+ variable: name
+ The name of the bitmap (not null terminated), occupying
+ name_size bytes. Must be unique among all bitmap names
+ within the bitmaps extension.
+
+ variable: Padding to round up the bitmap directory entry size to the
+ next multiple of 8. All bytes of the padding must be zero.
+
+
+=== Bitmap table ===
+
+Each bitmap is stored using a one-level structure (as opposed to two-level
+structures like for refcounts and guest clusters mapping) for the mapping of
+bitmap data to host clusters. This structure is called the bitmap table.
+
+Each bitmap table has a variable size (stored in the bitmap directory entry)
+and may use multiple clusters, however, it must be contiguous in the image
+file.
+
+Structure of a bitmap table entry:
+
+ Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero.
+ If bits 9 - 55 are zero:
+ 0: Cluster should be read as all zeros.
+ 1: Cluster should be read as all ones.
+
+ 1 - 8: Reserved and must be zero.
+
+ 9 - 55: Bits 9 - 55 of the host cluster offset. Must be aligned to
+ a cluster boundary. If the offset is 0, the cluster is
+ unallocated; in that case, bit 0 determines how this
+ cluster should be treated during reads.
+
+ 56 - 63: Reserved and must be zero.
+
+
+=== Bitmap data ===
+
+As noted above, bitmap data is stored in separate clusters, described by the
+bitmap table. Given an offset (in bytes) into the bitmap data, the offset into
+the image file can be obtained as follows:
+
+ image_offset(bitmap_data_offset) =
+ bitmap_table[bitmap_data_offset / cluster_size] +
+ (bitmap_data_offset % cluster_size)
+
+This offset is not defined if bits 9 - 55 of bitmap table entry are zero (see
+above).
+
+Given an offset byte_nr into the virtual disk and the bitmap's granularity, the
+bit offset into the image file to the corresponding bit of the bitmap can be
+calculated like this:
+
+ bit_offset(byte_nr) =
+ image_offset(byte_nr / granularity / 8) * 8 +
+ (byte_nr / granularity) % 8
+
+If the size of the bitmap data is not a multiple of the cluster size then the
+last cluster of the bitmap data contains some unused tail bits. These bits must
+be zero.
+
+
+=== Dirty tracking bitmaps ===
+
+Bitmaps with 'type' field equal to one are dirty tracking bitmaps.
+
+When the virtual disk is in use dirty tracking bitmap may be 'enabled' or
+'disabled'. While the bitmap is 'enabled', all writes to the virtual disk
+should be reflected in the bitmap. A set bit in the bitmap means that the
+corresponding range of the virtual disk (see above) was written to while the
+bitmap was 'enabled'. An unset bit means that this range was not written to.
+
+The software doesn't have to sync the bitmap in the image file with its
+representation in RAM after each write. Flag 'in_use' should be set while the
+bitmap is not synced.
+
+In the image file the 'enabled' state is reflected by the 'auto' flag. If this
+flag is set, the software must consider the bitmap as 'enabled' and start
+tracking virtual disk changes to this bitmap from the first write to the
+virtual disk. If this flag is not set then the bitmap is disabled.
diff --git a/qemu/docs/specs/rocker.txt b/qemu/docs/specs/rocker.txt
index 1c743515c..d2a82624f 100644
--- a/qemu/docs/specs/rocker.txt
+++ b/qemu/docs/specs/rocker.txt
@@ -297,7 +297,7 @@ but not fired. If only partial credits are returned, the interrupt remains
masked but the device generates an interrupt, signaling the driver that more
outstanding work is available.
-(* this masking is unrelated to to the MSI-X interrupt mask register)
+(* this masking is unrelated to the MSI-X interrupt mask register)
Endianness
----------
diff --git a/qemu/docs/specs/vhost-user.txt b/qemu/docs/specs/vhost-user.txt
index 650bb1818..777c49cfe 100644
--- a/qemu/docs/specs/vhost-user.txt
+++ b/qemu/docs/specs/vhost-user.txt
@@ -87,6 +87,14 @@ Depending on the request type, payload can be:
User address: a 64-bit user address
mmap offset: 64-bit offset where region starts in the mapped memory
+* Log description
+ ---------------------------
+ | log size | log offset |
+ ---------------------------
+ log size: size of area used for logging
+ log offset: offset from start of supplied file descriptor
+ where logging starts (i.e. where guest address 0 would be logged)
+
In QEMU the vhost-user message is implemented with the following struct:
typedef struct VhostUserMsg {
@@ -98,6 +106,7 @@ typedef struct VhostUserMsg {
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
+ VhostUserLog log;
};
} QEMU_PACKED VhostUserMsg;
@@ -113,12 +122,15 @@ message replies. Most of the requests don't require replies. Here is a list of
the ones that do:
* VHOST_GET_FEATURES
+ * VHOST_GET_PROTOCOL_FEATURES
* VHOST_GET_VRING_BASE
+ * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
There are several messages that the master sends with file descriptors passed
in the ancillary data:
* VHOST_SET_MEM_TABLE
+ * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
* VHOST_SET_LOG_FD
* VHOST_SET_VRING_KICK
* VHOST_SET_VRING_CALL
@@ -127,6 +139,122 @@ in the ancillary data:
If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.
+Any protocol extensions are gated by protocol feature bits,
+which allows full backwards compatibility on both master
+and slave.
+As older slaves don't support negotiating protocol features,
+a feature bit was dedicated for this purpose:
+#define VHOST_USER_F_PROTOCOL_FEATURES 30
+
+Starting and stopping rings
+----------------------
+Client must only process each ring when it is started.
+
+Client must only pass data between the ring and the
+backend, when the ring is enabled.
+
+If ring is started but disabled, client must process the
+ring without talking to the backend.
+
+For example, for a networking device, in the disabled state
+client must not supply any new RX packets, but must process
+and discard any TX packets.
+
+If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized
+in an enabled state.
+
+If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
+in a disabled state. Client must not pass data to/from the backend until ring is enabled by
+VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by
+VHOST_USER_SET_VRING_ENABLE with parameter 0.
+
+Each ring is initialized in a stopped state, client must not process it until
+ring is started, or after it has been stopped.
+
+Client must start ring upon receiving a kick (that is, detecting that file
+descriptor is readable) on the descriptor specified by
+VHOST_USER_SET_VRING_KICK, and stop ring upon receiving
+VHOST_USER_GET_VRING_BASE.
+
+While processing the rings (whether they are enabled or not), client must
+support changing some configuration aspects on the fly.
+
+Multiple queue support
+----------------------
+
+Multiple queue is treated as a protocol extension, hence the slave has to
+implement protocol features first. The multiple queues feature is supported
+only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.
+
+The max number of queues the slave supports can be queried with message
+VHOST_USER_GET_PROTOCOL_FEATURES. Master should stop when the number of
+requested queues is bigger than that.
+
+As all queues share one connection, the master uses a unique index for each
+queue in the sent message to identify a specified queue. One queue pair
+is enabled initially. More queues are enabled dynamically, by sending
+message VHOST_USER_SET_VRING_ENABLE.
+
+Migration
+---------
+
+During live migration, the master may need to track the modifications
+the slave makes to the memory mapped regions. The client should mark
+the dirty pages in a log. Once it complies to this logging, it may
+declare the VHOST_F_LOG_ALL vhost feature.
+
+To start/stop logging of data/used ring writes, server may send messages
+VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
+VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.
+
+All the modifications to memory pointed by vring "descriptor" should
+be marked. Modifications to "used" vring should be marked if
+VHOST_VRING_F_LOG is part of ring's flags.
+
+Dirty pages are of size:
+#define VHOST_LOG_PAGE 0x1000
+
+The log memory fd is provided in the ancillary data of
+VHOST_USER_SET_LOG_BASE message when the slave has
+VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.
+
+The size of the log is supplied as part of VhostUserMsg
+which should be large enough to cover all known guest
+addresses. Log starts at the supplied offset in the
+supplied file descriptor.
+The log covers from address 0 to the maximum of guest
+regions. In pseudo-code, to mark page at "addr" as dirty:
+
+page = addr / VHOST_LOG_PAGE
+log[page / 8] |= 1 << page % 8
+
+Where addr is the guest physical address.
+
+Use atomic operations, as the log may be concurrently manipulated.
+
+Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
+is set for this ring), log_guest_addr should be used to calculate the log
+offset: the write to first byte of the used ring is logged at this offset from
+log start. Also note that this value might be outside the legal guest physical
+address range (i.e. does not have to be covered by the VhostUserMemory table),
+but the bit offset of the last byte of the ring must fall within
+the size supplied by VhostUserLog.
+
+VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
+ancillary data, it may be used to inform the master that the log has
+been modified.
+
+Once the source has finished migration, rings will be stopped by
+the source. No further update must be done before rings are
+restarted.
+
+Protocol features
+-----------------
+
+#define VHOST_USER_PROTOCOL_F_MQ 0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
+#define VHOST_USER_PROTOCOL_F_RARP 2
+
Message types
-------------
@@ -138,6 +266,8 @@ Message types
Slave payload: u64
Get from the underlying vhost implementation the features bitmask.
+ Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
+ VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
* VHOST_USER_SET_FEATURES
@@ -146,6 +276,33 @@ Message types
Master payload: u64
Enable features in the underlying vhost implementation using a bitmask.
+ Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
+ VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
+
+ * VHOST_USER_GET_PROTOCOL_FEATURES
+
+ Id: 15
+ Equivalent ioctl: VHOST_GET_FEATURES
+ Master payload: N/A
+ Slave payload: u64
+
+ Get the protocol feature bitmask from the underlying vhost implementation.
+ Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
+ VHOST_USER_GET_FEATURES.
+ Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
+ this message even before VHOST_USER_SET_FEATURES was called.
+
+ * VHOST_USER_SET_PROTOCOL_FEATURES
+
+ Id: 16
+ Ioctl: VHOST_SET_FEATURES
+ Master payload: u64
+
+ Enable protocol features in the underlying vhost implementation.
+ Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
+ VHOST_USER_GET_FEATURES.
+ Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
+ this message even before VHOST_USER_SET_FEATURES was called.
* VHOST_USER_SET_OWNER
@@ -160,11 +317,13 @@ Message types
* VHOST_USER_RESET_OWNER
Id: 4
- Equivalent ioctl: VHOST_RESET_OWNER
Master payload: N/A
- Issued when a new connection is about to be closed. The Master will no
- longer own this connection (and will usually close it).
+ This is no longer used. Used to be sent to request disabling
+ all rings, but some clients interpreted it to also discard
+ connection state (this interpretation would lead to bugs).
+ It is recommended that clients either ignore this message,
+ or use it to disable all rings.
* VHOST_USER_SET_MEM_TABLE
@@ -182,8 +341,14 @@ Message types
Id: 6
Equivalent ioctl: VHOST_SET_LOG_BASE
Master payload: u64
+ Slave payload: N/A
+
+ Sets logging shared memory space.
+ When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
+ feature, the log memory fd is provided in the ancillary data of
+ VHOST_USER_SET_LOG_BASE message, the size and offset of shared
+ memory area provided in the message.
- Sets the logging base address.
* VHOST_USER_SET_LOG_FD
@@ -199,7 +364,7 @@ Message types
Equivalent ioctl: VHOST_SET_VRING_NUM
Master payload: vring state description
- Sets the number of vrings for this owner.
+ Set the size of the queue.
* VHOST_USER_SET_VRING_ADDR
@@ -264,3 +429,38 @@ Message types
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data.
+
+ * VHOST_USER_GET_QUEUE_NUM
+
+ Id: 17
+ Equivalent ioctl: N/A
+ Master payload: N/A
+ Slave payload: u64
+
+ Query how many queues the backend supports. This request should be
+ sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol
+ features by VHOST_USER_GET_PROTOCOL_FEATURES.
+
+ * VHOST_USER_SET_VRING_ENABLE
+
+ Id: 18
+ Equivalent ioctl: N/A
+ Master payload: vring state description
+
+ Signal slave to enable or disable corresponding vring.
+ This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
+ has been negotiated.
+
+ * VHOST_USER_SEND_RARP
+
+ Id: 19
+ Equivalent ioctl: N/A
+ Master payload: u64
+
+ Ask vhost user backend to broadcast a fake RARP to notify the migration
+ is terminated for guest that does not support GUEST_ANNOUNCE.
+ Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
+ VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
+ is present in VHOST_USER_GET_PROTOCOL_FEATURES.
+ The first 6 bytes of the payload contain the mac address of the guest to
+ allow the vhost user backend to construct and broadcast the fake RARP.