diff options
author | José Pekkarinen <jose.pekkarinen@nokia.com> | 2016-05-18 13:18:31 +0300 |
---|---|---|
committer | José Pekkarinen <jose.pekkarinen@nokia.com> | 2016-05-18 13:42:15 +0300 |
commit | 437fd90c0250dee670290f9b714253671a990160 (patch) | |
tree | b871786c360704244a07411c69fb58da9ead4a06 /qemu/docs | |
parent | 5bbd6fe9b8bab2a93e548c5a53b032d1939eec05 (diff) |
These changes are the raw update to qemu-2.6.
Collission happened in the following patches:
migration: do cleanup operation after completion(738df5b9)
Bug fix.(1750c932f86)
kvmclock: add a new function to update env->tsc.(b52baab2)
The code provided by the patches was already in the upstreamed
version.
Change-Id: I3cc11841a6a76ae20887b2e245710199e1ea7f9a
Signed-off-by: José Pekkarinen <jose.pekkarinen@nokia.com>
Diffstat (limited to 'qemu/docs')
30 files changed, 3452 insertions, 1000 deletions
diff --git a/qemu/docs/bitmaps.md b/qemu/docs/bitmaps.md index fa87f077f..a2e8d5116 100644 --- a/qemu/docs/bitmaps.md +++ b/qemu/docs/bitmaps.md @@ -19,12 +19,20 @@ which is included at the end of this document. * A dirty bitmap's name is unique to the node, but bitmaps attached to different nodes can share the same name. +* Dirty bitmaps created for internal use by QEMU may be anonymous and have no + name, but any user-created bitmaps may not be. There can be any number of + anonymous bitmaps per node. + +* The name of a user-created bitmap must not be empty (""). + ## Bitmap Modes * A Bitmap can be "frozen," which means that it is currently in-use by a backup operation and cannot be deleted, renamed, written to, reset, etc. +* The normal operating mode for a bitmap is "active." + ## Basic QMP Usage ### Supported Commands ### @@ -97,11 +105,7 @@ which is included at the end of this document. } ``` -## Transactions (Not yet implemented) - -* Transactional commands are forthcoming in a future version, - and are not yet available for use. This section serves as - documentation of intent for their design and usage. +## Transactions ### Justification @@ -323,6 +327,155 @@ full backup as a backing image. "event": "BLOCK_JOB_COMPLETED" } ``` +### Partial Transactional Failures + +* Sometimes, a transaction will succeed in launching and return success, + but then later the backup jobs themselves may fail. It is possible that + a management application may have to deal with a partial backup failure + after a successful transaction. + +* If multiple backup jobs are specified in a single transaction, when one of + them fails, it will not interact with the other backup jobs in any way. + +* The job(s) that succeeded will clear the dirty bitmap associated with the + operation, but the job(s) that failed will not. It is not "safe" to delete + any incremental backups that were created successfully in this scenario, + even though others failed. + +#### Example + +* QMP example highlighting two backup jobs: + + ```json + { "execute": "transaction", + "arguments": { + "actions": [ + { "type": "drive-backup", + "data": { "device": "drive0", "bitmap": "bitmap0", + "format": "qcow2", "mode": "existing", + "sync": "incremental", "target": "d0-incr-1.qcow2" } }, + { "type": "drive-backup", + "data": { "device": "drive1", "bitmap": "bitmap1", + "format": "qcow2", "mode": "existing", + "sync": "incremental", "target": "d1-incr-1.qcow2" } }, + ] + } + } + ``` + +* QMP example response, highlighting one success and one failure: + * Acknowledgement that the Transaction was accepted and jobs were launched: + ```json + { "return": {} } + ``` + + * Later, QEMU sends notice that the first job was completed: + ```json + { "timestamp": { "seconds": 1447192343, "microseconds": 615698 }, + "data": { "device": "drive0", "type": "backup", + "speed": 0, "len": 67108864, "offset": 67108864 }, + "event": "BLOCK_JOB_COMPLETED" + } + ``` + + * Later yet, QEMU sends notice that the second job has failed: + ```json + { "timestamp": { "seconds": 1447192399, "microseconds": 683015 }, + "data": { "device": "drive1", "action": "report", + "operation": "read" }, + "event": "BLOCK_JOB_ERROR" } + ``` + + ```json + { "timestamp": { "seconds": 1447192399, "microseconds": 685853 }, + "data": { "speed": 0, "offset": 0, "len": 67108864, + "error": "Input/output error", + "device": "drive1", "type": "backup" }, + "event": "BLOCK_JOB_COMPLETED" } + +* In the above example, "d0-incr-1.qcow2" is valid and must be kept, + but "d1-incr-1.qcow2" is invalid and should be deleted. If a VM-wide + incremental backup of all drives at a point-in-time is to be made, + new backups for both drives will need to be made, taking into account + that a new incremental backup for drive0 needs to be based on top of + "d0-incr-1.qcow2." + +### Grouped Completion Mode + +* While jobs launched by transactions normally complete or fail on their own, + it is possible to instruct them to complete or fail together as a group. + +* QMP transactions take an optional properties structure that can affect + the semantics of the transaction. + +* The "completion-mode" transaction property can be either "individual" + which is the default, legacy behavior described above, or "grouped," + a new behavior detailed below. + +* Delayed Completion: In grouped completion mode, no jobs will report + success until all jobs are ready to report success. + +* Grouped failure: If any job fails in grouped completion mode, all remaining + jobs will be cancelled. Any incremental backups will restore their dirty + bitmap objects as if no backup command was ever issued. + + * Regardless of if QEMU reports a particular incremental backup job as + CANCELLED or as an ERROR, the in-memory bitmap will be restored. + +#### Example + +* Here's the same example scenario from above with the new property: + + ```json + { "execute": "transaction", + "arguments": { + "actions": [ + { "type": "drive-backup", + "data": { "device": "drive0", "bitmap": "bitmap0", + "format": "qcow2", "mode": "existing", + "sync": "incremental", "target": "d0-incr-1.qcow2" } }, + { "type": "drive-backup", + "data": { "device": "drive1", "bitmap": "bitmap1", + "format": "qcow2", "mode": "existing", + "sync": "incremental", "target": "d1-incr-1.qcow2" } }, + ], + "properties": { + "completion-mode": "grouped" + } + } + } + ``` + +* QMP example response, highlighting a failure for drive2: + * Acknowledgement that the Transaction was accepted and jobs were launched: + ```json + { "return": {} } + ``` + + * Later, QEMU sends notice that the second job has errored out, + but that the first job was also cancelled: + ```json + { "timestamp": { "seconds": 1447193702, "microseconds": 632377 }, + "data": { "device": "drive1", "action": "report", + "operation": "read" }, + "event": "BLOCK_JOB_ERROR" } + ``` + + ```json + { "timestamp": { "seconds": 1447193702, "microseconds": 640074 }, + "data": { "speed": 0, "offset": 0, "len": 67108864, + "error": "Input/output error", + "device": "drive1", "type": "backup" }, + "event": "BLOCK_JOB_COMPLETED" } + ``` + + ```json + { "timestamp": { "seconds": 1447193702, "microseconds": 640163 }, + "data": { "device": "drive0", "type": "backup", "speed": 0, + "len": 67108864, "offset": 16777216 }, + "event": "BLOCK_JOB_CANCELLED" } + ``` + <!-- The FreeBSD Documentation License diff --git a/qemu/docs/blkdebug.txt b/qemu/docs/blkdebug.txt index b67a36d5c..43d8e8f9c 100644 --- a/qemu/docs/blkdebug.txt +++ b/qemu/docs/blkdebug.txt @@ -1,6 +1,6 @@ Block I/O error injection using blkdebug ---------------------------------------- -Copyright (C) 2014 Red Hat Inc +Copyright (C) 2014-2015 Red Hat Inc This work is licensed under the terms of the GNU GPL, version 2 or later. See the COPYING file in the top-level directory. @@ -92,8 +92,9 @@ The core events are: flush_to_disk - flush the host block device's disk cache -See block/blkdebug.c:event_names[] for the full list of events. You may need -to grep block driver source code to understand the meaning of specific events. +See qapi/block-core.json:BlkdebugEvent for the full list of events. +You may need to grep block driver source code to understand the +meaning of specific events. State transitions ----------------- diff --git a/qemu/docs/build-system.txt b/qemu/docs/build-system.txt new file mode 100644 index 000000000..5ddddeaaf --- /dev/null +++ b/qemu/docs/build-system.txt @@ -0,0 +1,507 @@ + The QEMU build system architecture + ================================== + +This document aims to help developers understand the architecture of the +QEMU build system. As with projects using GNU autotools, the QEMU build +system has two stages, first the developer runs the "configure" script +to determine the local build environment characteristics, then they run +"make" to build the project. There is about where the similarities with +GNU autotools end, so try to forget what you know about them. + + +Stage 1: configure +================== + +The QEMU configure script is written directly in shell, and should be +compatible with any POSIX shell, hence it uses #!/bin/sh. An important +implication of this is that it is important to avoid using bash-isms on +development platforms where bash is the primary host. + +In contrast to autoconf scripts, QEMU's configure is expected to be +silent while it is checking for features. It will only display output +when an error occurs, or to show the final feature enablement summary +on completion. + +Adding new checks to the configure script usually comprises the +following tasks: + + - Initialize one or more variables with the default feature state. + + Ideally features should auto-detect whether they are present, + so try to avoid hardcoding the initial state to either enabled + or disabled, as that forces the user to pass a --enable-XXX + / --disable-XXX flag on every invocation of configure. + + - Add support to the command line arg parser to handle any new + --enable-XXX / --disable-XXX flags required by the feature XXX. + + - Add information to the help output message to report on the new + feature flag. + + - Add code to perform the actual feature check. As noted above, try to + be fully dynamic in checking enablement/disablement. + + - Add code to print out the feature status in the configure summary + upon completion. + + - Add any new makefile variables to $config_host_mak on completion. + + +Taking (a simplified version of) the probe for gnutls from configure, +we have the following pieces: + + # Initial variable state + gnutls="" + + ..snip.. + + # Configure flag processing + --disable-gnutls) gnutls="no" + ;; + --enable-gnutls) gnutls="yes" + ;; + + ..snip.. + + # Help output feature message + gnutls GNUTLS cryptography support + + ..snip.. + + # Test for gnutls + if test "$gnutls" != "no"; then + if ! $pkg_config --exists "gnutls"; then + gnutls_cflags=`$pkg_config --cflags gnutls` + gnutls_libs=`$pkg_config --libs gnutls` + libs_softmmu="$gnutls_libs $libs_softmmu" + libs_tools="$gnutls_libs $libs_tools" + QEMU_CFLAGS="$QEMU_CFLAGS $gnutls_cflags" + gnutls="yes" + elif test "$gnutls" = "yes"; then + feature_not_found "gnutls" "Install gnutls devel" + else + gnutls="no" + fi + fi + + ..snip.. + + # Completion feature summary + echo "GNUTLS support $gnutls" + + ..snip.. + + # Define make variables + if test "$gnutls" = "yes" ; then + echo "CONFIG_GNUTLS=y" >> $config_host_mak + fi + + +Helper functions +---------------- + +The configure script provides a variety of helper functions to assist +developers in checking for system features: + + - do_cc $ARGS... + + Attempt to run the system C compiler passing it $ARGS... + + - do_cxx $ARGS... + + Attempt to run the system C++ compiler passing it $ARGS... + + - compile_object $CFLAGS + + Attempt to compile a test program with the system C compiler using + $CFLAGS. The test program must have been previously written to a file + called $TMPC. + + - compile_prog $CFLAGS $LDFLAGS + + Attempt to compile a test program with the system C compiler using + $CFLAGS and link it with the system linker using $LDFLAGS. The test + program must have been previously written to a file called $TMPC. + + - has $COMMAND + + Determine if $COMMAND exists in the current environment, either as a + shell builtin, or executable binary, returning 0 on success. + + - path_of $COMMAND + + Return the fully qualified path of $COMMAND, printing it to stdout, + and returning 0 on success. + + - check_define $NAME + + Determine if the macro $NAME is defined by the system C compiler + + - check_include $NAME + + Determine if the include $NAME file is available to the system C + compiler + + - write_c_skeleton + + Write a minimal C program main() function to the temporary file + indicated by $TMPC + + - feature_not_found $NAME $REMEDY + + Print a message to stderr that the feature $NAME was not available + on the system, suggesting the user try $REMEDY to address the + problem. + + - error_exit $MESSAGE $MORE... + + Print $MESSAGE to stderr, followed by $MORE... and then exit from the + configure script with non-zero status + + - query_pkg_config $ARGS... + + Run pkg-config passing it $ARGS. If QEMU is doing a static build, + then --static will be automatically added to $ARGS + + +Stage 2: makefiles +================== + +The use of GNU make is required with the QEMU build system. + +Although the source code is spread across multiple subdirectories, the +build system should be considered largely non-recursive in nature, in +contrast to common practices seen with automake. There is some recursive +invocation of make, but this is related to the things being built, +rather than the source directory structure. + +QEMU currently supports both VPATH and non-VPATH builds, so there are +three general ways to invoke configure & perform a build. + + - VPATH, build artifacts outside of QEMU source tree entirely + + cd ../ + mkdir build + cd build + ../qemu/configure + make + + - VPATH, build artifacts in a subdir of QEMU source tree + + mkdir build + cd build + ../configure + make + + - non-VPATH, build artifacts everywhere + + ./configure + make + +The QEMU maintainers generally recommend that a VPATH build is used by +developers. Patches to QEMU are expected to ensure VPATH build still +works. + + +Module structure +---------------- + +There are a number of key outputs of the QEMU build system: + + - Tools - qemu-img, qemu-nbd, qga (guest agent), etc + - System emulators - qemu-system-$ARCH + - Userspace emulators - qemu-$ARCH + - Unit tests + +The source code is highly modularized, split across many files to +facilitate building of all of these components with as little duplicated +compilation as possible. There can be considered to be two distinct +groups of files, those which are independent of the QEMU emulation +target and those which are dependent on the QEMU emulation target. + +In the target-independent set lives various general purpose helper code, +such as error handling infrastructure, standard data structures, +platform portability wrapper functions, etc. This code can be compiled +once only and the .o files linked into all output binaries. + +In the target-dependent set lives CPU emulation, device emulation and +much glue code. This sometimes also has to be compiled multiple times, +once for each target being built. + +The utility code that is used by all binaries is built into a +static archive called libqemuutil.a, which is then linked to all the +binaries. In order to provide hooks that are only needed by some of the +binaries, code in libqemuutil.a may depend on other functions that are +not fully implemented by all QEMU binaries. To deal with this there is a +second library called libqemustub.a which provides dummy stubs for all +these functions. These will get lazy linked into the binary if the real +implementation is not present. In this way, the libqemustub.a static +library can be thought of as a portable implementation of the weak +symbols concept. All binaries should link to both libqemuutil.a and +libqemustub.a. e.g. + + qemu-img$(EXESUF): qemu-img.o ..snip.. libqemuutil.a libqemustub.a + + +Windows platform portability +---------------------------- + +On Windows, all binaries have the suffix '.exe', so all Makefile rules +which create binaries must include the $(EXESUF) variable on the binary +name. e.g. + + qemu-img$(EXESUF): qemu-img.o ..snip.. + +This expands to '.exe' on Windows, or '' on other platforms. + +A further complication for the system emulator binaries is that +two separate binaries need to be generated. + +The main binary (e.g. qemu-system-x86_64.exe) is linked against the +Windows console runtime subsystem. These are expected to be run from a +command prompt window, and so will print stderr to the console that +launched them. + +The second binary generated has a 'w' on the end of its name (e.g. +qemu-system-x86_64w.exe) and is linked against the Windows graphical +runtime subsystem. These are expected to be run directly from the +desktop and will open up a dedicated console window for stderr output. + +The Makefile.target will generate the binary for the graphical subsystem +first, and then use objcopy to relink it against the console subsystem +to generate the second binary. + + +Object variable naming +---------------------- + +The QEMU convention is to define variables to list different groups of +object files. These are named with the convention $PREFIX-obj-y. For +example the libqemuutil.a file will be linked with all objects listed +in a variable 'util-obj-y'. So, for example, util/Makefile.obj will +contain a set of definitions looking like + + util-obj-y += bitmap.o bitops.o hbitmap.o + util-obj-y += fifo8.o + util-obj-y += acl.o + util-obj-y += error.o qemu-error.o + +When there is an object file which needs to be conditionally built based +on some characteristic of the host system, the configure script will +define a variable for the conditional. For example, on Windows it will +define $(CONFIG_POSIX) with a value of 'n' and $(CONFIG_WIN32) with a +value of 'y'. It is now possible to use the config variables when +listing object files. For example, + + util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o + util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o + +On Windows this expands to + + util-obj-y += oslib-win32.o qemu-thread-win32.o + util-obj-n += oslib-posix.o qemu-thread-posix.o + +Since libqemutil.a links in $(util-obj-y), the POSIX specific files +listed against $(util-obj-n) are ignored on the Windows platform builds. + + +CFLAGS / LDFLAGS / LIBS handling +-------------------------------- + +There are many different binaries being built with differing purposes, +and some of them might even be 3rd party libraries pulled in via git +submodules. As such the use of the global CFLAGS variable is generally +avoided in QEMU, since it would apply to too many build targets. + +Flags that are needed by any QEMU code (i.e. everything *except* GIT +submodule projects) are put in $(QEMU_CFLAGS) variable. For linker +flags the $(LIBS) variable is sometimes used, but a couple of more +targeted variables are preferred. $(libs_softmmu) is used for +libraries that must be linked to system emulator targets, $(LIBS_TOOLS) +is used for tools like qemu-img, qemu-nbd, etc and $(LIBS_QGA) is used +for the QEMU guest agent. There is currently no specific variable for +the userspace emulator targets as the global $(LIBS), or more targeted +variables shown below, are sufficient. + +In addition to these variables, it is possible to provide cflags and +libs against individual source code files, by defining variables of the +form $FILENAME-cflags and $FILENAME-libs. For example, the curl block +driver needs to link to the libcurl library, so block/Makefile defines +some variables: + + curl.o-cflags := $(CURL_CFLAGS) + curl.o-libs := $(CURL_LIBS) + +The scope is a little different between the two variables. The libs get +used when linking any target binary that includes the curl.o object +file, while the cflags get used when compiling the curl.c file only. + + +Statically defined files +------------------------ + +The following key files are statically defined in the source tree, with +the rules needed to build QEMU. Their behaviour is influenced by a +number of dynamically created files listed later. + +- Makefile + +The main entry point used when invoking make to build all the components +of QEMU. The default 'all' target will naturally result in the build of +every component. The various tools and helper binaries are built +directly via a non-recursive set of rules. + +Each system/userspace emulation target needs to have a slightly +different set of make rules / variables. Thus, make will be recursively +invoked for each of the emulation targets. + +The recursive invocation will end up processing the toplevel +Makefile.target file (more on that later). + + +- */Makefile.objs + +Since the source code is spread across multiple directories, the rules +for each file are similarly modularized. Thus each subdirectory +containing .c files will usually also contain a Makefile.objs file. +These files are not directly invoked by a recursive make, but instead +they are imported by the top level Makefile and/or Makefile.target + +Each Makefile.objs usually just declares a set of variables listing the +.o files that need building from the source files in the directory. They +will also define any custom linker or compiler flags. For example in +block/Makefile.objs + + block-obj-$(CONFIG_LIBISCSI) += iscsi.o + block-obj-$(CONFIG_CURL) += curl.o + + ..snip... + + iscsi.o-cflags := $(LIBISCSI_CFLAGS) + iscsi.o-libs := $(LIBISCSI_LIBS) + curl.o-cflags := $(CURL_CFLAGS) + curl.o-libs := $(CURL_LIBS) + +If there are any rules defined in the Makefile.objs file, they should +all use $(obj) as a prefix to the target, e.g. + + $(obj)/generated-tcg-tracers.h: $(obj)/generated-tcg-tracers.h-timestamp + + +- Makefile.target + +This file provides the entry point used to build each individual system +or userspace emulator target. Each enabled target has its own +subdirectory. For example if configure is run with the argument +'--target-list=x86_64-softmmu', then a sub-directory 'x86_64-softmu' +will be created, containing a 'Makefile' which symlinks back to +Makefile.target + +So when the recursive '$(MAKE) -C x86_64-softmmu' is invoked, it ends up +using Makefile.target for the build rules. + + +- rules.mak + +This file provides the generic helper rules for invoking build tools, in +particular the compiler and linker. This also contains the magic (hairy) +'unnest-vars' function which is used to merge the variable definitions +from all Makefile.objs in the source tree down into the main Makefile +context. + + +- default-configs/*.mak + +The files under default-configs/ control what emulated hardware is built +into each QEMU system and userspace emulator targets. They merely +contain a long list of config variable definitions. For example, +default-configs/x86_64-softmmu.mak has: + + include pci.mak + include sound.mak + include usb.mak + CONFIG_QXL=$(CONFIG_SPICE) + CONFIG_VGA_ISA=y + CONFIG_VGA_CIRRUS=y + CONFIG_VMWARE_VGA=y + CONFIG_VIRTIO_VGA=y + ...snip... + +These files rarely need changing unless new devices / hardware need to +be enabled for a particular system/userspace emulation target + + +- tests/Makefile + +Rules for building the unit tests. This file is included directly by the +top level Makefile, so anything defined in this file will influence the +entire build system. Care needs to be taken when writing rules for tests +to ensure they only apply to the unit test execution / build. + + +- po/Makefile + +Rules for building and installing the binary message catalogs from the +text .po file sources. This almost never needs changing for any reason. + + +Dynamically created files +------------------------- + +The following files are generated dynamically by configure in order to +control the behaviour of the statically defined makefiles. This avoids +the need for QEMU makefiles to go through any pre-processing as seen +with autotools, where Makefile.am generates Makefile.in which generates +Makefile. + + +- config-host.mak + +When configure has determined the characteristics of the build host it +will write a long list of variables to config-host.mak file. This +provides the various install directories, compiler / linker flags and a +variety of CONFIG_* variables related to optionally enabled features. +This is imported by the top level Makefile in order to tailor the build +output. + +The variables defined here are those which are applicable to all QEMU +build outputs. Variables which are potentially different for each +emulator target are defined by the next file... + +It is also used as a dependency checking mechanism. If make sees that +the modification timestamp on configure is newer than that on +config-host.mak, then configure will be re-run. + + +- config-host.h + +The config-host.h file is used by source code to determine what features +are enabled. It is generated from the contents of config-host.mak using +the scripts/create_config program. This extracts all the CONFIG_* variables, +most of the HOST_* variables and a few other misc variables from +config-host.mak, formatting them as C preprocessor macros. + + +- $TARGET-NAME/config-target.mak + +TARGET-NAME is the name of a system or userspace emulator, for example, +x86_64-softmmu denotes the system emulator for the x86_64 architecture. +This file contains the variables which need to vary on a per-target +basis. For example, it will indicate whether KVM or Xen are enabled for +the target and any other potential custom libraries needed for linking +the target. + + +- $TARGET-NAME/config-devices.mak + +TARGET-NAME is again the name of a system or userspace emulator. The +config-devices.mak file is automatically generated by make using the +scripts/make_device_config.sh program, feeding it the +default-configs/$TARGET-NAME file as input. + + +- $TARGET-NAME/Makefile + +This is the entrypoint used when make recurses to build a single system +or userspace emulator target. It is merely a symlink back to the +Makefile.target in the top level. diff --git a/qemu/docs/libcacard.txt b/qemu/docs/libcacard.txt deleted file mode 100644 index 8db421d3a..000000000 --- a/qemu/docs/libcacard.txt +++ /dev/null @@ -1,483 +0,0 @@ -This file documents the CAC (Common Access Card) library in the libcacard -subdirectory. - -Virtual Smart Card Emulator - -This emulator is designed to provide emulation of actual smart cards to a -virtual card reader running in a guest virtual machine. The emulated smart -cards can be representations of real smart cards, where the necessary functions -such as signing, card removal/insertion, etc. are mapped to real, physical -cards which are shared with the client machine the emulator is running on, or -the cards could be pure software constructs. - -The emulator is structured to allow multiple replaceable or additional pieces, -so it can be easily modified for future requirements. The primary envisioned -modifications are: - -1) The socket connection to the virtual card reader (presumably a CCID reader, -but other ISO-7816 compatible readers could be used). The code that handles -this is in vscclient.c. - -2) The virtual card low level emulation. This is currently supplied by using -NSS. This emulation could be replaced by implementations based on other -security libraries, including but not limitted to openssl+pkcs#11 library, -raw pkcs#11, Microsoft CAPI, direct opensc calls, etc. The code that handles -this is in vcard_emul_nss.c. - -3) Emulation for new types of cards. The current implementation emulates the -original DoD CAC standard with separate pki containers. This emulator lives in -cac.c. More than one card type emulator could be included. Other cards could -be emulated as well, including PIV, newer versions of CAC, PKCS #15, etc. - --------------------- -Replacing the Socket Based Virtual Reader Interface. - -The current implementation contains a replaceable module vscclient.c. The -current vscclient.c implements a sockets interface to the virtual ccid reader -on the guest. CCID commands that are pertinent to emulation are passed -across the socket, and their responses are passed back along that same socket. -The protocol that vscclient uses is defined in vscard_common.h and connects -to a qemu ccid usb device. Since this socket runs as a client, vscclient.c -implements a program with a main entry. It also handles argument parsing for -the emulator. - -An application that wants to use the virtual reader can replace vscclient.c -with its own implementation that connects to its own CCID reader. The calls -that the CCID reader can call are: - - VReaderList * vreader_get_reader_list(); - - This function returns a list of virtual readers. These readers may map to - physical devices, or simulated devices depending on vcard the back end. Each - reader in the list should represent a reader to the virtual machine. Virtual - USB address mapping is left to the CCID reader front end. This call can be - made any time to get an updated list. The returned list is a copy of the - internal list that can be referenced by the caller without locking. This copy - must be freed by the caller with vreader_list_delete when it is no longer - needed. - - VReaderListEntry *vreader_list_get_first(VReaderList *); - - This function gets the first entry on the reader list. Along with - vreader_list_get_next(), vreader_list_get_first() can be used to walk the - reader list returned from vreader_get_reader_list(). VReaderListEntries are - part of the list themselves and do not need to be freed separately from the - list. If there are no entries on the list, it will return NULL. - - VReaderListEntry *vreader_list_get_next(VReaderListEntry *); - - This function gets the next entry in the list. If there are no more entries - it will return NULL. - - VReader * vreader_list_get_reader(VReaderListEntry *) - - This function returns the reader stored in the reader List entry. Caller gets - a new reference to a reader. The caller must free its reference when it is - finished with vreader_free(). - - void vreader_free(VReader *reader); - - This function frees a reference to a reader. Readers are reference counted - and are automatically deleted when the last reference is freed. - - void vreader_list_delete(VReaderList *list); - - This function frees the list, all the elements on the list, and all the - reader references held by the list. - - VReaderStatus vreader_power_on(VReader *reader, char *atr, int *len); - - This function simulates a card power on. A virtual card does not care about - the actual voltage and other physical parameters, but it does care that the - card is actually on or off. Cycling the card causes the card to reset. If - the caller provides enough space, vreader_power_on will return the ATR of - the virtual card. The amount of space provided in atr should be indicated - in *len. The function modifies *len to be the actual length of of the - returned ATR. - - VReaderStatus vreader_power_off(VReader *reader); - - This function simulates a power off of a virtual card. - - VReaderStatus vreader_xfer_bytes(VReader *reader, unsigne char *send_buf, - int send_buf_len, - unsigned char *receive_buf, - int receive_buf_len); - - This function sends a raw apdu to a card and returns the card's response. - The CCID front end should return the response back. Most of the emulation - is driven from these APDUs. - - VReaderStatus vreader_card_is_present(VReader *reader); - - This function returns whether or not the reader has a card inserted. The - vreader_power_on, vreader_power_off, and vreader_xfer_bytes will return - VREADER_NO_CARD. - - const char *vreader_get_name(VReader *reader); - - This function returns the name of the reader. The name comes from the card - emulator level and is usually related to the name of the physical reader. - - VReaderID vreader_get_id(VReader *reader); - - This function returns the id of a reader. All readers start out with an id - of -1. The application can set the id with vreader_set_id. - - VReaderStatus vreader_get_id(VReader *reader, VReaderID id); - - This function sets the reader id. The application is responsible for making - sure that the id is unique for all readers it is actively using. - - VReader *vreader_find_reader_by_id(VReaderID id); - - This function returns the reader which matches the id. If two readers match, - only one is returned. The function returns NULL if the id is -1. - - Event *vevent_wait_next_vevent(); - - This function blocks waiting for reader and card insertion events. There - will be one event for each card insertion, each card removal, each reader - insertion and each reader removal. At start up, events are created for all - the initial readers found, as well as all the cards that are inserted. - - Event *vevent_get_next_vevent(); - - This function returns a pending event if it exists, otherwise it returns - NULL. It does not block. - ----------------- -Card Type Emulator: Adding a New Virtual Card Type - -The ISO 7816 card spec describes 2 types of cards: - 1) File system cards, where the smartcard is managed by reading and writing -data to files in a file system. There is currently only boiler plate -implemented for file system cards. - 2) VM cards, where the card has loadable applets which perform the card -functions. The current implementation supports VM cards. - -In the case of VM cards, the difference between various types of cards is -really what applets have been installed in that card. This structure is -mirrored in card type emulators. The 7816 emulator already handles the basic -ISO 7186 commands. Card type emulators simply need to add the virtual applets -which emulate the real card applets. Card type emulators have exactly one -public entry point: - - VCARDStatus xxx_card_init(VCard *card, const char *flags, - const unsigned char *cert[], - int cert_len[], - VCardKey *key[], - int cert_count); - - The parameters for this are: - card - the virtual card structure which will represent this card. - flags - option flags that may be specific to this card type. - cert - array of binary certificates. - cert_len - array of lengths of each of the certificates specified in cert. - key - array of opaque key structures representing the private keys on - the card. - cert_count - number of entries in cert, cert_len, and key arrays. - - Any cert, cert_len, or key with the same index are matching sets. That is - cert[0] is cert_len[0] long and has the corresponding private key of key[0]. - -The card type emulator is expected to own the VCardKeys, but it should copy -any raw cert data it wants to save. It can create new applets and add them to -the card using the following functions: - - VCardApplet *vcard_new_applet(VCardProcessAPDU apdu_func, - VCardResetApplet reset_func, - const unsigned char *aid, - int aid_len); - - This function creates a new applet. Applet structures store the following - information: - 1) the AID of the applet (set by aid and aid_len). - 2) a function to handle APDUs for this applet. (set by apdu_func, more on - this below). - 3) a function to reset the applet state when the applet is selected. - (set by reset_func, more on this below). - 3) applet private data, a data pointer used by the card type emulator to - store any data or state it needs to complete requests. (set by a - separate call). - 4) applet private data free, a function used to free the applet private - data when the applet itself is destroyed. - The created applet can be added to the card with vcard_add_applet below. - - void vcard_set_applet_private(VCardApplet *applet, - VCardAppletPrivate *private, - VCardAppletPrivateFree private_free); - This function sets the private data and the corresponding free function. - VCardAppletPrivate is an opaque data structure to the rest of the emulator. - The card type emulator can define it any way it wants by defining - struct VCardAppletPrivateStruct {};. If there is already a private data - structure on the applet, the old one is freed before the new one is set up. - passing two NULL clear any existing private data. - - VCardStatus vcard_add_applet(VCard *card, VCardApplet *applet); - - Add an applet onto the list of applets attached to the card. Once an applet - has been added, it can be selected by its AID, and then commands will be - routed to it VCardProcessAPDU function. This function adopts the applet that - is passed into it. Note: 2 applets with the same AID should not be added to - the same card. It is permissible to add more than one applet. Multiple applets - may have the same VCardPRocessAPDU entry point. - -The certs and keys should be attached to private data associated with one or -more appropriate applets for that card. Control will come to the card type -emulators once one of its applets are selected through the VCardProcessAPDU -function it specified when it created the applet. - -The signature of VCardResetApplet is: - VCardStatus (*VCardResetApplet) (VCard *card, int channel); - This function will reset the any internal applet state that needs to be - cleared after a select applet call. It should return VCARD_DONE; - -The signature of VCardProcessAPDU is: - VCardStatus (*VCardProcessAPDU)(VCard *card, VCardAPDU *apdu, - VCardResponse **response); - This function examines the APDU and determines whether it should process - the apdu directly, reject the apdu as invalid, or pass the apdu on to - the basic 7816 emulator for processing. - If the 7816 emulator should process the apdu, then the VCardProcessAPDU - should return VCARD_NEXT. - If there is an error, then VCardProcessAPDU should return an error - response using vcard_make_response and the appropriate 7816 error code - (see card_7816t.h) or vcard_make_response with a card type specific error - code. It should then return VCARD_DONE. - If the apdu can be processed correctly, VCardProcessAPDU should do so, - set the response value appropriately for that APDU, and return VCARD_DONE. - VCardProcessAPDU should always set the response if it returns VCARD_DONE. - It should always either return VCARD_DONE or VCARD_NEXT. - -Parsing the APDU -- - -Prior to processing calling the card type emulator's VCardProcessAPDU function, the emulator has already decoded the APDU header and set several fields: - - apdu->a_data - The raw apdu data bytes. - apdu->a_len - The len of the raw apdu data. - apdu->a_body - The start of any post header parameter data. - apdu->a_Lc - The parameter length value. - apdu->a_Le - The expected length of any returned data. - apdu->a_cla - The raw apdu class. - apdu->a_channel - The channel (decoded from the class). - apdu->a_secure_messaging_type - The decoded secure messaging type - (from class). - apdu->a_type - The decode class type. - apdu->a_gen_type - the generic class type (7816, PROPRIETARY, RFU, PTS). - apdu->a_ins - The instruction byte. - apdu->a_p1 - Parameter 1. - apdu->a_p2 - Parameter 2. - -Creating a Response -- - -The expected result of any APDU call is a response. The card type emulator must -set *response with an appropriate VCardResponse value if it returns VCARD_DONE. -Responses could be as simple as returning a 2 byte status word response, to as -complex as returning a block of data along with a 2 byte response. Which is -returned will depend on the semantics of the APDU. The following functions will -create card responses. - - VCardResponse *vcard_make_response(VCard7816Status status); - - This is the most basic function to get a response. This function will - return a response the consists solely one 2 byte status code. If that status - code is defined in card_7816t.h, then this function is guaranteed to - return a response with that status. If a cart type specific status code - is passed and vcard_make_response fails to allocate the appropriate memory - for that response, then vcard_make_response will return a VCardResponse - of VCARD7816_STATUS_EXC_ERROR_MEMORY. In any case, this function is - guaranteed to return a valid VCardResponse. - - VCardResponse *vcard_response_new(unsigned char *buf, int len, - VCard7816Status status); - - This function is similar to vcard_make_response except it includes some - returned data with the response. It could also fail to allocate enough - memory, in which case it will return NULL. - - VCardResponse *vcard_response_new_status_bytes(unsigned char sw1, - unsigned char sw2); - - Sometimes in 7816 the response bytes are treated as two separate bytes with - split meanings. This function allows you to create a response based on - two separate bytes. This function could fail, in which case it will return - NULL. - - VCardResponse *vcard_response_new_bytes(unsigned char *buf, int len, - unsigned char sw1, - unsigned char sw2); - - This function is the same as vcard_response_new except you may specify - the status as two separate bytes like vcard_response_new_status_bytes. - - -Implementing functionality --- - -The following helper functions access information about the current card -and applet. - - VCARDAppletPrivate *vcard_get_current_applet_private(VCard *card, - int channel); - - This function returns any private data set by the card type emulator on - the currently selected applet. The card type emulator keeps track of the - current applet state in this data structure. Any certs and keys associated - with a particular applet is also stored here. - - int vcard_emul_get_login_count(VCard *card); - - This function returns the the number of remaining login attempts for this - card. If the card emulator does not know, or the card does not have a - way of giving this information, this function returns -1. - - - VCard7816Status vcard_emul_login(VCard *card, unsigned char *pin, - int pin_len); - - This function logs into the card and returns the standard 7816 status - word depending on the success or failure of the call. - - void vcard_emul_delete_key(VCardKey *key); - - This function frees the VCardKey passed in to xxxx_card_init. The card - type emulator is responsible for freeing this key when it no longer needs - it. - - VCard7816Status vcard_emul_rsa_op(VCard *card, VCardKey *key, - unsigned char *buffer, - int buffer_size); - - This function does a raw rsa op on the buffer with the given key. - -The sample card type emulator is found in cac.c. It implements the cac specific -applets. Only those applets needed by the coolkey pkcs#11 driver on the guest -have been implemented. To support the full range CAC middleware, a complete CAC -card according to the CAC specs should be implemented here. - ------------------------------- -Virtual Card Emulator - -This code accesses both real smart cards and simulated smart cards through -services provided on the client. The current implementation uses NSS, which -already knows how to talk to various PKCS #11 modules on the client, and is -portable to most operating systems. A particular emulator can have only one -virtual card implementation at a time. - -The virtual card emulator consists of a series of virtual card services. In -addition to the services describe above (services starting with -vcard_emul_xxxx), the virtual card emulator also provides the following -functions: - - VCardEmulError vcard_emul_init(cont VCardEmulOptions *options); - - The options structure is built by another function in the virtual card - interface where a string of virtual card emulator specific strings are - mapped to the options. The actual structure is defined by the virtual card - emulator and is used to determine the configuration of soft cards, or to - determine which physical cards to present to the guest. - - The vcard_emul_init function will build up sets of readers, create any - threads that are needed to watch for changes in the reader state. If readers - have cards present in them, they are also initialized. - - Readers are created with the function. - - VReader *vreader_new(VReaderEmul *reader_emul, - VReaderEmulFree reader_emul_free); - - The freeFunc is used to free the VReaderEmul * when the reader is - destroyed. The VReaderEmul structure is an opaque structure to the - rest of the code, but defined by the virtual card emulator, which can - use it to store any reader specific state. - - Once the reader has been created, it can be added to the front end with the - call: - - VReaderStatus vreader_add_reader(VReader *reader); - - This function will automatically generate the appropriate new reader - events and add the reader to the list. - - To create a new card, the virtual card emulator will call a similar - function. - - VCard *vcard_new(VCardEmul *card_emul, - VCardEmulFree card_emul_free); - - Like vreader_new, this function takes a virtual card emulator specific - structure which it uses to keep track of the card state. - - Once the card is created, it is attached to a card type emulator with the - following function: - - VCardStatus vcard_init(VCard *vcard, VCardEmulType type, - const char *flags, - unsigned char *const *certs, - int *cert_len, - VCardKey *key[], - int cert_count); - - The vcard is the value returned from vcard_new. The type is the - card type emulator that this card should presented to the guest as. - The flags are card type emulator specific options. The certs, - cert_len, and keys are all arrays of length cert_count. These are the - the same of the parameters xxxx_card_init() accepts. - - Finally the card is associated with its reader by the call: - - VReaderStatus vreader_insert_card(VReader *vreader, VCard *vcard); - - This function, like vreader_add_reader, will take care of any event - notification for the card insert. - - - VCardEmulError vcard_emul_force_card_remove(VReader *vreader); - - Force a card that is present to appear to be removed to the guest, even if - that card is a physical card and is present. - - - VCardEmulError vcard_emul_force_card_insert(VReader *reader); - - Force a card that has been removed by vcard_emul_force_card_remove to be - reinserted from the point of view of the guest. This will only work if the - card is physically present (which is always true fro a soft card). - - void vcard_emul_get_atr(Vcard *card, unsigned char *atr, int *atr_len); - - Return the virtual ATR for the card. By convention this should be the value - VCARD_ATR_PREFIX(size) followed by several ascii bytes related to this - particular emulator. For instance the NSS emulator returns - {VCARD_ATR_PREFIX(3), 'N', 'S', 'S' }. Do ot return more data then *atr_len; - - void vcard_emul_reset(VCard *card, VCardPower power) - - Set the state of 'card' to the current power level and reset its internal - state (logout, etc). - -------------------------------------------------------- -List of files and their function: -README - This file -card_7816.c - emulate basic 7816 functionality. Parse APDUs. -card_7816.h - apdu and response services definitions. -card_7816t.h - 7816 specific structures, types and definitions. -event.c - event handling code. -event.h - event handling services definitions. -eventt.h - event handling structures and types -vcard.c - handle common virtual card services like creation, destruction, and - applet management. -vcard.h - common virtual card services function definitions. -vcardt.h - comon virtual card types -vreader.c - common virtual reader services. -vreader.h - common virtual reader services definitions. -vreadert.h - comon virtual reader types. -vcard_emul_type.c - manage the card type emulators. -vcard_emul_type.h - definitions for card type emulators. -cac.c - card type emulator for CAC cards -vcard_emul.h - virtual card emulator service definitions. -vcard_emul_nss.c - virtual card emulator implementation for nss. -vscclient.c - socket connection to guest qemu usb driver. -vscard_common.h - common header with the guest qemu usb driver. -mutex.h - header file for machine independent mutexes. -link_test.c - static test to make sure all the symbols are properly defined. diff --git a/qemu/docs/memory.txt b/qemu/docs/memory.txt index 2ceb34894..431d9ca88 100644 --- a/qemu/docs/memory.txt +++ b/qemu/docs/memory.txt @@ -26,14 +26,28 @@ These represent memory as seen from the CPU or a device's viewpoint. Types of regions ---------------- -There are four types of memory regions (all represented by a single C type +There are multiple types of memory regions (all represented by a single C type MemoryRegion): - RAM: a RAM region is simply a range of host memory that can be made available to the guest. + You typically initialize these with memory_region_init_ram(). Some special + purposes require the variants memory_region_init_resizeable_ram(), + memory_region_init_ram_from_file(), or memory_region_init_ram_ptr(). - MMIO: a range of guest memory that is implemented by host callbacks; each read or write causes a callback to be called on the host. + You initialize these with memory_region_init_io(), passing it a + MemoryRegionOps structure describing the callbacks. + +- ROM: a ROM memory region works like RAM for reads (directly accessing + a region of host memory), but like MMIO for writes (invoking a callback). + You initialize these with memory_region_init_rom_device(). + +- IOMMU region: an IOMMU region translates addresses of accesses made to it + and forwards them to some other target memory region. As the name suggests, + these are only needed for modelling an IOMMU, not for simple devices. + You initialize these with memory_region_init_iommu(). - container: a container simply includes other memory regions, each at a different offset. Containers are useful for grouping several regions @@ -45,12 +59,22 @@ MemoryRegion): can overlay a subregion of RAM with MMIO or ROM, or a PCI controller that does not prevent card from claiming overlapping BARs. + You initialize a pure container with memory_region_init(). + - alias: a subsection of another region. Aliases allow a region to be split apart into discontiguous regions. Examples of uses are memory banks used when the guest address space is smaller than the amount of RAM addressed, or a memory controller that splits main memory to expose a "PCI hole". Aliases may point to any type of region, including other aliases, but an alias may not point back to itself, directly or indirectly. + You initialize these with memory_region_init_alias(). + +- reservation region: a reservation region is primarily for debugging. + It claims I/O space that is not supposed to be handled by QEMU itself. + The typical use is to track parts of the address space which will be + handled by the host kernel when KVM is enabled. + You initialize these with memory_region_init_reservation(), or by + passing a NULL callback parameter to memory_region_init_io(). It is valid to add subregions to a region which is not a pure container (that is, to an MMIO, RAM or ROM region). This means that the region @@ -156,14 +180,14 @@ aliases that leave holes then the lower priority region will appear in these holes too.) For example, suppose we have a container A of size 0x8000 with two subregions -B and C. B is a container mapped at 0x2000, size 0x4000, priority 1; C is -an MMIO region mapped at 0x0, size 0x6000, priority 2. B currently has two +B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is +an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at offset 0x2000. As a diagram: - 0 1000 2000 3000 4000 5000 6000 7000 8000 - |------|------|------|------|------|------|------|-------| - A: [ ] + 0 1000 2000 3000 4000 5000 6000 7000 8000 + |------|------|------|------|------|------|------|------| + A: [ ] C: [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC] B: [ ] D: [DDDDD] @@ -223,7 +247,7 @@ system_memory: container@0-2^48-1 | +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff) | - +---- vga-window: alias@0xa0000-0xbfffff ---> #pci (0xa0000-0xbffff) + +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff) | (prio 1) | +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff) @@ -273,8 +297,9 @@ various constraints can be supplied to control how these callbacks are called: - .valid.min_access_size, .valid.max_access_size define the access sizes (in bytes) which the device accepts; accesses outside this range will have device and bus specific behaviour (ignored, or machine check) - - .valid.aligned specifies that the device only accepts naturally aligned - accesses. Unaligned accesses invoke device and bus specific behaviour. + - .valid.unaligned specifies that the *device being modelled* supports + unaligned accesses; if false, unaligned accesses will invoke the + appropriate bus or CPU specific behaviour. - .impl.min_access_size, .impl.max_access_size define the access sizes (in bytes) supported by the *implementation*; other access sizes will be emulated using the ones available. For example a 4-byte write will be @@ -282,5 +307,5 @@ various constraints can be supplied to control how these callbacks are called: - .impl.unaligned specifies that the *implementation* supports unaligned accesses; if false, unaligned accesses will be emulated by two aligned accesses. - - .old_mmio can be used to ease porting from code using + - .old_mmio eases the porting of code that was formerly using cpu_register_io_memory(). It should not be used in new code. diff --git a/qemu/docs/migration.txt b/qemu/docs/migration.txt index f6df4beb2..90209ab29 100644 --- a/qemu/docs/migration.txt +++ b/qemu/docs/migration.txt @@ -291,3 +291,194 @@ save/send this state when we are in the middle of a pio operation (that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is not enabled, the values on that fields are garbage and don't need to be sent. + += Return path = + +In most migration scenarios there is only a single data path that runs +from the source VM to the destination, typically along a single fd (although +possibly with another fd or similar for some fast way of throwing pages across). + +However, some uses need two way communication; in particular the Postcopy +destination needs to be able to request pages on demand from the source. + +For these scenarios there is a 'return path' from the destination to the source; +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return +path. + + Source side + Forward path - written by migration thread + Return path - opened by main thread, read by return-path thread + + Destination side + Forward path - read by main thread + Return path - opened by main thread, written by main thread AND postcopy + thread (protected by rp_mutex) + += Postcopy = +'Postcopy' migration is a way to deal with migrations that refuse to converge +(or take too long to converge) its plus side is that there is an upper bound on +the amount of migration traffic and time it takes, the down side is that during +the postcopy phase, a failure of *either* side or the network connection causes +the guest to be lost. + +In postcopy the destination CPUs are started before all the memory has been +transferred, and accesses to pages that are yet to be transferred cause +a fault that's translated by QEMU into a request to the source QEMU. + +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy +doesn't finish in a given time the switch is made to postcopy. + +=== Enabling postcopy === + +To enable postcopy, issue this command on the monitor prior to the +start of migration: + +migrate_set_capability postcopy-ram on + +The normal commands are then used to start a migration, which is still +started in precopy mode. Issuing: + +migrate_start_postcopy + +will now cause the transition from precopy to postcopy. +It can be issued immediately after migration is started or any +time later on. Issuing it after the end of a migration is harmless. + +Note: During the postcopy phase, the bandwidth limits set using +migrate_set_speed is ignored (to avoid delaying requested pages that +the destination is waiting for). + +=== Postcopy device transfer === + +Loading of device data may cause the device emulation to access guest RAM +that may trigger faults that have to be resolved by the source, as such +the migration stream has to be able to respond with page data *during* the +device load, and hence the device data has to be read from the stream completely +before the device load begins to free the stream up. This is achieved by +'packaging' the device data into a blob that's read in one go. + +Source behaviour + +Until postcopy is entered the migration stream is identical to normal +precopy, except for the addition of a 'postcopy advise' command at +the beginning, to tell the destination that postcopy might happen. +When postcopy starts the source sends the page discard data and then +forms the 'package' containing: + + Command: 'postcopy listen' + The device state + A series of sections, identical to the precopy streams device state stream + containing everything except postcopiable devices (i.e. RAM) + Command: 'postcopy run' + +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the +contents are formatted in the same way as the main migration stream. + +During postcopy the source scans the list of dirty pages and sends them +to the destination without being requested (in much the same way as precopy), +however when a page request is received from the destination, the dirty page +scanning restarts from the requested location. This causes requested pages +to be sent quickly, and also causes pages directly after the requested page +to be sent quickly in the hope that those pages are likely to be used +by the destination soon. + +Destination behaviour + +Initially the destination looks the same as precopy, with a single thread +reading the migration stream; the 'postcopy advise' and 'discard' commands +are processed to change the way RAM is managed, but don't affect the stream +processing. + +------------------------------------------------------------------------------ + 1 2 3 4 5 6 7 +main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN ) +thread | | + | (page request) + | \___ + v \ +listen thread: --- page -- page -- page -- page -- page -- + + a b c +------------------------------------------------------------------------------ + +On receipt of CMD_PACKAGED (1) + All the data associated with the package - the ( ... ) section in the +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread +recurses into qemu_loadvm_state_main to process the contents of the package (2) +which contains commands (3,6) and devices (4...) + +On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package) +a new thread (a) is started that takes over servicing the migration stream, +while the main thread carries on loading the package. It loads normal +background page data (b) but if during a device load a fault happens (5) the +returned page (c) is loaded by the listen thread allowing the main threads +device load to carry on. + +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination +CPUs start running. +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour +and is no longer used by migration, while the listen thread carries +on servicing page data until the end of migration. + +=== Postcopy states === + +Postcopy moves through a series of states (see postcopy_state) from +ADVISE->DISCARD->LISTEN->RUNNING->END + + Advise: Set at the start of migration if postcopy is enabled, even + if it hasn't had the start command; here the destination + checks that its OS has the support needed for postcopy, and performs + setup to ensure the RAM mappings are suitable for later postcopy. + The destination will fail early in migration at this point if the + required OS support is not present. + (Triggered by reception of POSTCOPY_ADVISE command) + + Discard: Entered on receipt of the first 'discard' command; prior to + the first Discard being performed, hugepages are switched off + (using madvise) to ensure that no new huge pages are created + during the postcopy phase, and to cause any huge pages that + have discards on them to be broken. + + Listen: The first command in the package, POSTCOPY_LISTEN, switches + the destination state to Listen, and starts a new thread + (the 'listen thread') which takes over the job of receiving + pages off the migration stream, while the main thread carries + on processing the blob. With this thread able to process page + reception, the destination now 'sensitises' the RAM to detect + any access to missing pages (on Linux using the 'userfault' + system). + + Running: POSTCOPY_RUN causes the destination to synchronise all + state and start the CPUs and IO devices running. The main + thread now finishes processing the migration package and + now carries on as it would for normal precopy migration + (although it can't do the cleanup it would do as it + finishes a normal migration). + + End: The listen thread can now quit, and perform the cleanup of migration + state, the migration is now complete. + +=== Source side page maps === + +The source side keeps two bitmaps during postcopy; 'the migration bitmap' +and 'unsent map'. The 'migration bitmap' is basically the same as in +the precopy case, and holds a bit to indicate that page is 'dirty' - +i.e. needs sending. During the precopy phase this is updated as the CPU +dirties pages, however during postcopy the CPUs are stopped and nothing +should dirty anything any more. + +The 'unsent map' is used for the transition to postcopy. It is a bitmap that +has a bit cleared whenever a page is sent to the destination, however during +the transition to postcopy mode it is combined with the migration bitmap +to form a set of pages that: + a) Have been sent but then redirtied (which must be discarded) + b) Have not yet been sent - which also must be discarded to cause any + transparent huge pages built during precopy to be broken. + +Note that the contents of the unsentmap are sacrificed during the calculation +of the discard set and thus aren't valid once in postcopy. The dirtymap +is still valid and is used to ensure that no page is sent more than once. Any +request for a page that has already been sent is ignored. Duplicate requests +such as this can happen as a page is sent at about the same time the +destination accesses it. + diff --git a/qemu/docs/multiseat.txt b/qemu/docs/multiseat.txt index ebf244693..807518c8a 100644 --- a/qemu/docs/multiseat.txt +++ b/qemu/docs/multiseat.txt @@ -135,7 +135,7 @@ configuration: TAG+="seat", ENV{ID_AUTOSEAT}="1" Patch with this rule has been submitted to upstream udev/systemd, was -accepted and and should be included in the next systemd release (222). +accepted and should be included in the next systemd release (222). So, if your guest has this or a newer version, multiseat will work just fine without any manual guest configuration. diff --git a/qemu/docs/pci_expander_bridge.txt b/qemu/docs/pci_expander_bridge.txt index d7913fb4a..36750273b 100644 --- a/qemu/docs/pci_expander_bridge.txt +++ b/qemu/docs/pci_expander_bridge.txt @@ -23,9 +23,9 @@ A detailed command line would be: -m 2G -object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 -numa node,nodeid=0,cpus=0,memdev=ram-node0 -object memory-backend-ram,size=1024M,policy=bind,host-nodes=1,id=ram-node1 -numa node,nodeid=1,cpus=1,memdev=ram-node1 --device pxb,id=bridge1,bus=pci.0,numa_node=1,bus_nr=4 -netdev user,id=nd-device e1000,bus=bridge1,addr=0x4,netdev=nd --device pxb,id=bridge2,bus=pci.0,numa_node=0,bus_nr=8,bus=pci.0 -device e1000,bus=bridge2,addr=0x3 --device pxb,id=bridge3,bus=pci.0,bus_nr=40,bus=pci.0 -drive if=none,id=drive0,file=[img] -device virtio-blk-pci,drive=drive0,scsi=off,bus=bridge3,addr=1 +-device pxb,id=bridge1,bus=pci.0,numa_node=1,bus_nr=4 -netdev user,id=nd -device e1000,bus=bridge1,addr=0x4,netdev=nd +-device pxb,id=bridge2,bus=pci.0,numa_node=0,bus_nr=8 -device e1000,bus=bridge2,addr=0x3 +-device pxb,id=bridge3,bus=pci.0,bus_nr=40 -drive if=none,id=drive0,file=[img] -device virtio-blk-pci,drive=drive0,scsi=off,bus=bridge3,addr=1 Here you have: - 2 NUMA nodes for the guest, 0 and 1. (both mapped to the same NUMA node in host, but you can and should put it in different host NUMA nodes) @@ -43,7 +43,7 @@ Implementation ============== The PXB is composed by: - HostBridge (TYPE_PXB_HOST) - The host bridge allows to register and query the PXB's rPCI root bus in QEMU. + The host bridge allows to register and query the PXB's PCI root bus in QEMU. - PXBDev(TYPE_PXB_DEVICE) It is a regular PCI Device that resides on the piix host-bridge bus and its bus uses the same PCI domain. However, the bus behind is exposed through ACPI as a primary PCI bus and starts a new PCI hierarchy. diff --git a/qemu/docs/qapi-code-gen.txt b/qemu/docs/qapi-code-gen.txt index 61b5be47f..0e4bafff0 100644 --- a/qemu/docs/qapi-code-gen.txt +++ b/qemu/docs/qapi-code-gen.txt @@ -1,7 +1,7 @@ = How to use the QAPI code generator = Copyright IBM Corp. 2011 -Copyright (C) 2012-2015 Red Hat, Inc. +Copyright (C) 2012-2016 Red Hat, Inc. This work is licensed under the terms of the GNU GPL, version 2 or later. See the COPYING file in the top-level directory. @@ -52,7 +52,7 @@ schema. The documentation is delimited between two lines of ##, then the first line names the expression, an optional overview is provided, then individual documentation about each member of 'data' is provided, and finally, a 'Since: x.y.z' tag lists the release that introduced -the expression. Optional fields are tagged with the phrase +the expression. Optional members are tagged with the phrase '#optional', often with their default value; and extensions added after the expression was first released are also given a '(since x.y.z)' comment. For example: @@ -106,27 +106,28 @@ Types, commands, and events share a common namespace. Therefore, generally speaking, type definitions should always use CamelCase for user-defined type names, while built-in types are lowercase. Type definitions should not end in 'Kind', as this namespace is used for -creating implicit C enums for visiting union types. Command names, -and field names within a type, should be all lower case with words +creating implicit C enums for visiting union types, or in 'List', as +this namespace is used for creating array types. Command names, +and member names within a type, should be all lower case with words separated by a hyphen. However, some existing older commands and complex types use underscore; when extending such expressions, consistency is preferred over blindly avoiding underscore. Event -names should be ALL_CAPS with words separated by underscore. The -special string '**' appears for some commands that manually perform -their own type checking rather than relying on the type-safe code -produced by the qapi code generators. +names should be ALL_CAPS with words separated by underscore. Member +names cannot start with 'has-' or 'has_', as this is reserved for +tracking optional members. -Any name (command, event, type, field, or enum value) beginning with +Any name (command, event, type, member, or enum value) beginning with "x-" is marked experimental, and may be withdrawn or changed -incompatibly in a future release. Downstream vendors may add -extensions; such extensions should begin with a prefix matching -"__RFQDN_" (for the reverse-fully-qualified-domain-name of the -vendor), even if the rest of the name uses dash (example: -__com.redhat_drive-mirror). Other than downstream extensions (with -leading underscore and the use of dots), all names should begin with a -letter, and contain only ASCII letters, digits, dash, and underscore. -It is okay to reuse names that match C keywords; the generator will -rename a field named "default" in the QAPI to "q_default" in the +incompatibly in a future release. All names must begin with a letter, +and contain only ASCII letters, digits, dash, and underscore. There +are two exceptions: enum values may start with a digit, and any +extensions added by downstream vendors should start with a prefix +matching "__RFQDN_" (for the reverse-fully-qualified-domain-name of +the vendor), even if the rest of the name uses dash (example: +__com.redhat_drive-mirror). Names beginning with 'q_' are reserved +for the generator: QMP names that resemble C keywords or other +problematic strings will be munged in C to use this prefix. For +example, a member named "default" in qapi becomes "q_default" in the generated C code. In the rest of this document, usage lines are given for each @@ -140,17 +141,26 @@ must have a value that forms a struct name. === Built-in Types === -The following types are built-in to the parser: - 'str' - arbitrary UTF-8 string - 'int' - 64-bit signed integer (although the C code may place further - restrictions on acceptable range) - 'number' - floating point number - 'bool' - JSON value of true or false - 'int8', 'int16', 'int32', 'int64' - like 'int', but enforce maximum - bit size - 'uint8', 'uint16', 'uint32', 'uint64' - unsigned counterparts - 'size' - like 'uint64', but allows scaled suffix from command line - visitor +The following types are predefined, and map to C as follows: + + Schema C JSON + str char * any JSON string, UTF-8 + number double any JSON number + int int64_t a JSON number without fractional part + that fits into the C integer type + int8 int8_t likewise + int16 int16_t likewise + int32 int32_t likewise + int64 int64_t likewise + uint8 uint8_t likewise + uint16 uint16_t likewise + uint32 uint32_t likewise + uint64 uint64_t likewise + size uint64_t like uint64_t, except StringInputVisitor + accepts size suffixes + bool bool JSON true or false + any QObject * any JSON value + QType QType JSON string matching enum QType values === Includes === @@ -163,7 +173,7 @@ The QAPI schema definitions can be modularized using the 'include' directive: The directive is evaluated recursively, and include paths are relative to the file using the directive. Multiple includes of the same file are -safe. No other keys should appear in the expression, and the include +idempotent. No other keys should appear in the expression, and the include value should be a string. As a matter of style, it is a good idea to have all files be @@ -177,11 +187,11 @@ prevent incomplete include files. Usage: { 'struct': STRING, 'data': DICT, '*base': STRUCT-NAME } -A struct is a dictionary containing a single 'data' key whose -value is a dictionary. This corresponds to a struct in C or an Object -in JSON. Each value of the 'data' dictionary must be the name of a -type, or a one-element array containing a type name. An example of a -struct is: +A struct is a dictionary containing a single 'data' key whose value is +a dictionary; the dictionary may be empty. This corresponds to a +struct in C or an Object in JSON. Each value of the 'data' dictionary +must be the name of a type, or a one-element array containing a type +name. An example of a struct is: { 'struct': 'MyType', 'data': { 'member1': 'str', 'member2': 'int', '*member3': 'str' } } @@ -207,17 +217,18 @@ and must continue to work). On output structures (only mentioned in the 'returns' side of a command), changing from mandatory to optional is in general unsafe (older clients may be -expecting the field, and could crash if it is missing), although it can be done -if the only way that the optional argument will be omitted is when it is -triggered by the presence of a new input flag to the command that older clients -don't know to send. Changing from optional to mandatory is safe. +expecting the member, and could crash if it is missing), although it +can be done if the only way that the optional argument will be omitted +is when it is triggered by the presence of a new input flag to the +command that older clients don't know to send. Changing from optional +to mandatory is safe. A structure that is used in both input and output of various commands must consider the backwards compatibility constraints of both directions of use. A struct definition can specify another struct as its base. -In this case, the fields of the base type are included as top-level fields +In this case, the members of the base type are included as top-level members of the new struct's dictionary in the Client JSON Protocol wire format. An example definition is: @@ -227,7 +238,7 @@ format. An example definition is: 'data': { '*backing': 'str' } } An example BlockdevOptionsGenericCOWFormat object on the wire could use -both fields like this: +both members like this: { "file": "/some/place/my-image", "backing": "/some/place/my-backing-file" } @@ -236,6 +247,7 @@ both fields like this: === Enumeration types === Usage: { 'enum': STRING, 'data': ARRAY-OF-STRING } + { 'enum': STRING, '*prefix': STRING, 'data': ARRAY-OF-STRING } An enumeration type is a dictionary containing a single 'data' key whose value is a list of strings. An example enumeration is: @@ -247,6 +259,13 @@ useful. The list of strings should be lower case; if an enum name represents multiple words, use '-' between words. The string 'max' is not allowed as an enum value, and values should not be repeated. +The enum constants will be named by using a heuristic to turn the +type name into a set of underscore separated words. For the example +above, 'MyEnum' will turn into 'MY_ENUM' giving a constant name +of 'MY_ENUM_VALUE1' for the first value. If the default heuristic +does not result in a desirable name, the optional 'prefix' member +can be used when defining the enum. + The enumeration values are passed as strings over the Client JSON Protocol, but are encoded as C enum integral values in generated code. While the C code starts numbering at 0, it is better to use explicit @@ -257,42 +276,43 @@ converting between strings and enum values. Since the wire format always passes by name, it is acceptable to reorder or add new enumeration members in any location without breaking clients of Client JSON Protocol; however, removing enum values would break -compatibility. For any struct that has a field that will only contain -a finite set of string values, using an enum type for that field is -better than open-coding the field to be type 'str'. +compatibility. For any struct that has a member that will only contain +a finite set of string values, using an enum type for that member is +better than open-coding the member to be type 'str'. === Union types === Usage: { 'union': STRING, 'data': DICT } -or: { 'union': STRING, 'data': DICT, 'base': STRUCT-NAME, +or: { 'union': STRING, 'data': DICT, 'base': STRUCT-NAME-OR-DICT, 'discriminator': ENUM-MEMBER-OF-BASE } Union types are used to let the user choose between several different variants for an object. There are two flavors: simple (no -discriminator or base), flat (both discriminator and base). A union +discriminator or base), and flat (both discriminator and base). A union type is defined using a data dictionary as explained in the following -paragraphs. +paragraphs. The data dictionary for either type of union must not +be empty. A simple union type defines a mapping from automatic discriminator values to data types like in this example: - { 'struct': 'FileOptions', 'data': { 'filename': 'str' } } - { 'struct': 'Qcow2Options', - 'data': { 'backing-file': 'str', 'lazy-refcounts': 'bool' } } + { 'struct': 'BlockdevOptionsFile', 'data': { 'filename': 'str' } } + { 'struct': 'BlockdevOptionsQcow2', + 'data': { 'backing': 'str', '*lazy-refcounts': 'bool' } } - { 'union': 'BlockdevOptions', - 'data': { 'file': 'FileOptions', - 'qcow2': 'Qcow2Options' } } + { 'union': 'BlockdevOptionsSimple', + 'data': { 'file': 'BlockdevOptionsFile', + 'qcow2': 'BlockdevOptionsQcow2' } } In the Client JSON Protocol, a simple union is represented by a -dictionary that contains the 'type' field as a discriminator, and a -'data' field that is of the specified data type corresponding to the +dictionary that contains the 'type' member as a discriminator, and a +'data' member that is of the specified data type corresponding to the discriminator value, as in these examples: - { "type": "file", "data" : { "filename": "/some/place/my-image" } } - { "type": "qcow2", "data" : { "backing-file": "/some/place/my-image", - "lazy-refcounts": true } } + { "type": "file", "data": { "filename": "/some/place/my-image" } } + { "type": "qcow2", "data": { "backing": "/some/place/my-image", + "lazy-refcounts": true } } The generated C code uses a struct containing a union. Additionally, an implicit C enum 'NameKind' is created, corresponding to the union @@ -300,43 +320,43 @@ an implicit C enum 'NameKind' is created, corresponding to the union the union can be named 'max', as this would collide with the implicit enum. The value for each branch can be of any type. - -A flat union definition specifies a struct as its base, and -avoids nesting on the wire. All branches of the union must be -complex types, and the top-level fields of the union dictionary on -the wire will be combination of fields from both the base type and the -appropriate branch type (when merging two dictionaries, there must be -no keys in common). The 'discriminator' field must be the name of an -enum-typed member of the base struct. +A flat union definition avoids nesting on the wire, and specifies a +set of common members that occur in all variants of the union. The +'base' key must specifiy either a type name (the type must be a +struct, not a union), or a dictionary representing an anonymous type. +All branches of the union must be complex types, and the top-level +members of the union dictionary on the wire will be combination of +members from both the base type and the appropriate branch type (when +merging two dictionaries, there must be no keys in common). The +'discriminator' member must be the name of a non-optional enum-typed +member of the base struct. The following example enhances the above simple union example by -adding a common field 'readonly', renaming the discriminator to -something more applicable, and reducing the number of {} required on -the wire: +adding an optional common member 'read-only', renaming the +discriminator to something more applicable than the simple union's +default of 'type', and reducing the number of {} required on the wire: - { 'enum': 'BlockdevDriver', 'data': [ 'raw', 'qcow2' ] } - { 'struct': 'BlockdevCommonOptions', - 'data': { 'driver': 'BlockdevDriver', 'readonly': 'bool' } } + { 'enum': 'BlockdevDriver', 'data': [ 'file', 'qcow2' ] } { 'union': 'BlockdevOptions', - 'base': 'BlockdevCommonOptions', + 'base': { 'driver': 'BlockdevDriver', '*read-only': 'bool' }, 'discriminator': 'driver', - 'data': { 'file': 'FileOptions', - 'qcow2': 'Qcow2Options' } } + 'data': { 'file': 'BlockdevOptionsFile', + 'qcow2': 'BlockdevOptionsQcow2' } } Resulting in these JSON objects: - { "driver": "file", "readonly": true, + { "driver": "file", "read-only": true, "filename": "/some/place/my-image" } - { "driver": "qcow2", "readonly": false, - "backing-file": "/some/place/my-image", "lazy-refcounts": true } + { "driver": "qcow2", "read-only": false, + "backing": "/some/place/my-image", "lazy-refcounts": true } Notice that in a flat union, the discriminator name is controlled by the user, but because it must map to a base member with enum type, the code generator can ensure that branches exist for all values of the enum (although the order of the keys need not match the declaration of the enum). In the resulting generated C data types, a flat union is -represented as a struct with the base member fields included directly, -and then a union of structures for each branch of the struct. +represented as a struct with the base members included directly, and +then a union of structures for each branch of the struct. A simple union can always be re-written as a flat union where the base class has a single member named 'type', and where each branch of the @@ -347,10 +367,9 @@ union has a struct with a single member named 'data'. That is, is identical on the wire to: { 'enum': 'Enum', 'data': ['one', 'two'] } - { 'struct': 'Base', 'data': { 'type': 'Enum' } } { 'struct': 'Branch1', 'data': { 'data': 'str' } } { 'struct': 'Branch2', 'data': { 'data': 'int' } } - { 'union': 'Flat': 'base': 'Base', 'discriminator': 'type', + { 'union': 'Flat': 'base': { 'type': 'Enum' }, 'discriminator': 'type', 'data': { 'one': 'Branch1', 'two': 'Branch2' } } @@ -363,13 +382,10 @@ data types (string, integer, number, or object, but currently not array) on the wire. The definition is similar to a simple union type, where each branch of the union names a QAPI type. For example: - { 'alternate': 'BlockRef', + { 'alternate': 'BlockdevRef', 'data': { 'definition': 'BlockdevOptions', 'reference': 'str' } } -Just like for a simple union, an implicit C enum 'NameKind' is created -to enumerate the branches for the alternate 'Name'. - Unlike a union, the discriminator string is never passed on the wire for the Client JSON Protocol. Instead, the value's JSON type serves as an implicit discriminator, which in turn means that an alternate @@ -387,14 +403,14 @@ following example objects: { "file": "my_existing_block_device_id" } { "file": { "driver": "file", - "readonly": false, + "read-only": false, "filename": "/tmp/mydisk.qcow2" } } === Commands === Usage: { 'command': STRING, '*data': COMPLEX-TYPE-NAME-OR-DICT, - '*returns': TYPE-NAME-OR-DICT, + '*returns': TYPE-NAME, '*gen': false, '*success-response': false } Commands are defined by using a dictionary containing several members, @@ -405,25 +421,23 @@ Client JSON Protocol command exchange. The 'data' argument maps to the "arguments" dictionary passed in as part of a Client JSON Protocol command. The 'data' member is optional and defaults to {} (an empty dictionary). If present, it must be the -string name of a complex type, a one-element array containing the name -of a complex type, or a dictionary that declares an anonymous type -with the same semantics as a 'struct' expression, with one exception -noted below when 'gen' is used. +string name of a complex type, or a dictionary that declares an +anonymous type with the same semantics as a 'struct' expression, with +one exception noted below when 'gen' is used. -The 'returns' member describes what will appear in the "return" field +The 'returns' member describes what will appear in the "return" member of a Client JSON Protocol reply on successful completion of a command. The member is optional from the command declaration; if absent, the -"return" field will be an empty dictionary. If 'returns' is present, +"return" member will be an empty dictionary. If 'returns' is present, it must be the string name of a complex or built-in type, a one-element array containing the name of a complex or built-in type, -or a dictionary that declares an anonymous type with the same -semantics as a 'struct' expression, with one exception noted below -when 'gen' is used. Although it is permitted to have the 'returns' -member name a built-in type or an array of built-in types, any command -that does this cannot be extended to return additional information in -the future; thus, new commands should strongly consider returning a -dictionary-based type or an array of dictionaries, even if the -dictionary only contains one field at the present. +with one exception noted below when 'gen' is used. Although it is +permitted to have the 'returns' member name a built-in type or an +array of built-in types, any command that does this cannot be extended +to return additional information in the future; thus, new commands +should strongly consider returning a dictionary-based type or an array +of dictionaries, even if the dictionary only contains one member at the +present. All commands in Client JSON Protocol use a dictionary to report failure, with no way to specify that in QAPI. Where the error return @@ -448,17 +462,14 @@ which would validate this Client JSON Protocol transaction: <= { "return": [ { "value": "one" }, { } ] } In rare cases, QAPI cannot express a type-safe representation of a -corresponding Client JSON Protocol command. In these cases, if the -command expression includes the key 'gen' with boolean value false, -then the 'data' or 'returns' member that intends to bypass generated -type-safety and do its own manual validation should use an inline -dictionary definition, with a value of '**' rather than a valid type -name for the keys that the generated code will not validate. Please -try to avoid adding new commands that rely on this, and instead use -type-safe unions. For an example of bypass usage: +corresponding Client JSON Protocol command. You then have to suppress +generation of a marshalling function by including a key 'gen' with +boolean value false, and instead write your own function. Please try +to avoid adding new commands that rely on this, and instead use +type-safe unions. For an example of this usage: { 'command': 'netdev_add', - 'data': {'type': 'str', 'id': 'str', '*props': '**'}, + 'data': {'type': 'str', 'id': 'str'}, 'gen': false } Normally, the QAPI schema is used to describe synchronous exchanges, @@ -468,7 +479,7 @@ response is not possible (although the command will still return a normal dictionary error on failure). When a successful reply is not possible, the command expression should include the optional key 'success-response' with boolean value false. So far, only QGA makes -use of this field. +use of this member. === Events === @@ -495,34 +506,255 @@ Resulting in this JSON object: "timestamp": { "seconds": 1267020223, "microseconds": 435656 } } +== Client JSON Protocol introspection == + +Clients of a Client JSON Protocol commonly need to figure out what +exactly the server (QEMU) supports. + +For this purpose, QMP provides introspection via command +query-qmp-schema. QGA currently doesn't support introspection. + +While Client JSON Protocol wire compatibility should be maintained +between qemu versions, we cannot make the same guarantees for +introspection stability. For example, one version of qemu may provide +a non-variant optional member of a struct, and a later version rework +the member to instead be non-optional and associated with a variant. +Likewise, one version of qemu may list a member with open-ended type +'str', and a later version could convert it to a finite set of strings +via an enum type; or a member may be converted from a specific type to +an alternate that represents a choice between the original type and +something else. + +query-qmp-schema returns a JSON array of SchemaInfo objects. These +objects together describe the wire ABI, as defined in the QAPI schema. +There is no specified order to the SchemaInfo objects returned; a +client must search for a particular name throughout the entire array +to learn more about that name, but is at least guaranteed that there +will be no collisions between type, command, and event names. + +However, the SchemaInfo can't reflect all the rules and restrictions +that apply to QMP. It's interface introspection (figuring out what's +there), not interface specification. The specification is in the QAPI +schema. To understand how QMP is to be used, you need to study the +QAPI schema. + +Like any other command, query-qmp-schema is itself defined in the QAPI +schema, along with the SchemaInfo type. This text attempts to give an +overview how things work. For details you need to consult the QAPI +schema. + +SchemaInfo objects have common members "name" and "meta-type", and +additional variant members depending on the value of meta-type. + +Each SchemaInfo object describes a wire ABI entity of a certain +meta-type: a command, event or one of several kinds of type. + +SchemaInfo for commands and events have the same name as in the QAPI +schema. + +Command and event names are part of the wire ABI, but type names are +not. Therefore, the SchemaInfo for types have auto-generated +meaningless names. For readability, the examples in this section use +meaningful type names instead. + +To examine a type, start with a command or event using it, then follow +references by name. + +QAPI schema definitions not reachable that way are omitted. + +The SchemaInfo for a command has meta-type "command", and variant +members "arg-type" and "ret-type". On the wire, the "arguments" +member of a client's "execute" command must conform to the object type +named by "arg-type". The "return" member that the server passes in a +success response conforms to the type named by "ret-type". + +If the command takes no arguments, "arg-type" names an object type +without members. Likewise, if the command returns nothing, "ret-type" +names an object type without members. + +Example: the SchemaInfo for command query-qmp-schema + + { "name": "query-qmp-schema", "meta-type": "command", + "arg-type": "q_empty", "ret-type": "SchemaInfoList" } + + Type "q_empty" is an automatic object type without members, and type + "SchemaInfoList" is the array of SchemaInfo type. + +The SchemaInfo for an event has meta-type "event", and variant member +"arg-type". On the wire, a "data" member that the server passes in an +event conforms to the object type named by "arg-type". + +If the event carries no additional information, "arg-type" names an +object type without members. The event may not have a data member on +the wire then. + +Each command or event defined with dictionary-valued 'data' in the +QAPI schema implicitly defines an object type. + +Example: the SchemaInfo for EVENT_C from section Events + + { "name": "EVENT_C", "meta-type": "event", + "arg-type": "q_obj-EVENT_C-arg" } + + Type "q_obj-EVENT_C-arg" is an implicitly defined object type with + the two members from the event's definition. + +The SchemaInfo for struct and union types has meta-type "object". + +The SchemaInfo for a struct type has variant member "members". + +The SchemaInfo for a union type additionally has variant members "tag" +and "variants". + +"members" is a JSON array describing the object's common members, if +any. Each element is a JSON object with members "name" (the member's +name), "type" (the name of its type), and optionally "default". The +member is optional if "default" is present. Currently, "default" can +only have value null. Other values are reserved for future +extensions. The "members" array is in no particular order; clients +must search the entire object when learning whether a particular +member is supported. + +Example: the SchemaInfo for MyType from section Struct types + + { "name": "MyType", "meta-type": "object", + "members": [ + { "name": "member1", "type": "str" }, + { "name": "member2", "type": "int" }, + { "name": "member3", "type": "str", "default": null } ] } + +"tag" is the name of the common member serving as type tag. +"variants" is a JSON array describing the object's variant members. +Each element is a JSON object with members "case" (the value of type +tag this element applies to) and "type" (the name of an object type +that provides the variant members for this type tag value). The +"variants" array is in no particular order, and is not guaranteed to +list cases in the same order as the corresponding "tag" enum type. + +Example: the SchemaInfo for flat union BlockdevOptions from section +Union types + + { "name": "BlockdevOptions", "meta-type": "object", + "members": [ + { "name": "driver", "type": "BlockdevDriver" }, + { "name": "read-only", "type": "bool", "default": null } ], + "tag": "driver", + "variants": [ + { "case": "file", "type": "BlockdevOptionsFile" }, + { "case": "qcow2", "type": "BlockdevOptionsQcow2" } ] } + +Note that base types are "flattened": its members are included in the +"members" array. + +A simple union implicitly defines an enumeration type for its implicit +discriminator (called "type" on the wire, see section Union types). + +A simple union implicitly defines an object type for each of its +variants. + +Example: the SchemaInfo for simple union BlockdevOptionsSimple from section +Union types + + { "name": "BlockdevOptionsSimple", "meta-type": "object", + "members": [ + { "name": "type", "type": "BlockdevOptionsSimpleKind" } ], + "tag": "type", + "variants": [ + { "case": "file", "type": "q_obj-BlockdevOptionsFile-wrapper" }, + { "case": "qcow2", "type": "q_obj-BlockdevOptionsQcow2-wrapper" } ] } + + Enumeration type "BlockdevOptionsSimpleKind" and the object types + "q_obj-BlockdevOptionsFile-wrapper", "q_obj-BlockdevOptionsQcow2-wrapper" + are implicitly defined. + +The SchemaInfo for an alternate type has meta-type "alternate", and +variant member "members". "members" is a JSON array. Each element is +a JSON object with member "type", which names a type. Values of the +alternate type conform to exactly one of its member types. There is +no guarantee on the order in which "members" will be listed. + +Example: the SchemaInfo for BlockdevRef from section Alternate types + + { "name": "BlockdevRef", "meta-type": "alternate", + "members": [ + { "type": "BlockdevOptions" }, + { "type": "str" } ] } + +The SchemaInfo for an array type has meta-type "array", and variant +member "element-type", which names the array's element type. Array +types are implicitly defined. For convenience, the array's name may +resemble the element type; however, clients should examine member +"element-type" instead of making assumptions based on parsing member +"name". + +Example: the SchemaInfo for ['str'] + + { "name": "[str]", "meta-type": "array", + "element-type": "str" } + +The SchemaInfo for an enumeration type has meta-type "enum" and +variant member "values". The values are listed in no particular +order; clients must search the entire enum when learning whether a +particular value is supported. + +Example: the SchemaInfo for MyEnum from section Enumeration types + + { "name": "MyEnum", "meta-type": "enum", + "values": [ "value1", "value2", "value3" ] } + +The SchemaInfo for a built-in type has the same name as the type in +the QAPI schema (see section Built-in Types), with one exception +detailed below. It has variant member "json-type" that shows how +values of this type are encoded on the wire. + +Example: the SchemaInfo for str + + { "name": "str", "meta-type": "builtin", "json-type": "string" } + +The QAPI schema supports a number of integer types that only differ in +how they map to C. They are identical as far as SchemaInfo is +concerned. Therefore, they get all mapped to a single type "int" in +SchemaInfo. + +As explained above, type names are not part of the wire ABI. Not even +the names of built-in types. Clients should examine member +"json-type" instead of hard-coding names of built-in types. + + == Code generation == -Schemas are fed into 3 scripts to generate all the code/files that, paired -with the core QAPI libraries, comprise everything required to take JSON -commands read in by a Client JSON Protocol server, unmarshal the arguments into -the underlying C types, call into the corresponding C function, and map the -response back to a Client JSON Protocol response to be returned to the user. +Schemas are fed into five scripts to generate all the code/files that, +paired with the core QAPI libraries, comprise everything required to +take JSON commands read in by a Client JSON Protocol server, unmarshal +the arguments into the underlying C types, call into the corresponding +C function, map the response back to a Client JSON Protocol response +to be returned to the user, and introspect the commands. -As an example, we'll use the following schema, which describes a single -complex user-defined type (which will produce a C struct, along with a list -node structure that can be used to chain together a list of such types in -case we want to accept/return a list of this type with a command), and a -command which takes that type as a parameter and returns the same type: +As an example, we'll use the following schema, which describes a +single complex user-defined type, along with command which takes a +list of that type as a parameter, and returns a single element of that +type. The user is responsible for writing the implementation of +qmp_my_command(); everything else is produced by the generator. $ cat example-schema.json { 'struct': 'UserDefOne', - 'data': { 'integer': 'int', 'string': 'str' } } + 'data': { 'integer': 'int', '*string': 'str' } } { 'command': 'my-command', - 'data': {'arg1': 'UserDefOne'}, + 'data': { 'arg1': ['UserDefOne'] }, 'returns': 'UserDefOne' } { 'event': 'MY_EVENT' } +For a more thorough look at generated code, the testsuite includes +tests/qapi-schema/qapi-schema-tests.json that covers more examples of +what the generator will accept, and compiles the resulting C code as +part of 'make check-unit'. + === scripts/qapi-types.py === -Used to generate the C types defined by a schema. The following files are -created: +Used to generate the C types defined by a schema, along with +supporting code. The following files are created: $(prefix)qapi-types.h - C types corresponding to types defined in the schema you pass in @@ -537,77 +769,73 @@ Example: $ python scripts/qapi-types.py --output-dir="qapi-generated" \ --prefix="example-" example-schema.json + $ cat qapi-generated/example-qapi-types.h +[Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_TYPES_H + #define EXAMPLE_QAPI_TYPES_H + +[Built-in types omitted...] + + typedef struct UserDefOne UserDefOne; + + typedef struct UserDefOneList UserDefOneList; + + struct UserDefOne { + int64_t integer; + bool has_string; + char *string; + }; + + void qapi_free_UserDefOne(UserDefOne *obj); + + struct UserDefOneList { + UserDefOneList *next; + UserDefOne *value; + }; + + void qapi_free_UserDefOneList(UserDefOneList *obj); + + #endif $ cat qapi-generated/example-qapi-types.c [Uninteresting stuff omitted...] - void qapi_free_UserDefOneList(UserDefOneList *obj) + void qapi_free_UserDefOne(UserDefOne *obj) { - QapiDeallocVisitor *md; + QapiDeallocVisitor *qdv; Visitor *v; if (!obj) { return; } - md = qapi_dealloc_visitor_new(); - v = qapi_dealloc_get_visitor(md); - visit_type_UserDefOneList(v, &obj, NULL, NULL); - qapi_dealloc_visitor_cleanup(md); + qdv = qapi_dealloc_visitor_new(); + v = qapi_dealloc_get_visitor(qdv); + visit_type_UserDefOne(v, NULL, &obj, NULL); + qapi_dealloc_visitor_cleanup(qdv); } - void qapi_free_UserDefOne(UserDefOne *obj) + void qapi_free_UserDefOneList(UserDefOneList *obj) { - QapiDeallocVisitor *md; + QapiDeallocVisitor *qdv; Visitor *v; if (!obj) { return; } - md = qapi_dealloc_visitor_new(); - v = qapi_dealloc_get_visitor(md); - visit_type_UserDefOne(v, &obj, NULL, NULL); - qapi_dealloc_visitor_cleanup(md); + qdv = qapi_dealloc_visitor_new(); + v = qapi_dealloc_get_visitor(qdv); + visit_type_UserDefOneList(v, NULL, &obj, NULL); + qapi_dealloc_visitor_cleanup(qdv); } - $ cat qapi-generated/example-qapi-types.h -[Uninteresting stuff omitted...] - - #ifndef EXAMPLE_QAPI_TYPES_H - #define EXAMPLE_QAPI_TYPES_H - -[Built-in types omitted...] - - typedef struct UserDefOne UserDefOne; - - typedef struct UserDefOneList - { - union { - UserDefOne *value; - uint64_t padding; - }; - struct UserDefOneList *next; - } UserDefOneList; - -[Functions on built-in types omitted...] - - struct UserDefOne - { - int64_t integer; - char *string; - }; - - void qapi_free_UserDefOneList(UserDefOneList *obj); - void qapi_free_UserDefOne(UserDefOne *obj); - - #endif - === scripts/qapi-visit.py === -Used to generate the visitor functions used to walk through and convert -a QObject (as provided by QMP) to a native C data structure and -vice-versa, as well as the visitor function used to dealloc a complex -schema-defined C type. +Used to generate the visitor functions used to walk through and +convert between a native QAPI C data structure and some other format +(such as QObject); the generated functions are named visit_type_FOO() +and visit_type_FOO_members(). The following files are generated: @@ -624,79 +852,90 @@ Example: $ python scripts/qapi-visit.py --output-dir="qapi-generated" --prefix="example-" example-schema.json + $ cat qapi-generated/example-qapi-visit.h +[Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_VISIT_H + #define EXAMPLE_QAPI_VISIT_H + +[Visitors for built-in types omitted...] + + void visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp); + void visit_type_UserDefOne(Visitor *v, const char *name, UserDefOne **obj, Error **errp); + void visit_type_UserDefOneList(Visitor *v, const char *name, UserDefOneList **obj, Error **errp); + + #endif $ cat qapi-generated/example-qapi-visit.c [Uninteresting stuff omitted...] - static void visit_type_UserDefOne_fields(Visitor *m, UserDefOne **obj, Error **errp) + void visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp) { Error *err = NULL; - visit_type_int(m, &(*obj)->integer, "integer", &err); + + visit_type_int(v, "integer", &obj->integer, &err); if (err) { goto out; } - visit_type_str(m, &(*obj)->string, "string", &err); - if (err) { - goto out; + if (visit_optional(v, "string", &obj->has_string)) { + visit_type_str(v, "string", &obj->string, &err); + if (err) { + goto out; + } } out: error_propagate(errp, err); } - void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp) + void visit_type_UserDefOne(Visitor *v, const char *name, UserDefOne **obj, Error **errp) { Error *err = NULL; - visit_start_struct(m, (void **)obj, "UserDefOne", name, sizeof(UserDefOne), &err); - if (!err) { - if (*obj) { - visit_type_UserDefOne_fields(m, obj, errp); - } - visit_end_struct(m, &err); + visit_start_struct(v, name, (void **)obj, sizeof(UserDefOne), &err); + if (err) { + goto out; } + if (!*obj) { + goto out_obj; + } + visit_type_UserDefOne_members(v, *obj, &err); + error_propagate(errp, err); + err = NULL; + out_obj: + visit_end_struct(v, &err); + out: error_propagate(errp, err); } - void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp) + void visit_type_UserDefOneList(Visitor *v, const char *name, UserDefOneList **obj, Error **errp) { Error *err = NULL; GenericList *i, **prev; - visit_start_list(m, name, &err); + visit_start_list(v, name, &err); if (err) { goto out; } for (prev = (GenericList **)obj; - !err && (i = visit_next_list(m, prev, &err)) != NULL; + !err && (i = visit_next_list(v, prev, sizeof(**obj))) != NULL; prev = &i) { UserDefOneList *native_i = (UserDefOneList *)i; - visit_type_UserDefOne(m, &native_i->value, NULL, &err); + visit_type_UserDefOne(v, NULL, &native_i->value, &err); } - error_propagate(errp, err); - err = NULL; - visit_end_list(m, &err); + visit_end_list(v); out: error_propagate(errp, err); } - $ cat qapi-generated/example-qapi-visit.h -[Uninteresting stuff omitted...] - - #ifndef EXAMPLE_QAPI_VISIT_H - #define EXAMPLE_QAPI_VISIT_H - -[Visitors for built-in types omitted...] - - void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp); - void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp); - - #endif === scripts/qapi-commands.py === -Used to generate the marshaling/dispatch functions for the commands defined -in the schema. The following files are generated: +Used to generate the marshaling/dispatch functions for the commands +defined in the schema. The generated code implements +qmp_marshal_COMMAND() (mentioned in qmp-commands.hx, and registered +automatically), and declares qmp_COMMAND() that the user must +implement. The following files are generated: $(prefix)qmp-marshal.c: command marshal/dispatch functions for each QMP command defined in the schema. Functions @@ -714,88 +953,88 @@ Example: $ python scripts/qapi-commands.py --output-dir="qapi-generated" --prefix="example-" example-schema.json + $ cat qapi-generated/example-qmp-commands.h +[Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QMP_COMMANDS_H + #define EXAMPLE_QMP_COMMANDS_H + + #include "example-qapi-types.h" + #include "qapi/qmp/qdict.h" + #include "qapi/error.h" + + UserDefOne *qmp_my_command(UserDefOneList *arg1, Error **errp); + + #endif $ cat qapi-generated/example-qmp-marshal.c [Uninteresting stuff omitted...] - static void qmp_marshal_output_my_command(UserDefOne *ret_in, QObject **ret_out, Error **errp) + static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in, QObject **ret_out, Error **errp) { - Error *local_err = NULL; - QmpOutputVisitor *mo = qmp_output_visitor_new(); - QapiDeallocVisitor *md; + Error *err = NULL; + QmpOutputVisitor *qov = qmp_output_visitor_new(); + QapiDeallocVisitor *qdv; Visitor *v; - v = qmp_output_get_visitor(mo); - visit_type_UserDefOne(v, &ret_in, "unused", &local_err); - if (local_err) { + v = qmp_output_get_visitor(qov); + visit_type_UserDefOne(v, "unused", &ret_in, &err); + if (err) { goto out; } - *ret_out = qmp_output_get_qobject(mo); + *ret_out = qmp_output_get_qobject(qov); out: - error_propagate(errp, local_err); - qmp_output_visitor_cleanup(mo); - md = qapi_dealloc_visitor_new(); - v = qapi_dealloc_get_visitor(md); - visit_type_UserDefOne(v, &ret_in, "unused", NULL); - qapi_dealloc_visitor_cleanup(md); + error_propagate(errp, err); + qmp_output_visitor_cleanup(qov); + qdv = qapi_dealloc_visitor_new(); + v = qapi_dealloc_get_visitor(qdv); + visit_type_UserDefOne(v, "unused", &ret_in, NULL); + qapi_dealloc_visitor_cleanup(qdv); } - static void qmp_marshal_input_my_command(QDict *args, QObject **ret, Error **errp) + static void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp) { - Error *local_err = NULL; - UserDefOne *retval = NULL; - QmpInputVisitor *mi = qmp_input_visitor_new_strict(QOBJECT(args)); - QapiDeallocVisitor *md; + Error *err = NULL; + UserDefOne *retval; + QmpInputVisitor *qiv = qmp_input_visitor_new_strict(QOBJECT(args)); + QapiDeallocVisitor *qdv; Visitor *v; - UserDefOne *arg1 = NULL; + UserDefOneList *arg1 = NULL; - v = qmp_input_get_visitor(mi); - visit_type_UserDefOne(v, &arg1, "arg1", &local_err); - if (local_err) { + v = qmp_input_get_visitor(qiv); + visit_type_UserDefOneList(v, "arg1", &arg1, &err); + if (err) { goto out; } - retval = qmp_my_command(arg1, &local_err); - if (local_err) { + retval = qmp_my_command(arg1, &err); + if (err) { goto out; } - qmp_marshal_output_my_command(retval, ret, &local_err); + qmp_marshal_output_UserDefOne(retval, ret, &err); out: - error_propagate(errp, local_err); - qmp_input_visitor_cleanup(mi); - md = qapi_dealloc_visitor_new(); - v = qapi_dealloc_get_visitor(md); - visit_type_UserDefOne(v, &arg1, "arg1", NULL); - qapi_dealloc_visitor_cleanup(md); - return; + error_propagate(errp, err); + qmp_input_visitor_cleanup(qiv); + qdv = qapi_dealloc_visitor_new(); + v = qapi_dealloc_get_visitor(qdv); + visit_type_UserDefOneList(v, "arg1", &arg1, NULL); + qapi_dealloc_visitor_cleanup(qdv); } static void qmp_init_marshal(void) { - qmp_register_command("my-command", qmp_marshal_input_my_command, QCO_NO_OPTIONS); + qmp_register_command("my-command", qmp_marshal_my_command, QCO_NO_OPTIONS); } qapi_init(qmp_init_marshal); - $ cat qapi-generated/example-qmp-commands.h -[Uninteresting stuff omitted...] - - #ifndef EXAMPLE_QMP_COMMANDS_H - #define EXAMPLE_QMP_COMMANDS_H - - #include "example-qapi-types.h" - #include "qapi/qmp/qdict.h" - #include "qapi/error.h" - - UserDefOne *qmp_my_command(UserDefOne *arg1, Error **errp); - - #endif === scripts/qapi-event.py === -Used to generate the event-related C code defined by a schema. The -following files are created: +Used to generate the event-related C code defined by a schema, with +implementations for qapi_event_send_FOO(). The following files are +created: $(prefix)qapi-event.h - Function prototypes for each event type, plus an enumeration of all event names @@ -805,13 +1044,34 @@ Example: $ python scripts/qapi-event.py --output-dir="qapi-generated" --prefix="example-" example-schema.json + $ cat qapi-generated/example-qapi-event.h +[Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_EVENT_H + #define EXAMPLE_QAPI_EVENT_H + + #include "qapi/error.h" + #include "qapi/qmp/qdict.h" + #include "example-qapi-types.h" + + + void qapi_event_send_my_event(Error **errp); + + typedef enum example_QAPIEvent { + EXAMPLE_QAPI_EVENT_MY_EVENT = 0, + EXAMPLE_QAPI_EVENT__MAX = 1, + } example_QAPIEvent; + + extern const char *const example_QAPIEvent_lookup[]; + + #endif $ cat qapi-generated/example-qapi-event.c [Uninteresting stuff omitted...] void qapi_event_send_my_event(Error **errp) { QDict *qmp; - Error *local_err = NULL; + Error *err = NULL; QMPEventFuncEmit emit; emit = qmp_event_get_func_emit(); if (!emit) { @@ -820,34 +1080,48 @@ Example: qmp = qmp_event_build_dict("MY_EVENT"); - emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &local_err); + emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &err); - error_propagate(errp, local_err); + error_propagate(errp, err); QDECREF(qmp); } - const char *EXAMPLE_QAPIEvent_lookup[] = { - "MY_EVENT", - NULL, + const char *const example_QAPIEvent_lookup[] = { + [EXAMPLE_QAPI_EVENT_MY_EVENT] = "MY_EVENT", + [EXAMPLE_QAPI_EVENT__MAX] = NULL, }; - $ cat qapi-generated/example-qapi-event.h -[Uninteresting stuff omitted...] - #ifndef EXAMPLE_QAPI_EVENT_H - #define EXAMPLE_QAPI_EVENT_H +=== scripts/qapi-introspect.py === - #include "qapi/error.h" - #include "qapi/qmp/qdict.h" - #include "example-qapi-types.h" +Used to generate the introspection C code for a schema. The following +files are created: +$(prefix)qmp-introspect.c - Defines a string holding a JSON + description of the schema. +$(prefix)qmp-introspect.h - Declares the above string. - void qapi_event_send_my_event(Error **errp); +Example: - extern const char *EXAMPLE_QAPIEvent_lookup[]; - typedef enum EXAMPLE_QAPIEvent - { - EXAMPLE_QAPI_EVENT_MY_EVENT = 0, - EXAMPLE_QAPI_EVENT_MAX = 1, - } EXAMPLE_QAPIEvent; + $ python scripts/qapi-introspect.py --output-dir="qapi-generated" + --prefix="example-" example-schema.json + $ cat qapi-generated/example-qmp-introspect.h +[Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QMP_INTROSPECT_H + #define EXAMPLE_QMP_INTROSPECT_H + + extern const char example_qmp_schema_json[]; #endif + $ cat qapi-generated/example-qmp-introspect.c +[Uninteresting stuff omitted...] + + const char example_qmp_schema_json[] = "[" + "{\"arg-type\": \"0\", \"meta-type\": \"event\", \"name\": \"MY_EVENT\"}, " + "{\"arg-type\": \"1\", \"meta-type\": \"command\", \"name\": \"my-command\", \"ret-type\": \"2\"}, " + "{\"members\": [], \"meta-type\": \"object\", \"name\": \"0\"}, " + "{\"members\": [{\"name\": \"arg1\", \"type\": \"[2]\"}], \"meta-type\": \"object\", \"name\": \"1\"}, " + "{\"members\": [{\"name\": \"integer\", \"type\": \"int\"}, {\"default\": null, \"name\": \"string\", \"type\": \"str\"}], \"meta-type\": \"object\", \"name\": \"2\"}, " + "{\"element-type\": \"2\", \"meta-type\": \"array\", \"name\": \"[2]\"}, " + "{\"json-type\": \"int\", \"meta-type\": \"builtin\", \"name\": \"int\"}, " + "{\"json-type\": \"string\", \"meta-type\": \"builtin\", \"name\": \"str\"}]"; diff --git a/qemu/docs/qcow2-cache.txt b/qemu/docs/qcow2-cache.txt new file mode 100644 index 000000000..5bb06072d --- /dev/null +++ b/qemu/docs/qcow2-cache.txt @@ -0,0 +1,164 @@ +qcow2 L2/refcount cache configuration +===================================== +Copyright (C) 2015 Igalia, S.L. +Author: Alberto Garcia <berto@igalia.com> + +This work is licensed under the terms of the GNU GPL, version 2 or +later. See the COPYING file in the top-level directory. + +Introduction +------------ +The QEMU qcow2 driver has two caches that can improve the I/O +performance significantly. However, setting the right cache sizes is +not a straightforward operation. + +This document attempts to give an overview of the L2 and refcount +caches, and how to configure them. + +Please refer to the docs/specs/qcow2.txt file for an in-depth +technical description of the qcow2 file format. + + +Clusters +-------- +A qcow2 file is organized in units of constant size called clusters. + +The cluster size is configurable, but it must be a power of two and +its value 512 bytes or higher. QEMU currently defaults to 64 KB +clusters, and it does not support sizes larger than 2MB. + +The 'qemu-img create' command supports specifying the size using the +cluster_size option: + + qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G + + +The L2 tables +------------- +The qcow2 format uses a two-level structure to map the virtual disk as +seen by the guest to the disk image in the host. These structures are +called the L1 and L2 tables. + +There is one single L1 table per disk image. The table is small and is +always kept in memory. + +There can be many L2 tables, depending on how much space has been +allocated in the image. Each table is one cluster in size. In order to +read or write data from the virtual disk, QEMU needs to read its +corresponding L2 table to find out where that data is located. Since +reading the table for each I/O operation can be expensive, QEMU keeps +an L2 cache in memory to speed up disk access. + +The size of the L2 cache can be configured, and setting the right +value can improve the I/O performance significantly. + + +The refcount blocks +------------------- +The qcow2 format also mantains a reference count for each cluster. +Reference counts are used for cluster allocation and internal +snapshots. The data is stored in a two-level structure similar to the +L1/L2 tables described above. + +The second level structures are called refcount blocks, are also one +cluster in size and the number is also variable and dependent on the +amount of allocated space. + +Each block contains a number of refcount entries. Their size (in bits) +is a power of two and must not be higher than 64. It defaults to 16 +bits, but a different value can be set using the refcount_bits option: + + qemu-img create -f qcow2 -o refcount_bits=8 hd.qcow2 4G + +QEMU keeps a refcount cache to speed up I/O much like the +aforementioned L2 cache, and its size can also be configured. + + +Choosing the right cache sizes +------------------------------ +In order to choose the cache sizes we need to know how they relate to +the amount of allocated space. + +The amount of virtual disk that can be mapped by the L2 and refcount +caches (in bytes) is: + + disk_size = l2_cache_size * cluster_size / 8 + disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits + +With the default values for cluster_size (64KB) and refcount_bits +(16), that is + + disk_size = l2_cache_size * 8192 + disk_size = refcount_cache_size * 32768 + +So in order to cover n GB of disk space with the default values we +need: + + l2_cache_size = disk_size_GB * 131072 + refcount_cache_size = disk_size_GB * 32768 + +QEMU has a default L2 cache of 1MB (1048576 bytes) and a refcount +cache of 256KB (262144 bytes), so using the formulas we've just seen +we have + + 1048576 / 131072 = 8 GB of virtual disk covered by that cache + 262144 / 32768 = 8 GB + + +How to configure the cache sizes +-------------------------------- +Cache sizes can be configured using the -drive option in the +command-line, or the 'blockdev-add' QMP command. + +There are three options available, and all of them take bytes: + +"l2-cache-size": maximum size of the L2 table cache +"refcount-cache-size": maximum size of the refcount block cache +"cache-size": maximum size of both caches combined + +There are two things that need to be taken into account: + + - Both caches must have a size that is a multiple of the cluster + size. + + - If you only set one of the options above, QEMU will automatically + adjust the others so that the L2 cache is 4 times bigger than the + refcount cache. + +This means that these options are equivalent: + + -drive file=hd.qcow2,l2-cache-size=2097152 + -drive file=hd.qcow2,refcount-cache-size=524288 + -drive file=hd.qcow2,cache-size=2621440 + +The reason for this 1/4 ratio is to ensure that both caches cover the +same amount of disk space. Note however that this is only valid with +the default value of refcount_bits (16). If you are using a different +value you might want to calculate both cache sizes yourself since QEMU +will always use the same 1/4 ratio. + +It's also worth mentioning that there's no strict need for both caches +to cover the same amount of disk space. The refcount cache is used +much less often than the L2 cache, so it's perfectly reasonable to +keep it small. + + +Reducing the memory usage +------------------------- +It is possible to clean unused cache entries in order to reduce the +memory usage during periods of low I/O activity. + +The parameter "cache-clean-interval" defines an interval (in seconds). +All cache entries that haven't been accessed during that interval are +removed from memory. + +This example removes all unused cache entries every 15 minutes: + + -drive file=hd.qcow2,cache-clean-interval=900 + +If unset, the default value for this parameter is 0 and it disables +this feature. + +Note that this functionality currently relies on the MADV_DONTNEED +argument for madvise() to actually free the memory, so it is not +useful in systems that don't follow that behavior. diff --git a/qemu/docs/qmp/qmp-events.txt b/qemu/docs/qmp-events.txt index d92cc4833..fa7574d67 100644 --- a/qemu/docs/qmp/qmp-events.txt +++ b/qemu/docs/qmp-events.txt @@ -28,6 +28,8 @@ Example: "data": { "actual": 944766976 }, "timestamp": { "seconds": 1267020223, "microseconds": 435656 } } +Note: this event is rate-limited. + BLOCK_IMAGE_CORRUPTED --------------------- @@ -218,6 +220,24 @@ Data: }, "timestamp": { "seconds": 1265044230, "microseconds": 450486 } } +DUMP_COMPLETED +-------------- + +Emitted when the guest has finished one memory dump. + +Data: + +- "result": DumpQueryResult type described in qapi-schema.json +- "error": Error message when dump failed. This is only a + human-readable string provided when dump failed. It should not be + parsed in any way (json-string, optional) + +Example: + +{ "event": "DUMP_COMPLETED", + "data": {"result": {"total": 1090650112, "status": "completed", + "completed": 1090650112} } } + GUEST_PANICKED -------------- @@ -296,6 +316,8 @@ Example: "data": { "reference": "usr1", "sector-num": 345435, "sectors-count": 5 }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } +Note: this event is rate-limited. + QUORUM_REPORT_BAD ----------------- @@ -303,6 +325,7 @@ Emitted to report a corruption of a Quorum file. Data: +- "type": Quorum operation type - "error": Error message (json-string, optional) Only present on failure. This field contains a human-readable error message. There are no semantics other than that the @@ -314,10 +337,20 @@ Data: Example: +Read operation: { "event": "QUORUM_REPORT_BAD", - "data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 }, + "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 5, + "type": "read" }, "timestamp": { "seconds": 1344522075, "microseconds": 745528 } } +Flush operation: +{ "event": "QUORUM_REPORT_BAD", + "data": { "node-name": "node0", "sector-num": 0, "sectors-count": 2097120, + "type": "flush", "error": "Broken pipe" }, + "timestamp": { "seconds": 1456406829, "microseconds": 291763 } } + +Note: this event is rate-limited. + RESET ----- @@ -358,6 +391,8 @@ Example: "data": { "offset": 78 }, "timestamp": { "seconds": 1267020223, "microseconds": 435656 } } +Note: this event is rate-limited. + SHUTDOWN -------- @@ -488,6 +523,20 @@ Example: {"timestamp": {"seconds": 1432121972, "microseconds": 744001}, "event": "MIGRATION", "data": {"status": "completed"}} +MIGRATION_PASS +-------------- + +Emitted from the source side of a migration at the start of each pass +(when it syncs the dirty bitmap) + +Data: None. + + - "pass": An incrementing count (starting at 1 on the first pass) + +Example: +{"timestamp": {"seconds": 1449669631, "microseconds": 239225}, + "event": "MIGRATION_PASS", "data": {"pass": 2}} + STOP ---- @@ -632,6 +681,8 @@ Example: "data": { "id": "channel0", "open": true }, "timestamp": { "seconds": 1401385907, "microseconds": 422329 } } +Note: this event is rate-limited separately for each "id". + WAKEUP ------ @@ -662,3 +713,5 @@ Example: Note: If action is "reset", "shutdown", or "pause" the WATCHDOG event is followed respectively by the RESET, SHUTDOWN, or STOP events. + +Note: this event is rate-limited. diff --git a/qemu/docs/qmp/README b/qemu/docs/qmp-intro.txt index f6a3a031e..f6a3a031e 100644 --- a/qemu/docs/qmp/README +++ b/qemu/docs/qmp-intro.txt diff --git a/qemu/docs/qmp/qmp-spec.txt b/qemu/docs/qmp-spec.txt index 4c28cd943..f8b535601 100644 --- a/qemu/docs/qmp/qmp-spec.txt +++ b/qemu/docs/qmp-spec.txt @@ -3,7 +3,7 @@ 0. About This Document ====================== -Copyright (C) 2009-2015 Red Hat, Inc. +Copyright (C) 2009-2016 Red Hat, Inc. This work is licensed under the terms of the GNU GPL, version 2 or later. See the COPYING file in the top-level directory. @@ -175,7 +175,12 @@ The format of asynchronous events is: For a listing of supported asynchronous events, please, refer to the qmp-events.txt file. -2.5 QGA Synchronization +Some events are rate-limited to at most one per second. If additional +"similar" events arrive within one second, all but the last one are +dropped, and the last one is delayed. "Similar" normally means same +event type. See qmp-events.txt for details. + +2.6 QGA Synchronization ----------------------- When using QGA, an additional synchronization feature is built into @@ -272,7 +277,7 @@ However, Clients must not assume any particular: - Amount of errors generated by a command, that is, new errors can be added to any existing command in newer versions of the Server -Any command or field name beginning with "x-" is deemed experimental, +Any command or member name beginning with "x-" is deemed experimental, and may be withdrawn or changed in an incompatible manner in a future release. diff --git a/qemu/docs/rcu.txt b/qemu/docs/rcu.txt index 21ecb8106..2f70954e8 100644 --- a/qemu/docs/rcu.txt +++ b/qemu/docs/rcu.txt @@ -128,7 +128,7 @@ The core RCU API is small: the callback function is g_free, in particular, g_free_rcu can be used. In the above case, one could have written simply: - g_free_rcu(foo_reclaim, rcu); + g_free_rcu(&foo, rcu); typeof(*p) atomic_rcu_read(p); diff --git a/qemu/docs/replay.txt b/qemu/docs/replay.txt new file mode 100644 index 000000000..779c6c059 --- /dev/null +++ b/qemu/docs/replay.txt @@ -0,0 +1,197 @@ +Copyright (c) 2010-2015 Institute for System Programming + of the Russian Academy of Sciences. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Record/replay +------------- + +Record/replay functions are used for the reverse execution and deterministic +replay of qemu execution. This implementation of deterministic replay can +be used for deterministic debugging of guest code through a gdb remote +interface. + +Execution recording writes a non-deterministic events log, which can be later +used for replaying the execution anywhere and for unlimited number of times. +It also supports checkpointing for faster rewinding during reverse debugging. +Execution replaying reads the log and replays all non-deterministic events +including external input, hardware clocks, and interrupts. + +Deterministic replay has the following features: + * Deterministically replays whole system execution and all contents of + the memory, state of the hardware devices, clocks, and screen of the VM. + * Writes execution log into the file for later replaying for multiple times + on different machines. + * Supports i386, x86_64, and ARM hardware platforms. + * Performs deterministic replay of all operations with keyboard and mouse + input devices. + +Usage of the record/replay: + * First, record the execution, by adding the following arguments to the command line: + '-icount shift=7,rr=record,rrfile=replay.bin -net none'. + Block devices' images are not actually changed in the recording mode, + because all of the changes are written to the temporary overlay file. + * Then you can replay it by using another command + line option: '-icount shift=7,rr=replay,rrfile=replay.bin -net none' + * '-net none' option should also be specified if network replay patches + are not applied. + +Papers with description of deterministic replay implementation: +http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html +http://dl.acm.org/citation.cfm?id=2786805.2803179 + +Modifications of qemu include: + * wrappers for clock and time functions to save their return values in the log + * saving different asynchronous events (e.g. system shutdown) into the log + * synchronization of the bottom halves execution + * synchronization of the threads from thread pool + * recording/replaying user input (mouse and keyboard) + * adding internal checkpoints for cpu and io synchronization + +Non-deterministic events +------------------------ + +Our record/replay system is based on saving and replaying non-deterministic +events (e.g. keyboard input) and simulating deterministic ones (e.g. reading +from HDD or memory of the VM). Saving only non-deterministic events makes +log file smaller, simulation faster, and allows using reverse debugging even +for realtime applications. + +The following non-deterministic data from peripheral devices is saved into +the log: mouse and keyboard input, network packets, audio controller input, +USB packets, serial port input, and hardware clocks (they are non-deterministic +too, because their values are taken from the host machine). Inputs from +simulated hardware, memory of VM, software interrupts, and execution of +instructions are not saved into the log, because they are deterministic and +can be replayed by simulating the behavior of virtual machine starting from +initial state. + +We had to solve three tasks to implement deterministic replay: recording +non-deterministic events, replaying non-deterministic events, and checking +that there is no divergence between record and replay modes. + +We changed several parts of QEMU to make event log recording and replaying. +Devices' models that have non-deterministic input from external devices were +changed to write every external event into the execution log immediately. +E.g. network packets are written into the log when they arrive into the virtual +network adapter. + +All non-deterministic events are coming from these devices. But to +replay them we need to know at which moments they occur. We specify +these moments by counting the number of instructions executed between +every pair of consecutive events. + +Instruction counting +-------------------- + +QEMU should work in icount mode to use record/replay feature. icount was +designed to allow deterministic execution in absence of external inputs +of the virtual machine. We also use icount to control the occurrence of the +non-deterministic events. The number of instructions elapsed from the last event +is written to the log while recording the execution. In replay mode we +can predict when to inject that event using the instruction counter. + +Timers +------ + +Timers are used to execute callbacks from different subsystems of QEMU +at the specified moments of time. There are several kinds of timers: + * Real time clock. Based on host time and used only for callbacks that + do not change the virtual machine state. For this reason real time + clock and timers does not affect deterministic replay at all. + * Virtual clock. These timers run only during the emulation. In icount + mode virtual clock value is calculated using executed instructions counter. + That is why it is completely deterministic and does not have to be recorded. + * Host clock. This clock is used by device models that simulate real time + sources (e.g. real time clock chip). Host clock is the one of the sources + of non-determinism. Host clock read operations should be logged to + make the execution deterministic. + * Virtual real time clock. This clock is similar to real time clock but + it is used only for increasing virtual clock while virtual machine is + sleeping. Due to its nature it is also non-deterministic as the host clock + and has to be logged too. + +Checkpoints +----------- + +Replaying of the execution of virtual machine is bound by sources of +non-determinism. These are inputs from clock and peripheral devices, +and QEMU thread scheduling. Thread scheduling affect on processing events +from timers, asynchronous input-output, and bottom halves. + +Invocations of timers are coupled with clock reads and changing the state +of the virtual machine. Reads produce non-deterministic data taken from +host clock. And VM state changes should preserve their order. Their relative +order in replay mode must replicate the order of callbacks in record mode. +To preserve this order we use checkpoints. When a specific clock is processed +in record mode we save to the log special "checkpoint" event. +Checkpoints here do not refer to virtual machine snapshots. They are just +record/replay events used for synchronization. + +QEMU in replay mode will try to invoke timers processing in random moment +of time. That's why we do not process a group of timers until the checkpoint +event will be read from the log. Such an event allows synchronizing CPU +execution and timer events. + +Two other checkpoints govern the "warping" of the virtual clock. +While the virtual machine is idle, the virtual clock increments at +1 ns per *real time* nanosecond. This is done by setting up a timer +(called the warp timer) on the virtual real time clock, so that the +timer fires at the next deadline of the virtual clock; the virtual clock +is then incremented (which is called "warping" the virtual clock) as +soon as the timer fires or the CPUs need to go out of the idle state. +Two functions are used for this purpose; because these actions change +virtual machine state and must be deterministic, each of them creates a +checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so +starts accounting real time to virtual clock. qemu_account_warp_timer +is called when the CPUs get an interrupt or when the warp timer fires, +and it warps the virtual clock by the amount of real time that has passed +since qemu_start_warp_timer. + +Bottom halves +------------- + +Disk I/O events are completely deterministic in our model, because +in both record and replay modes we start virtual machine from the same +disk state. But callbacks that virtual disk controller uses for reading and +writing the disk may occur at different moments of time in record and replay +modes. + +Reading and writing requests are created by CPU thread of QEMU. Later these +requests proceed to block layer which creates "bottom halves". Bottom +halves consist of callback and its parameters. They are processed when +main loop locks the global mutex. These locks are not synchronized with +replaying process because main loop also processes the events that do not +affect the virtual machine state (like user interaction with monitor). + +That is why we had to implement saving and replaying bottom halves callbacks +synchronously to the CPU execution. When the callback is about to execute +it is added to the queue in the replay module. This queue is written to the +log when its callbacks are executed. In replay mode callbacks are not processed +until the corresponding event is read from the events log file. + +Sometimes the block layer uses asynchronous callbacks for its internal purposes +(like reading or writing VM snapshots or disk image cluster tables). In this +case bottom halves are not marked as "replayable" and do not saved +into the log. + +Block devices +------------- + +Block devices record/replay module intercepts calls of +bdrv coroutine functions at the top of block drivers stack. +To record and replay block operations the drive must be configured +as following: + -drive file=disk.qcow,if=none,id=img-direct + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay + -device ide-hd,drive=img-blkreplay + +blkreplay driver should be inserted between disk image and virtual driver +controller. Therefore all disk requests may be recorded and replayed. + +All block completion operations are added to the queue in the coroutines. +Queue is flushed at checkpoints and information about processed requests +is recorded to the log. In replay phase the queue is matched with +events read from the log. Therefore block devices requests are processed +deterministically. diff --git a/qemu/docs/specs/fw_cfg.txt b/qemu/docs/specs/fw_cfg.txt index 74351dd18..7a5f8c782 100644 --- a/qemu/docs/specs/fw_cfg.txt +++ b/qemu/docs/specs/fw_cfg.txt @@ -76,6 +76,22 @@ increasing address order, similar to memcpy(). Selector Register IOport: 0x510 Data Register IOport: 0x511 +DMA Address IOport: 0x514 + +=== ARM Register Locations === + +Selector Register address: Base + 8 (2 bytes) +Data Register address: Base + 0 (8 bytes) +DMA Address address: Base + 16 (8 bytes) + +== ACPI Interface == + +The fw_cfg device is defined with ACPI ID "QEMU0002". Since we expect +ACPI tables to be passed into the guest through the fw_cfg device itself, +the guest-side firmware can not use ACPI to find fw_cfg. However, once the +firmware is finished setting up ACPI tables and hands control over to the +guest kernel, the latter can use the fw_cfg ACPI node for a more accurate +inventory of in-use IOport or MMIO regions. == Firmware Configuration Items == @@ -86,11 +102,15 @@ by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE), and reading four bytes from the data register. If the fw_cfg device is present, the four bytes read will contain the characters "QEMU". -=== Revision (Key 0x0001, FW_CFG_ID) === +If the DMA interface is available, then reading the DMA Address +Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format). + +=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) === -A 32-bit little-endian unsigned int, this item is used as an interface -revision number, and is currently set to 1 by QEMU when fw_cfg is -initialized. +A 32-bit little-endian unsigned int, this item is used to check for enabled +features. + - Bit 0: traditional interface. Always set. + - Bit 1: DMA interface. === File Directory (Key 0x0019, FW_CFG_FILE_DIR) === @@ -132,79 +152,56 @@ Selector Reg. Range Usage In practice, the number of allowed firmware configuration items is given by the value of FW_CFG_MAX_ENTRY (see fw_cfg.h). -= Host-side API = - -The following functions are available to the QEMU programmer for adding -data to a fw_cfg device during guest initialization (see fw_cfg.h for -each function's complete prototype): - -== fw_cfg_add_bytes() == - -Given a selector key value, starting pointer, and size, create an item -as a raw "blob" of the given size, available by selecting the given key. -The data referenced by the starting pointer is only linked, NOT copied, -into the data structure of the fw_cfg device. - -== fw_cfg_add_string() == += Guest-side DMA Interface = -Instead of a starting pointer and size, this function accepts a pointer -to a NUL-terminated ascii string, and inserts a newly allocated copy of -the string (including the NUL terminator) into the fw_cfg device data -structure. +If bit 1 of the feature bitmap is set, the DMA interface is present. This does +not replace the existing fw_cfg interface, it is an add-on. This interface +can be used through the 64-bit wide address register. -== fw_cfg_add_iXX() == +The address register is in big-endian format. The value for the register is 0 +at startup and after an operation. A write to the least significant half (at +offset 4) triggers an operation. This means that operations with 32-bit +addresses can be triggered with just one write, whereas operations with +64-bit addresses can be triggered with one 64-bit write or two 32-bit writes, +starting with the most significant half (at offset 0). -Insert an XX-bit item, where XX may be 16, 32, or 64. These functions -will convert a 16-, 32-, or 64-bit integer to little-endian, then add -a dynamically allocated copy of the appropriately sized item to fw_cfg -under the given selector key value. +In this register, the physical address of a FWCfgDmaAccess structure in RAM +should be written. This is the format of the FWCfgDmaAccess structure: -== fw_cfg_add_file() == +typedef struct FWCfgDmaAccess { + uint32_t control; + uint32_t length; + uint64_t address; +} FWCfgDmaAccess; -Given a filename (i.e., fw_cfg item name), starting pointer, and size, -create an item as a raw "blob" of the given size. Unlike fw_cfg_add_bytes() -above, the next available selector key (above 0x0020, FW_CFG_FILE_FIRST) -will be used, and a new entry will be added to the file directory structure -(at key 0x0019), containing the item name, blob size, and automatically -assigned selector key value. The data referenced by the starting pointer -is only linked, NOT copied, into the fw_cfg data structure. +The fields of the structure are in big endian mode, and the field at the lowest +address is the "control" field. -== fw_cfg_add_file_callback() == +The "control" field has the following bits: + - Bit 0: Error + - Bit 1: Read + - Bit 2: Skip + - Bit 3: Select. The upper 16 bits are the selected index. -Like fw_cfg_add_file(), but additionally sets pointers to a callback -function (and opaque argument), which will be executed host-side by -QEMU each time a byte is read by the guest from this particular item. +When an operation is triggered, if the "control" field has bit 3 set, the +upper 16 bits are interpreted as an index of a firmware configuration item. +This has the same effect as writing the selector register. -NOTE: The callback function is given the opaque argument set by -fw_cfg_add_file_callback(), but also the current data offset, -allowing it the option of only acting upon specific offset values -(e.g., 0, before the first data byte of the selected item is -returned to the guest). +If the "control" field has bit 1 set, a read operation will be performed. +"length" bytes for the current selector and offset will be copied into the +physical RAM address specified by the "address" field. -== fw_cfg_modify_file() == +If the "control" field has bit 2 set (and not bit 1), a skip operation will be +performed. The offset for the current selector will be advanced "length" bytes. -Given a filename (i.e., fw_cfg item name), starting pointer, and size, -completely replace the configuration item referenced by the given item -name with the new given blob. If an existing blob is found, its -callback information is removed, and a pointer to the old data is -returned to allow the caller to free it, helping avoid memory leaks. -If a configuration item does not already exist under the given item -name, a new item will be created as with fw_cfg_add_file(), and NULL -is returned to the caller. In any case, the data referenced by the -starting pointer is only linked, NOT copied, into the fw_cfg data -structure. +To check the result, read the "control" field: + error bit set -> something went wrong. + all bits cleared -> transfer finished successfully. + otherwise -> transfer still in progress (doesn't happen + today due to implementation not being async, + but may in the future). -== fw_cfg_add_callback() == - -Like fw_cfg_add_bytes(), but additionally sets pointers to a callback -function (and opaque argument), which will be executed host-side by -QEMU each time a guest-side write operation to this particular item -completes fully overwriting the item's data. - -NOTE: This function is deprecated, and will be completely removed -starting with QEMU v2.4. - -== Externally Provided Items == += Externally Provided Items = As of v2.4, "file" fw_cfg items (i.e., items with selector keys above FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file @@ -213,14 +210,27 @@ the following syntax: -fw_cfg [name=]<item_name>,file=<path> -where <item_name> is the fw_cfg item name, and <path> is the location -on the host file system of a file containing the data to be inserted. +Or + + -fw_cfg [name=]<item_name>,string=<string> + +See QEMU man page for more documentation. + +Using item_name with plain ASCII characters only is recommended. + +Item names beginning with "opt/" are reserved for users. QEMU will +never create entries with such names unless explicitly ordered by the +user. + +To avoid clashes among different users, it is strongly recommended +that you use names beginning with opt/RFQDN/, where RFQDN is a reverse +fully qualified domain name you control. For instance, if SeaBIOS +wanted to define additional names, the prefix "opt/org.seabios/" would +be appropriate. -NOTE: Users *SHOULD* choose item names beginning with the prefix "opt/" -when using the "-fw_cfg" command line option, to avoid conflicting with -item names used internally by QEMU. For instance: +For historical reasons, "opt/ovmf/" is reserved for OVMF firmware. - -fw_cfg name=opt/my_item_name,file=./my_blob.bin +Prefix "opt/org.qemu/" is reserved for QEMU itself. -Similarly, QEMU developers *SHOULD NOT* use item names prefixed with -"opt/" when inserting items programmatically, e.g. via fw_cfg_add_file(). +Use of names not beginning with "opt/" is potentially dangerous and +entirely unsupported. QEMU will warn if you try. diff --git a/qemu/docs/specs/ivshmem-spec.txt b/qemu/docs/specs/ivshmem-spec.txt new file mode 100644 index 000000000..a1f549979 --- /dev/null +++ b/qemu/docs/specs/ivshmem-spec.txt @@ -0,0 +1,254 @@ += Device Specification for Inter-VM shared memory device = + +The Inter-VM shared memory device (ivshmem) is designed to share a +memory region between multiple QEMU processes running different guests +and the host. In order for all guests to be able to pick up the +shared memory area, it is modeled by QEMU as a PCI device exposing +said memory to the guest as a PCI BAR. + +The device can use a shared memory object on the host directly, or it +can obtain one from an ivshmem server. + +In the latter case, the device can additionally interrupt its peers, and +get interrupted by its peers. + + +== Configuring the ivshmem PCI device == + +There are two basic configurations: + +- Just shared memory: -device ivshmem-plain,memdev=HMB,... + + This uses host memory backend HMB. It should have option "share" + set. + +- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,... + + An ivshmem server must already be running on the host. The device + connects to the server's UNIX domain socket via character device + CHR. + + Each peer gets assigned a unique ID by the server. IDs must be + between 0 and 65535. + + Interrupts are message-signaled (MSI-X). vectors=N configures the + number of vectors to use. + +For more details on ivshmem device properties, see The QEMU Emulator +User Documentation (qemu-doc.*). + + +== The ivshmem PCI device's guest interface == + +The device has vendor ID 1af4, device ID 1110, revision 1. Before +QEMU 2.6.0, it had revision 0. + +=== PCI BARs === + +The ivshmem PCI device has two or three BARs: + +- BAR0 holds device registers (256 Byte MMIO) +- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell) +- BAR2 maps the shared memory object + +There are two ways to use this device: + +- If you only need the shared memory part, BAR2 suffices. This way, + you have access to the shared memory in the guest and can use it as + you see fit. Memnic, for example, uses ivshmem this way from guest + user space (see http://dpdk.org/browse/memnic). + +- If you additionally need the capability for peers to interrupt each + other, you need BAR0 and BAR1. You will most likely want to write a + kernel driver to handle interrupts. Requires the device to be + configured for interrupts, obviously. + +Before QEMU 2.6.0, BAR2 can initially be invalid if the device is +configured for interrupts. It becomes safely accessible only after +the ivshmem server provided the shared memory. These devices have PCI +revision 0 rather than 1. Guest software should wait for the +IVPosition register (described below) to become non-negative before +accessing BAR2. + +Revision 0 of the device is not capable to tell guest software whether +it is configured for interrupts. + +=== PCI device registers === + +BAR 0 contains the following registers: + + Offset Size Access On reset Function + 0 4 read/write 0 Interrupt Mask + bit 0: peer interrupt (rev 0) + reserved (rev 1) + bit 1..31: reserved + 4 4 read/write 0 Interrupt Status + bit 0: peer interrupt (rev 0) + reserved (rev 1) + bit 1..31: reserved + 8 4 read-only 0 or ID IVPosition + 12 4 write-only N/A Doorbell + bit 0..15: vector + bit 16..31: peer ID + 16 240 none N/A reserved + +Software should only access the registers as specified in column +"Access". Reserved bits should be ignored on read, and preserved on +write. + +In revision 0 of the device, Interrupt Status and Mask Register +together control the legacy INTx interrupt when the device has no +MSI-X capability: INTx is asserted when the bit-wise AND of Status and +Mask is non-zero and the device has no MSI-X capability. Interrupt +Status Register bit 0 becomes 1 when an interrupt request from a peer +is received. Reading the register clears it. + +IVPosition Register: if the device is not configured for interrupts, +this is zero. Else, it is the device's ID (between 0 and 65535). + +Before QEMU 2.6.0, the register may read -1 for a short while after +reset. These devices have PCI revision 0 rather than 1. + +There is no good way for software to find out whether the device is +configured for interrupts. A positive IVPosition means interrupts, +but zero could be either. + +Doorbell Register: writing this register requests to interrupt a peer. +The written value's high 16 bits are the ID of the peer to interrupt, +and its low 16 bits select an interrupt vector. + +If the device is not configured for interrupts, the write is ignored. + +If the interrupt hasn't completed setup, the write is ignored. The +device is not capable to tell guest software whether setup is +complete. Interrupts can regress to this state on migration. + +If the peer with the requested ID isn't connected, or it has fewer +interrupt vectors connected, the write is ignored. The device is not +capable to tell guest software what peers are connected, or how many +interrupt vectors are connected. + +The peer's interrupt for this vector then becomes pending. There is +no way for software to clear the pending bit, and a polling mode of +operation is therefore impossible. + +If the peer is a revision 0 device without MSI-X capability, its +Interrupt Status register is set to 1. This asserts INTx unless +masked by the Interrupt Mask register. The device is not capable to +communicate the interrupt vector to guest software then. + +With multiple MSI-X vectors, different vectors can be used to indicate +different events have occurred. The semantics of interrupt vectors +are left to the application. + + +== Interrupt infrastructure == + +When configured for interrupts, the peers share eventfd objects in +addition to shared memory. The shared resources are managed by an +ivshmem server. + +=== The ivshmem server === + +The server listens on a UNIX domain socket. + +For each new client that connects to the server, the server +- picks an ID, +- creates eventfd file descriptors for the interrupt vectors, +- sends the ID and the file descriptor for the shared memory to the + new client, +- sends connect notifications for the new client to the other clients + (these contain file descriptors for sending interrupts), +- sends connect notifications for the other clients to the new client, + and +- sends interrupt setup messages to the new client (these contain file + descriptors for receiving interrupts). + +The first client to connect to the server receives ID zero. + +When a client disconnects from the server, the server sends disconnect +notifications to the other clients. + +The next section describes the protocol in detail. + +If the server terminates without sending disconnect notifications for +its connected clients, the clients can elect to continue. They can +communicate with each other normally, but won't receive disconnect +notification on disconnect, and no new clients can connect. There is +no way for the clients to connect to a restarted server. The device +is not capable to tell guest software whether the server is still up. + +Example server code is in contrib/ivshmem-server/. Not to be used in +production. It assumes all clients use the same number of interrupt +vectors. + +A standalone client is in contrib/ivshmem-client/. It can be useful +for debugging. + +=== The ivshmem Client-Server Protocol === + +An ivshmem device configured for interrupts connects to an ivshmem +server. This section details the protocol between the two. + +The connection is one-way: the server sends messages to the client. +Each message consists of a single 8 byte little-endian signed number, +and may be accompanied by a file descriptor via SCM_RIGHTS. Both +client and server close the connection on error. + +Note: QEMU currently doesn't close the connection right on error, but +only when the character device is destroyed. + +On connect, the server sends the following messages in order: + +1. The protocol version number, currently zero. The client should + close the connection on receipt of versions it can't handle. + +2. The client's ID. This is unique among all clients of this server. + IDs must be between 0 and 65535, because the Doorbell register + provides only 16 bits for them. + +3. The number -1, accompanied by the file descriptor for the shared + memory. + +4. Connect notifications for existing other clients, if any. This is + a peer ID (number between 0 and 65535 other than the client's ID), + repeated N times. Each repetition is accompanied by one file + descriptor. These are for interrupting the peer with that ID using + vector 0,..,N-1, in order. If the client is configured for fewer + vectors, it closes the extra file descriptors. If it is configured + for more, the extra vectors remain unconnected. + +5. Interrupt setup. This is the client's own ID, repeated N times. + Each repetition is accompanied by one file descriptor. These are + for receiving interrupts from peers using vector 0,..,N-1, in + order. If the client is configured for fewer vectors, it closes + the extra file descriptors. If it is configured for more, the + extra vectors remain unconnected. + +From then on, the server sends these kinds of messages: + +6. Connection / disconnection notification. This is a peer ID. + + - If the number comes with a file descriptor, it's a connection + notification, exactly like in step 4. + + - Else, it's a disconnection notification for the peer with that ID. + +Known bugs: + +* The protocol changed incompatibly in QEMU 2.5. Before, messages + were native endian long, and there was no version number. + +* The protocol is poorly designed. + +=== The ivshmem Client-Client Protocol === + +An ivshmem device configured for interrupts receives eventfd file +descriptors for interrupting peers and getting interrupted by peers +from the server, as explained in the previous section. + +To interrupt a peer, the device writes the 8-byte integer 1 in native +byte order to the respective file descriptor. + +To receive an interrupt, the device reads and discards as many 8-byte +integers as it can. diff --git a/qemu/docs/specs/ivshmem_device_spec.txt b/qemu/docs/specs/ivshmem_device_spec.txt deleted file mode 100644 index 667a8628f..000000000 --- a/qemu/docs/specs/ivshmem_device_spec.txt +++ /dev/null @@ -1,96 +0,0 @@ - -Device Specification for Inter-VM shared memory device ------------------------------------------------------- - -The Inter-VM shared memory device is designed to share a region of memory to -userspace in multiple virtual guests. The memory region does not belong to any -guest, but is a POSIX memory object on the host. Optionally, the device may -support sending interrupts to other guests sharing the same memory region. - - -The Inter-VM PCI device ------------------------ - -*BARs* - -The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support -registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is -used to map the shared memory object from the host. The size of BAR2 is -specified when the guest is started and must be a power of 2 in size. - -*Registers* - -The device currently supports 4 registers of 32-bits each. Registers -are used for synchronization between guests sharing the same memory object when -interrupts are supported (this requires using the shared memory server). - -The server assigns each VM an ID number and sends this ID number to the QEMU -process when the guest starts. - -enum ivshmem_registers { - IntrMask = 0, - IntrStatus = 4, - IVPosition = 8, - Doorbell = 12 -}; - -The first two registers are the interrupt mask and status registers. Mask and -status are only used with pin-based interrupts. They are unused with MSI -interrupts. - -Status Register: The status register is set to 1 when an interrupt occurs. - -Mask Register: The mask register is bitwise ANDed with the interrupt status -and the result will raise an interrupt if it is non-zero. However, since 1 is -the only value the status will be set to, it is only the first bit of the mask -that has any effect. Therefore interrupts can be masked by setting the first -bit to 0 and unmasked by setting the first bit to 1. - -IVPosition Register: The IVPosition register is read-only and reports the -guest's ID number. The guest IDs are non-negative integers. When using the -server, since the server is a separate process, the VM ID will only be set when -the device is ready (shared memory is received from the server and accessible via -the device). If the device is not ready, the IVPosition will return -1. -Applications should ensure that they have a valid VM ID before accessing the -shared memory. - -Doorbell Register: To interrupt another guest, a guest must write to the -Doorbell register. The doorbell register is 32-bits, logically divided into -two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low -16-bits are the interrupt vector to trigger. The semantics of the value -written to the doorbell depends on whether the device is using MSI or a regular -pin-based interrupt. In short, MSI uses vectors while regular interrupts set the -status register. - -Regular Interrupts - -If regular interrupts are used (due to either a guest not supporting MSI or the -user specifying not to use them on startup) then the value written to the lower -16-bits of the Doorbell register results is arbitrary and will trigger an -interrupt in the destination guest. - -Message Signalled Interrupts - -A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits -written to the Doorbell register must be between 0 and the maximum number of -vectors the guest supports. The lower 16 bits written to the doorbell is the -MSI vector that will be raised in the destination guest. The number of MSI -vectors is configurable but it is set when the VM is started. - -The important thing to remember with MSI is that it is only a signal, no status -is set (since MSI interrupts are not shared). All information other than the -interrupt itself should be communicated via the shared memory region. Devices -supporting multiple MSI vectors can use different vectors to indicate different -events have occurred. The semantics of interrupt vectors are left to the -user's discretion. - - -Usage in the Guest ------------------- - -The shared memory device is intended to be used with the provided UIO driver. -Very little configuration is needed. The guest should map BAR0 to access the -registers (an array of 32-bit ints allows simple writing) and map BAR2 to -access the shared memory region itself. The size of the shared memory region -is specified when the guest (or shared memory server) is started. A guest may -map the whole shared memory region or only part of it. diff --git a/qemu/docs/specs/parallels.txt b/qemu/docs/specs/parallels.txt new file mode 100644 index 000000000..b4fe2295f --- /dev/null +++ b/qemu/docs/specs/parallels.txt @@ -0,0 +1,228 @@ += License = + +Copyright (c) 2015 Denis Lunev +Copyright (c) 2015 Vladimir Sementsov-Ogievskiy + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + += Parallels Expandable Image File Format = + +A Parallels expandable image file consists of three consecutive parts: + * header + * BAT + * data area + +All numbers in a Parallels expandable image are stored in little-endian byte +order. + + +== Definitions == + + Sector A 512-byte data chunk. + + Cluster A data chunk of the size specified in the image header. + Currently, the default size is 1MiB (2048 sectors). In previous + versions, cluster sizes of 63 sectors, 256 and 252 kilobytes were + used. + + BAT Block Allocation Table, an entity that contains information for + guest-to-host I/O data address translation. + + +== Header == + +The header is placed at the start of an image and contains the following +fields: + +Bytes: + 0 - 15: magic + Must contain "WithoutFreeSpace" or "WithouFreSpacExt". + + 16 - 19: version + Must be 2. + + 20 - 23: heads + Disk geometry parameter for guest. + + 24 - 27: cylinders + Disk geometry parameter for guest. + + 28 - 31: tracks + Cluster size, in sectors. + + 32 - 35: nb_bat_entries + Disk size, in clusters (BAT size). + + 36 - 43: nb_sectors + Disk size, in sectors. + + For "WithoutFreeSpace" images: + Only the lowest 4 bytes are used. The highest 4 bytes must be + cleared in this case. + + For "WithouFreSpacExt" images, there are no such + restrictions. + + 44 - 47: in_use + Set to 0x746F6E59 when the image is opened by software in R/W + mode; set to 0x312e3276 when the image is closed. + + A zero in this field means that the image was opened by an old + version of the software that doesn't support Format Extension + (see below). + + Other values are not allowed. + + 48 - 51: data_off + An offset, in sectors, from the start of the file to the start of + the data area. + + For "WithoutFreeSpace" images: + - If data_off is zero, the offset is calculated as the end of BAT + table plus some padding to ensure sector size alignment. + - If data_off is non-zero, the offset should be aligned to sector + size. However it is recommended to align it to cluster size for + newly created images. + + For "WithouFreSpacExt" images: + data_off must be non-zero and aligned to cluster size. + + 52 - 55: flags + Miscellaneous flags. + + Bit 0: Empty Image bit. If set, the image should be + considered clear. + + Bits 2-31: Unused. + + 56 - 63: ext_off + Format Extension offset, an offset, in sectors, from the start of + the file to the start of the Format Extension Cluster. + + ext_off must meet the same requirements as cluster offsets + defined by BAT entries (see below). + + +== BAT == + +BAT is placed immediately after the image header. In the file, BAT is a +contiguous array of 32-bit unsigned little-endian integers with +(bat_entries * 4) bytes size. + +Each BAT entry contains an offset from the start of the file to the +corresponding cluster. The offset set in clusters for "WithouFreSpacExt" images +and in sectors for "WithoutFreeSpace" images. + +If a BAT entry is zero, the corresponding cluster is not allocated and should +be considered as filled with zeroes. + +Cluster offsets specified by BAT entries must meet the following requirements: + - the value must not be lower than data offset (provided by header.data_off + or calculated as specified above), + - the value must be lower than the desired file size, + - the value must be unique among all BAT entries, + - the result of (cluster offset - data offset) must be aligned to cluster + size. + + +== Data Area == + +The data area is an area from the data offset (provided by header.data_off or +calculated as specified above) to the end of the file. It represents a +contiguous array of clusters. Most of them are allocated by the BAT, some may +be allocated by the ext_off field in the header while other may be allocated by +extensions. All clusters allocated by ext_off and extensions should meet the +same requirements as clusters specified by BAT entries. + + +== Format Extension == + +The Format Extension is an area 1 cluster in size that provides additional +format features. This cluster is addressed by the ext_off field in the header. +The format of the Format Extension area is the following: + + 0 - 7: magic + Must be 0xAB234CEF23DCEA87 + + 8 - 23: m_CheckSum + The MD5 checksum of the entire Header Extension cluster except + the first 24 bytes. + + The above are followed by feature sections or "extensions". The last + extension must be "End of features" (see below). + +Each feature section has the following format: + + 0 - 7: magic + The identifier of the feature: + 0x0000000000000000 - End of features + 0x20385FAE252CB34A - Dirty bitmap + + 8 - 15: flags + External flags for extension: + + Bit 0: NECESSARY + If the software cannot load the extension (due to an + unknown magic number or error), the file should not be + changed. If this flag is unset and there is an error on + loading the extension, said extension should be dropped. + + Bit 1: TRANSIT + If there is an unknown extension with this flag set, + said extension should be left as is. + + If neither NECESSARY nor TRANSIT are set, the extension should be + dropped. + + 16 - 19: data_size + The size of the following feature data, in bytes. + + 20 - 23: unused32 + Align header to 8 bytes boundary. + + variable: data (data_size bytes) + + The above is followed by padding to the next 8 bytes boundary, then the + next extension starts. + + The last extension must be "End of features" with all the fields set to 0. + + +=== Dirty bitmaps feature === + +This feature provides a way of storing dirty bitmaps in the image. The fields +of its data area are: + + 0 - 7: size + The bitmap size, should be equal to disk size in sectors. + + 8 - 23: id + An identifier for backup consistency checking. + + 24 - 27: granularity + Bitmap granularity, in sectors. I.e., the number of sectors + corresponding to one bit of the bitmap. Granularity must be + a power of 2. + + 28 - 31: l1_size + The number of entries in the L1 table of the bitmap. + + variable: l1 (64 * l1_size bytes) + L1 offset table (in bytes) + +A dirty bitmap is stored using a one-level structure for the mapping to host +clusters - an L1 table. + +Given an offset in bytes into the bitmap data, the offset in bytes into the +image file can be obtained as follows: + + offset = l1_table[offset / cluster_size] + (offset % cluster_size) + +If an L1 table entry is 0, the corresponding cluster of the bitmap is assumed +to be zero. + +If an L1 table entry is 1, the corresponding cluster of the bitmap is assumed +to have all bits set. + +If an L1 table entry is not 0 or 1, it allocates a cluster from the data area. diff --git a/qemu/docs/specs/pci-ids.txt b/qemu/docs/specs/pci-ids.txt index 0adcb89aa..fd27c677d 100644 --- a/qemu/docs/specs/pci-ids.txt +++ b/qemu/docs/specs/pci-ids.txt @@ -15,13 +15,23 @@ The 1000 -> 10ff device ID range is used as follows for virtio-pci devices. Note that this allocation separate from the virtio device IDs, which are maintained as part of the virtio specification. -1af4:1000 network device -1af4:1001 block device -1af4:1002 balloon device -1af4:1003 console device -1af4:1004 SCSI host bus adapter device -1af4:1005 entropy generator device -1af4:1009 9p filesystem device +1af4:1000 network device (legacy) +1af4:1001 block device (legacy) +1af4:1002 balloon device (legacy) +1af4:1003 console device (legacy) +1af4:1004 SCSI host bus adapter device (legacy) +1af4:1005 entropy generator device (legacy) +1af4:1009 9p filesystem device (legacy) + +1af4:1041 network device (modern) +1af4:1042 block device (modern) +1af4:1043 console device (modern) +1af4:1044 entropy generator device (modern) +1af4:1045 balloon device (modern) +1af4:1048 SCSI host bus adapter device (modern) +1af4:1049 9p filesystem device (modern) +1af4:1050 virtio gpu device (modern) +1af4:1052 virtio input device (modern) 1af4:10f0 Available for experimental usage without registration. Must get to official ID when the code leaves the test lab (i.e. when seeking diff --git a/qemu/docs/specs/ppc-spapr-hcalls.txt b/qemu/docs/specs/ppc-spapr-hcalls.txt index 667b3fa00..5bd8eab78 100644 --- a/qemu/docs/specs/ppc-spapr-hcalls.txt +++ b/qemu/docs/specs/ppc-spapr-hcalls.txt @@ -41,8 +41,8 @@ When the guest runs in "real mode" (in powerpc lingua this means with MMU disabled, ie guest effective == guest physical), it only has access to a subset of memory and no IOs. -PAPR provides a set of hypervisor calls to perform cachable or -non-cachable accesses to any guest physical addresses that the +PAPR provides a set of hypervisor calls to perform cacheable or +non-cacheable accesses to any guest physical addresses that the guest can use in order to access IO devices while in real mode. This is typically used by the firmware running in the guest. diff --git a/qemu/docs/specs/ppc-spapr-hotplug.txt b/qemu/docs/specs/ppc-spapr-hotplug.txt index 46e07196b..631b0cada 100644 --- a/qemu/docs/specs/ppc-spapr-hotplug.txt +++ b/qemu/docs/specs/ppc-spapr-hotplug.txt @@ -302,4 +302,52 @@ consisting of <phys>, <size> and <maxcpus>. pseries guests use this property to note the maximum allowed CPUs for the guest. +== ibm,dynamic-reconfiguration-memory == + +ibm,dynamic-reconfiguration-memory is a device tree node that represents +dynamically reconfigurable logical memory blocks (LMB). This node +is generated only when the guest advertises the support for it via +ibm,client-architecture-support call. Memory that is not dynamically +reconfigurable is represented by /memory nodes. The properties of this +node that are of interest to the sPAPR memory hotplug implementation +in QEMU are described here. + +ibm,lmb-size + +This 64bit integer defines the size of each dynamically reconfigurable LMB. + +ibm,associativity-lookup-arrays + +This property defines a lookup array in which the NUMA associativity +information for each LMB can be found. It is a property encoded array +that begins with an integer M, the number of associativity lists followed +by an integer N, the number of entries per associativity list and terminated +by M associativity lists each of length N integers. + +This property provides the same information as given by ibm,associativity +property in a /memory node. Each assigned LMB has an index value between +0 and M-1 which is used as an index into this table to select which +associativity list to use for the LMB. This index value for each LMB +is defined in ibm,dynamic-memory property. + +ibm,dynamic-memory + +This property describes the dynamically reconfigurable memory. It is a +property encoded array that has an integer N, the number of LMBs followed +by N LMB list entires. + +Each LMB list entry consists of the following elements: + +- Logical address of the start of the LMB encoded as a 64bit integer. This + corresponds to reg property in /memory node. +- DRC index of the LMB that corresponds to ibm,my-drc-index property + in a /memory node. +- Four bytes reserved for expansion. +- Associativity list index for the LMB that is used as an index into + ibm,associativity-lookup-arrays property described earlier. This + is used to retrieve the right associativity list to be used for this + LMB. +- A 32bit flags word. The bit at bit position 0x00000008 defines whether + the LMB is assigned to the the partition as of boot time. + [1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867 diff --git a/qemu/docs/specs/qcow2.txt b/qemu/docs/specs/qcow2.txt index 121dfc8cc..80cdfd0e9 100644 --- a/qemu/docs/specs/qcow2.txt +++ b/qemu/docs/specs/qcow2.txt @@ -103,7 +103,18 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. - Bits 0-63: Reserved (set to 0) + Bit 0: Bitmaps extension bit + This bit indicates consistency for the bitmaps + extension data. + + It is an error if this bit is set without the + bitmaps extension present. + + If the bitmaps extension is present but this + bit is unset, the bitmaps extension data must be + considered inconsistent. + + Bits 1-63: Reserved (set to 0) 96 - 99: refcount_order Describes the width of a reference count block entry (width @@ -123,6 +134,7 @@ be stored. Each extension has a structure like the following: 0x00000000 - End of the header extension area 0xE2792ACA - Backing file format name 0x6803f857 - Feature name table + 0x23852875 - Bitmaps extension other - Unknown header extension, can be safely ignored @@ -166,6 +178,36 @@ the header extension data. Each entry look like this: terminated if it has full length) +== Bitmaps extension == + +The bitmaps extension is an optional header extension. It provides the ability +to store bitmaps related to a virtual disk. For now, there is only one bitmap +type: the dirty tracking bitmap, which tracks virtual disk changes from some +point in time. + +The data of the extension should be considered consistent only if the +corresponding auto-clear feature bit is set, see autoclear_features above. + +The fields of the bitmaps extension are: + + Byte 0 - 3: nb_bitmaps + The number of bitmaps contained in the image. Must be + greater than or equal to 1. + + Note: Qemu currently only supports up to 65535 bitmaps per + image. + + 4 - 7: Reserved, must be zero. + + 8 - 15: bitmap_directory_size + Size of the bitmap directory in bytes. It is the cumulative + size of all (nb_bitmaps) bitmap headers. + + 16 - 23: bitmap_directory_offset + Offset into the image file at which the bitmap directory + starts. Must be aligned to a cluster boundary. + + == Host cluster management == qcow2 manages the allocation of host clusters by maintaining a reference count @@ -257,7 +299,7 @@ L2 table entry: 63: 0 for a cluster that is unused or requires COW, 1 if its refcount is exactly one. This information is only accurate - in L2 tables that are reachable from the the active L1 + in L2 tables that are reachable from the active L1 table. Standard Cluster Descriptor: @@ -360,3 +402,180 @@ Snapshot table entry: variable: Padding to round up the snapshot table entry size to the next multiple of 8. + + +== Bitmaps == + +As mentioned above, the bitmaps extension provides the ability to store bitmaps +related to a virtual disk. This section describes how these bitmaps are stored. + +All stored bitmaps are related to the virtual disk stored in the same image, so +each bitmap size is equal to the virtual disk size. + +Each bit of the bitmap is responsible for strictly defined range of the virtual +disk. For bit number bit_nr the corresponding range (in bytes) will be: + + [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1] + +Granularity is a property of the concrete bitmap, see below. + + +=== Bitmap directory === + +Each bitmap saved in the image is described in a bitmap directory entry. The +bitmap directory is a contiguous area in the image file, whose starting offset +and length are given by the header extension fields bitmap_directory_offset and +bitmap_directory_size. The entries of the bitmap directory have variable +length, depending on the lengths of the bitmap name and extra data. These +entries are also called bitmap headers. + +Structure of a bitmap directory entry: + + Byte 0 - 7: bitmap_table_offset + Offset into the image file at which the bitmap table + (described below) for the bitmap starts. Must be aligned to + a cluster boundary. + + 8 - 11: bitmap_table_size + Number of entries in the bitmap table of the bitmap. + + 12 - 15: flags + Bit + 0: in_use + The bitmap was not saved correctly and may be + inconsistent. + + 1: auto + The bitmap must reflect all changes of the virtual + disk by any application that would write to this qcow2 + file (including writes, snapshot switching, etc.). The + type of this bitmap must be 'dirty tracking bitmap'. + + 2: extra_data_compatible + This flags is meaningful when the extra data is + unknown to the software (currently any extra data is + unknown to Qemu). + If it is set, the bitmap may be used as expected, extra + data must be left as is. + If it is not set, the bitmap must not be used, but + both it and its extra data be left as is. + + Bits 3 - 31 are reserved and must be 0. + + 16: type + This field describes the sort of the bitmap. + Values: + 1: Dirty tracking bitmap + + Values 0, 2 - 255 are reserved. + + 17: granularity_bits + Granularity bits. Valid values: 0 - 63. + + Note: Qemu currently doesn't support granularity_bits + greater than 31. + + Granularity is calculated as + granularity = 1 << granularity_bits + + A bitmap's granularity is how many bytes of the image + accounts for one bit of the bitmap. + + 18 - 19: name_size + Size of the bitmap name. Must be non-zero. + + Note: Qemu currently doesn't support values greater than + 1023. + + 20 - 23: extra_data_size + Size of type-specific extra data. + + For now, as no extra data is defined, extra_data_size is + reserved and should be zero. If it is non-zero the + behavior is defined by extra_data_compatible flag. + + variable: extra_data + Extra data for the bitmap, occupying extra_data_size bytes. + Extra data must never contain references to clusters or in + some other way allocate additional clusters. + + variable: name + The name of the bitmap (not null terminated), occupying + name_size bytes. Must be unique among all bitmap names + within the bitmaps extension. + + variable: Padding to round up the bitmap directory entry size to the + next multiple of 8. All bytes of the padding must be zero. + + +=== Bitmap table === + +Each bitmap is stored using a one-level structure (as opposed to two-level +structures like for refcounts and guest clusters mapping) for the mapping of +bitmap data to host clusters. This structure is called the bitmap table. + +Each bitmap table has a variable size (stored in the bitmap directory entry) +and may use multiple clusters, however, it must be contiguous in the image +file. + +Structure of a bitmap table entry: + + Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero. + If bits 9 - 55 are zero: + 0: Cluster should be read as all zeros. + 1: Cluster should be read as all ones. + + 1 - 8: Reserved and must be zero. + + 9 - 55: Bits 9 - 55 of the host cluster offset. Must be aligned to + a cluster boundary. If the offset is 0, the cluster is + unallocated; in that case, bit 0 determines how this + cluster should be treated during reads. + + 56 - 63: Reserved and must be zero. + + +=== Bitmap data === + +As noted above, bitmap data is stored in separate clusters, described by the +bitmap table. Given an offset (in bytes) into the bitmap data, the offset into +the image file can be obtained as follows: + + image_offset(bitmap_data_offset) = + bitmap_table[bitmap_data_offset / cluster_size] + + (bitmap_data_offset % cluster_size) + +This offset is not defined if bits 9 - 55 of bitmap table entry are zero (see +above). + +Given an offset byte_nr into the virtual disk and the bitmap's granularity, the +bit offset into the image file to the corresponding bit of the bitmap can be +calculated like this: + + bit_offset(byte_nr) = + image_offset(byte_nr / granularity / 8) * 8 + + (byte_nr / granularity) % 8 + +If the size of the bitmap data is not a multiple of the cluster size then the +last cluster of the bitmap data contains some unused tail bits. These bits must +be zero. + + +=== Dirty tracking bitmaps === + +Bitmaps with 'type' field equal to one are dirty tracking bitmaps. + +When the virtual disk is in use dirty tracking bitmap may be 'enabled' or +'disabled'. While the bitmap is 'enabled', all writes to the virtual disk +should be reflected in the bitmap. A set bit in the bitmap means that the +corresponding range of the virtual disk (see above) was written to while the +bitmap was 'enabled'. An unset bit means that this range was not written to. + +The software doesn't have to sync the bitmap in the image file with its +representation in RAM after each write. Flag 'in_use' should be set while the +bitmap is not synced. + +In the image file the 'enabled' state is reflected by the 'auto' flag. If this +flag is set, the software must consider the bitmap as 'enabled' and start +tracking virtual disk changes to this bitmap from the first write to the +virtual disk. If this flag is not set then the bitmap is disabled. diff --git a/qemu/docs/specs/rocker.txt b/qemu/docs/specs/rocker.txt index 1c743515c..d2a82624f 100644 --- a/qemu/docs/specs/rocker.txt +++ b/qemu/docs/specs/rocker.txt @@ -297,7 +297,7 @@ but not fired. If only partial credits are returned, the interrupt remains masked but the device generates an interrupt, signaling the driver that more outstanding work is available. -(* this masking is unrelated to to the MSI-X interrupt mask register) +(* this masking is unrelated to the MSI-X interrupt mask register) Endianness ---------- diff --git a/qemu/docs/specs/vhost-user.txt b/qemu/docs/specs/vhost-user.txt index 650bb1818..777c49cfe 100644 --- a/qemu/docs/specs/vhost-user.txt +++ b/qemu/docs/specs/vhost-user.txt @@ -87,6 +87,14 @@ Depending on the request type, payload can be: User address: a 64-bit user address mmap offset: 64-bit offset where region starts in the mapped memory +* Log description + --------------------------- + | log size | log offset | + --------------------------- + log size: size of area used for logging + log offset: offset from start of supplied file descriptor + where logging starts (i.e. where guest address 0 would be logged) + In QEMU the vhost-user message is implemented with the following struct: typedef struct VhostUserMsg { @@ -98,6 +106,7 @@ typedef struct VhostUserMsg { struct vhost_vring_state state; struct vhost_vring_addr addr; VhostUserMemory memory; + VhostUserLog log; }; } QEMU_PACKED VhostUserMsg; @@ -113,12 +122,15 @@ message replies. Most of the requests don't require replies. Here is a list of the ones that do: * VHOST_GET_FEATURES + * VHOST_GET_PROTOCOL_FEATURES * VHOST_GET_VRING_BASE + * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) There are several messages that the master sends with file descriptors passed in the ancillary data: * VHOST_SET_MEM_TABLE + * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) * VHOST_SET_LOG_FD * VHOST_SET_VRING_KICK * VHOST_SET_VRING_CALL @@ -127,6 +139,122 @@ in the ancillary data: If Master is unable to send the full message or receives a wrong reply it will close the connection. An optional reconnection mechanism can be implemented. +Any protocol extensions are gated by protocol feature bits, +which allows full backwards compatibility on both master +and slave. +As older slaves don't support negotiating protocol features, +a feature bit was dedicated for this purpose: +#define VHOST_USER_F_PROTOCOL_FEATURES 30 + +Starting and stopping rings +---------------------- +Client must only process each ring when it is started. + +Client must only pass data between the ring and the +backend, when the ring is enabled. + +If ring is started but disabled, client must process the +ring without talking to the backend. + +For example, for a networking device, in the disabled state +client must not supply any new RX packets, but must process +and discard any TX packets. + +If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized +in an enabled state. + +If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized +in a disabled state. Client must not pass data to/from the backend until ring is enabled by +VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by +VHOST_USER_SET_VRING_ENABLE with parameter 0. + +Each ring is initialized in a stopped state, client must not process it until +ring is started, or after it has been stopped. + +Client must start ring upon receiving a kick (that is, detecting that file +descriptor is readable) on the descriptor specified by +VHOST_USER_SET_VRING_KICK, and stop ring upon receiving +VHOST_USER_GET_VRING_BASE. + +While processing the rings (whether they are enabled or not), client must +support changing some configuration aspects on the fly. + +Multiple queue support +---------------------- + +Multiple queue is treated as a protocol extension, hence the slave has to +implement protocol features first. The multiple queues feature is supported +only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set. + +The max number of queues the slave supports can be queried with message +VHOST_USER_GET_PROTOCOL_FEATURES. Master should stop when the number of +requested queues is bigger than that. + +As all queues share one connection, the master uses a unique index for each +queue in the sent message to identify a specified queue. One queue pair +is enabled initially. More queues are enabled dynamically, by sending +message VHOST_USER_SET_VRING_ENABLE. + +Migration +--------- + +During live migration, the master may need to track the modifications +the slave makes to the memory mapped regions. The client should mark +the dirty pages in a log. Once it complies to this logging, it may +declare the VHOST_F_LOG_ALL vhost feature. + +To start/stop logging of data/used ring writes, server may send messages +VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with +VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively. + +All the modifications to memory pointed by vring "descriptor" should +be marked. Modifications to "used" vring should be marked if +VHOST_VRING_F_LOG is part of ring's flags. + +Dirty pages are of size: +#define VHOST_LOG_PAGE 0x1000 + +The log memory fd is provided in the ancillary data of +VHOST_USER_SET_LOG_BASE message when the slave has +VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature. + +The size of the log is supplied as part of VhostUserMsg +which should be large enough to cover all known guest +addresses. Log starts at the supplied offset in the +supplied file descriptor. +The log covers from address 0 to the maximum of guest +regions. In pseudo-code, to mark page at "addr" as dirty: + +page = addr / VHOST_LOG_PAGE +log[page / 8] |= 1 << page % 8 + +Where addr is the guest physical address. + +Use atomic operations, as the log may be concurrently manipulated. + +Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG +is set for this ring), log_guest_addr should be used to calculate the log +offset: the write to first byte of the used ring is logged at this offset from +log start. Also note that this value might be outside the legal guest physical +address range (i.e. does not have to be covered by the VhostUserMemory table), +but the bit offset of the last byte of the ring must fall within +the size supplied by VhostUserLog. + +VHOST_USER_SET_LOG_FD is an optional message with an eventfd in +ancillary data, it may be used to inform the master that the log has +been modified. + +Once the source has finished migration, rings will be stopped by +the source. No further update must be done before rings are +restarted. + +Protocol features +----------------- + +#define VHOST_USER_PROTOCOL_F_MQ 0 +#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1 +#define VHOST_USER_PROTOCOL_F_RARP 2 + Message types ------------- @@ -138,6 +266,8 @@ Message types Slave payload: u64 Get from the underlying vhost implementation the features bitmask. + Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for + VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. * VHOST_USER_SET_FEATURES @@ -146,6 +276,33 @@ Message types Master payload: u64 Enable features in the underlying vhost implementation using a bitmask. + Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for + VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. + + * VHOST_USER_GET_PROTOCOL_FEATURES + + Id: 15 + Equivalent ioctl: VHOST_GET_FEATURES + Master payload: N/A + Slave payload: u64 + + Get the protocol feature bitmask from the underlying vhost implementation. + Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES. + Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support + this message even before VHOST_USER_SET_FEATURES was called. + + * VHOST_USER_SET_PROTOCOL_FEATURES + + Id: 16 + Ioctl: VHOST_SET_FEATURES + Master payload: u64 + + Enable protocol features in the underlying vhost implementation. + Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES. + Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support + this message even before VHOST_USER_SET_FEATURES was called. * VHOST_USER_SET_OWNER @@ -160,11 +317,13 @@ Message types * VHOST_USER_RESET_OWNER Id: 4 - Equivalent ioctl: VHOST_RESET_OWNER Master payload: N/A - Issued when a new connection is about to be closed. The Master will no - longer own this connection (and will usually close it). + This is no longer used. Used to be sent to request disabling + all rings, but some clients interpreted it to also discard + connection state (this interpretation would lead to bugs). + It is recommended that clients either ignore this message, + or use it to disable all rings. * VHOST_USER_SET_MEM_TABLE @@ -182,8 +341,14 @@ Message types Id: 6 Equivalent ioctl: VHOST_SET_LOG_BASE Master payload: u64 + Slave payload: N/A + + Sets logging shared memory space. + When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol + feature, the log memory fd is provided in the ancillary data of + VHOST_USER_SET_LOG_BASE message, the size and offset of shared + memory area provided in the message. - Sets the logging base address. * VHOST_USER_SET_LOG_FD @@ -199,7 +364,7 @@ Message types Equivalent ioctl: VHOST_SET_VRING_NUM Master payload: vring state description - Sets the number of vrings for this owner. + Set the size of the queue. * VHOST_USER_SET_VRING_ADDR @@ -264,3 +429,38 @@ Message types Bits (0-7) of the payload contain the vring index. Bit 8 is the invalid FD flag. This flag is set when there is no file descriptor in the ancillary data. + + * VHOST_USER_GET_QUEUE_NUM + + Id: 17 + Equivalent ioctl: N/A + Master payload: N/A + Slave payload: u64 + + Query how many queues the backend supports. This request should be + sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol + features by VHOST_USER_GET_PROTOCOL_FEATURES. + + * VHOST_USER_SET_VRING_ENABLE + + Id: 18 + Equivalent ioctl: N/A + Master payload: vring state description + + Signal slave to enable or disable corresponding vring. + This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES + has been negotiated. + + * VHOST_USER_SEND_RARP + + Id: 19 + Equivalent ioctl: N/A + Master payload: u64 + + Ask vhost user backend to broadcast a fake RARP to notify the migration + is terminated for guest that does not support GUEST_ANNOUNCE. + Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP + is present in VHOST_USER_GET_PROTOCOL_FEATURES. + The first 6 bytes of the payload contain the mac address of the guest to + allow the vhost user backend to construct and broadcast the fake RARP. diff --git a/qemu/docs/throttle.txt b/qemu/docs/throttle.txt new file mode 100644 index 000000000..28204e46c --- /dev/null +++ b/qemu/docs/throttle.txt @@ -0,0 +1,252 @@ +The QEMU throttling infrastructure +================================== +Copyright (C) 2016 Igalia, S.L. +Author: Alberto Garcia <berto@igalia.com> + +This work is licensed under the terms of the GNU GPL, version 2 or +later. See the COPYING file in the top-level directory. + +Introduction +------------ +QEMU includes a throttling module that can be used to set limits to +I/O operations. The code itself is generic and independent of the I/O +units, but it is currenly used to limit the number of bytes per second +and operations per second (IOPS) when performing disk I/O. + +This document explains how to use the throttling code in QEMU, and how +it works internally. The implementation is in throttle.c. + + +Using throttling to limit disk I/O +---------------------------------- +Two aspects of the disk I/O can be limited: the number of bytes per +second and the number of operations per second (IOPS). For each one of +them the user can set a global limit or separate limits for read and +write operations. This gives us a total of six different parameters. + +I/O limits can be set using the throttling.* parameters of -drive, or +using the QMP 'block_set_io_throttle' command. These are the names of +the parameters for both cases: + +|-----------------------+-----------------------| +| -drive | block_set_io_throttle | +|-----------------------+-----------------------| +| throttling.iops-total | iops | +| throttling.iops-read | iops_rd | +| throttling.iops-write | iops_wr | +| throttling.bps-total | bps | +| throttling.bps-read | bps_rd | +| throttling.bps-write | bps_wr | +|-----------------------+-----------------------| + +It is possible to set limits for both IOPS and bps and the same time, +and for each case we can decide whether to have separate read and +write limits or not, but note that if iops-total is set then neither +iops-read nor iops-write can be set. The same applies to bps-total and +bps-read/write. + +The default value of these parameters is 0, and it means 'unlimited'. + +In its most basic usage, the user can add a drive to QEMU with a limit +of 100 IOPS with the following -drive line: + + -drive file=hd0.qcow2,throttling.iops-total=100 + +We can do the same using QMP. In this case all these parameters are +mandatory, so we must set to 0 the ones that we don't want to limit: + + { "execute": "block_set_io_throttle", + "arguments": { + "device": "virtio0", + "iops": 100, + "iops_rd": 0, + "iops_wr": 0, + "bps": 0, + "bps_rd": 0, + "bps_wr": 0 + } + } + + +I/O bursts +---------- +In addition to the basic limits we have just seen, QEMU allows the +user to do bursts of I/O for a configurable amount of time. A burst is +an amount of I/O that can exceed the basic limit. Bursts are useful to +allow better performance when there are peaks of activity (the OS +boots, a service needs to be restarted) while keeping the average +limits lower the rest of the time. + +Two parameters control bursts: their length and the maximum amount of +I/O they allow. These two can be configured separately for each one of +the six basic parameters described in the previous section, but in +this section we'll use 'iops-total' as an example. + +The I/O limit during bursts is set using 'iops-total-max', and the +maximum length (in seconds) is set with 'iops-total-max-length'. So if +we want to configure a drive with a basic limit of 100 IOPS and allow +bursts of 2000 IOPS for 60 seconds, we would do it like this (the line +is split for clarity): + + -drive file=hd0.qcow2, + throttling.iops-total=100, + throttling.iops-total-max=2000, + throttling.iops-total-max-length=60 + +Or, with QMP: + + { "execute": "block_set_io_throttle", + "arguments": { + "device": "virtio0", + "iops": 100, + "iops_rd": 0, + "iops_wr": 0, + "bps": 0, + "bps_rd": 0, + "bps_wr": 0, + "iops_max": 2000, + "iops_max_length": 60, + } + } + +With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 +IOPS for 1 minute before it's throttled down to 100 IOPS. + +The user will be able to do bursts again if there's a sufficiently +long period of time with unused I/O (see below for details). + +The default value for 'iops-total-max' is 0 and it means that bursts +are not allowed. 'iops-total-max-length' can only be set if +'iops-total-max' is set as well, and its default value is 1 second. + +Here's the complete list of parameters for configuring bursts: + +|----------------------------------+-----------------------| +| -drive | block_set_io_throttle | +|----------------------------------+-----------------------| +| throttling.iops-total-max | iops_max | +| throttling.iops-total-max-length | iops_max_length | +| throttling.iops-read-max | iops_rd_max | +| throttling.iops-read-max-length | iops_rd_max_length | +| throttling.iops-write-max | iops_wr_max | +| throttling.iops-write-max-length | iops_wr_max_length | +| throttling.bps-total-max | bps_max | +| throttling.bps-total-max-length | bps_max_length | +| throttling.bps-read-max | bps_rd_max | +| throttling.bps-read-max-length | bps_rd_max_length | +| throttling.bps-write-max | bps_wr_max | +| throttling.bps-write-max-length | bps_wr_max_length | +|----------------------------------+-----------------------| + + +Controlling the size of I/O operations +-------------------------------------- +When applying IOPS limits all I/O operations are treated equally +regardless of their size. This means that the user can take advantage +of this in order to circumvent the limits and submit one huge I/O +request instead of several smaller ones. + +QEMU provides a setting called throttling.iops-size to prevent this +from happening. This setting specifies the size (in bytes) of an I/O +request for accounting purposes. Larger requests will be counted +proportionally to this size. + +For example, if iops-size is set to 4096 then an 8KB request will be +counted as two, and a 6KB request will be counted as one and a +half. This only applies to requests larger than iops-size: smaller +requests will be always counted as one, no matter their size. + +The default value of iops-size is 0 and it means that the size of the +requests is never taken into account when applying IOPS limits. + + +Applying I/O limits to groups of disks +-------------------------------------- +In all the examples so far we have seen how to apply limits to the I/O +performed on individual drives, but QEMU allows grouping drives so +they all share the same limits. + +The way it works is that each drive with I/O limits is assigned to a +group named using the throttling.group parameter. If this parameter is +not specified, then the device name (i.e. 'virtio0', 'ide0-hd0') will +be used as the group name. + +Limits set using the throttling.* parameters discussed earlier in this +document apply to the combined I/O of all members of a group. + +Consider this example: + + -drive file=hd1.qcow2,throttling.iops-total=6000,throttling.group=foo + -drive file=hd2.qcow2,throttling.iops-total=6000,throttling.group=foo + -drive file=hd3.qcow2,throttling.iops-total=3000,throttling.group=bar + -drive file=hd4.qcow2,throttling.iops-total=6000,throttling.group=foo + -drive file=hd5.qcow2,throttling.iops-total=3000,throttling.group=bar + -drive file=hd6.qcow2,throttling.iops-total=5000 + +Here hd1, hd2 and hd4 are all members of a group named 'foo' with a +combined IOPS limit of 6000, and hd3 and hd5 are members of 'bar'. hd6 +is left alone (technically it is part of a 1-member group). + +Limits are applied in a round-robin fashion so if there are concurrent +I/O requests on several drives of the same group they will be +distributed evenly. + +When I/O limits are applied to an existing drive using the QMP command +'block_set_io_throttle', the following things need to be taken into +account: + + - I/O limits are shared within the same group, so new values will + affect all members and overwrite the previous settings. In other + words: if different limits are applied to members of the same + group, the last one wins. + + - If 'group' is unset it is assumed to be the current group of that + drive. If the drive is not in a group yet, it will be added to a + group named after the device name. + + - If 'group' is set then the drive will be moved to that group if + it was member of a different one. In this case the limits + specified in the parameters will be applied to the new group + only. + + - I/O limits can be disabled by setting all of them to 0. In this + case the device will be removed from its group and the rest of + its members will not be affected. The 'group' parameter is + ignored. + + +The Leaky Bucket algorithm +-------------------------- +I/O limits in QEMU are implemented using the leaky bucket algorithm +(specifically the "Leaky bucket as a meter" variant). + +This algorithm uses the analogy of a bucket that leaks water +constantly. The water that gets into the bucket represents the I/O +that has been performed, and no more I/O is allowed once the bucket is +full. + +To see the way this corresponds to the throttling parameters in QEMU, +consider the following values: + + iops-total=100 + iops-total-max=2000 + iops-total-max-length=60 + + - Water leaks from the bucket at a rate of 100 IOPS. + - Water can be added to the bucket at a rate of 2000 IOPS. + - The size of the bucket is 2000 x 60 = 120000 + - If 'iops-total-max-length' is unset then the bucket size is 100. + +The bucket is initially empty, therefore water can be added until it's +full at a rate of 2000 IOPS (the burst rate). Once the bucket is full +we can only add as much water as it leaks, therefore the I/O rate is +reduced to 100 IOPS. If we add less water than it leaks then the +bucket will start to empty, allowing for bursts again. + +Note that since water is leaking from the bucket even during bursts, +it will take a bit more than 60 seconds at 2000 IOPS to fill it +up. After those 60 seconds the bucket will have leaked 60 x 100 = +6000, allowing for 3 more seconds of I/O at 2000 IOPS. + +Also, due to the way the algorithm works, longer burst can be done at +a lower I/O rate, e.g. 1000 IOPS during 120 seconds. diff --git a/qemu/docs/tracing.txt b/qemu/docs/tracing.txt index 7117c5e7d..0bd6b9cf9 100644 --- a/qemu/docs/tracing.txt +++ b/qemu/docs/tracing.txt @@ -157,9 +157,9 @@ performance penalty. Note that regardless of the selected trace backend, events with the "disable" property will be generated with the "nop" backend. -=== Stderr === +=== Log === -The "stderr" backend sends trace events directly to standard error. This +The "log" backend sends trace events directly to standard error. This effectively turns trace events into debug printfs. This is the simplest backend and can be used together with existing code that @@ -172,9 +172,6 @@ source tree. It may not be as powerful as platform-specific or third-party trace backends but it is portable. This is the recommended trace backend unless you have specific needs for more advanced backends. -The "simple" backend currently does not capture string arguments, it simply -records the char* pointer value instead of the string that is pointed to. - === Ftrace === The "ftrace" backend writes trace data to ftrace marker. This effectively @@ -258,11 +255,11 @@ is generated to make use in scripts more convenient. This step can also be performed manually after a build in order to change the binary name in the .stp probes: - scripts/tracetool --dtrace --stap \ - --binary path/to/qemu-binary \ - --target-type system \ - --target-name x86_64 \ - <trace-events >qemu.stp + scripts/tracetool.py --backends=dtrace --format=stap \ + --binary path/to/qemu-binary \ + --target-type system \ + --target-name x86_64 \ + <trace-events >qemu.stp == Trace event properties == @@ -347,3 +344,44 @@ This will immediately call: and will generate the TCG code to call: void trace_foo(uint8_t a1, uint32_t a2); + +=== "vcpu" === + +Identifies events that trace vCPU-specific information. It implicitly adds a +"CPUState*" argument, and extends the tracing print format to show the vCPU +information. If used together with the "tcg" property, it adds a second +"TCGv_env" argument that must point to the per-target global TCG register that +points to the vCPU when guest code is executed (usually the "cpu_env" variable). + +The following example events: + + foo(uint32_t a) "a=%x" + vcpu bar(uint32_t a) "a=%x" + tcg vcpu baz(uint32_t a) "a=%x", "a=%x" + +Can be used as: + + #include "trace-tcg.h" + + CPUArchState *env; + TCGv_ptr cpu_env; + + void some_disassembly_func(...) + { + /* trace emitted at this point */ + trace_foo(0xd1); + /* trace emitted at this point */ + trace_bar(ENV_GET_CPU(env), 0xd2); + /* trace emitted at this point (env) and when guest code is executed (cpu_env) */ + trace_baz_tcg(ENV_GET_CPU(env), cpu_env, 0xd3); + } + +If the translating vCPU has address 0xc1 and code is later executed by vCPU +0xc2, this would be an example output: + + // at guest code translation + foo a=0xd1 + bar cpu=0xc1 a=0xd2 + baz_trans cpu=0xc1 a=0xd3 + // at guest code execution + baz_exec cpu=0xc2 a=0xd3 diff --git a/qemu/docs/virtio-migration.txt b/qemu/docs/virtio-migration.txt new file mode 100644 index 000000000..cf66458b9 --- /dev/null +++ b/qemu/docs/virtio-migration.txt @@ -0,0 +1,106 @@ +Virtio devices and migration +============================ + +Copyright 2015 IBM Corp. + +This work is licensed under the terms of the GNU GPL, version 2 or later. See +the COPYING file in the top-level directory. + +Saving and restoring the state of virtio devices is a bit of a twisty maze, +for several reasons: +- state is distributed between several parts: + - virtio core, for common fields like features, number of queues, ... + - virtio transport (pci, ccw, ...), for the different proxy devices and + transport specific state (msix vectors, indicators, ...) + - virtio device (net, blk, ...), for the different device types and their + state (mac address, request queue, ...) +- most fields are saved via the stream interface; subsequently, subsections + have been added to make cross-version migration possible + +This file attempts to document the current procedure and point out some +caveats. + + +Save state procedure +==================== + +virtio core virtio transport virtio device +----------- ---------------- ------------- + + save() function registered + via register_savevm() +virtio_save() <---------- + ------> save_config() + - save proxy device + - save transport-specific + device fields +- save common device + fields +- save common virtqueue + fields + ------> save_queue() + - save transport-specific + virtqueue fields + ------> save_device() + - save device-specific + fields +- save subsections + - device endianness, + if changed from + default endianness + - 64 bit features, if + any high feature bit + is set + - virtio-1 virtqueue + fields, if VERSION_1 + is set + + +Load state procedure +==================== + +virtio core virtio transport virtio device +----------- ---------------- ------------- + + load() function registered + via register_savevm() +virtio_load() <---------- + ------> load_config() + - load proxy device + - load transport-specific + device fields +- load common device + fields +- load common virtqueue + fields + ------> load_queue() + - load transport-specific + virtqueue fields +- notify guest + ------> load_device() + - load device-specific + fields +- load subsections + - device endianness + - 64 bit features + - virtio-1 virtqueue + fields +- sanitize endianness +- sanitize features +- virtqueue index sanity + check + - feature-dependent setup + + +Implications of this setup +========================== + +Devices need to be careful in their state processing during load: The +load_device() procedure is invoked by the core before subsections have +been loaded. Any code that depends on information transmitted in subsections +therefore has to be invoked in the device's load() function _after_ +virtio_load() returned (like e.g. code depending on features). + +Any extension of the state being migrated should be done in subsections +added to the core for compatibility reasons. If transport or device specific +state is added, core needs to invoke a callback from the new subsection. diff --git a/qemu/docs/win32-qemu-event.promela b/qemu/docs/win32-qemu-event.promela new file mode 100644 index 000000000..c446a7155 --- /dev/null +++ b/qemu/docs/win32-qemu-event.promela @@ -0,0 +1,98 @@ +/* + * This model describes the implementation of QemuEvent in + * util/qemu-thread-win32.c. + * + * Author: Paolo Bonzini <pbonzini@redhat.com> + * + * This file is in the public domain. If you really want a license, + * the WTFPL will do. + * + * To verify it: + * spin -a docs/event.promela + * gcc -O2 pan.c -DSAFETY + * ./a.out + */ + +bool event; +int value; + +/* Primitives for a Win32 event */ +#define RAW_RESET event = false +#define RAW_SET event = true +#define RAW_WAIT do :: event -> break; od + +#if 0 +/* Basic sanity checking: test the Win32 event primitives */ +#define RESET RAW_RESET +#define SET RAW_SET +#define WAIT RAW_WAIT +#else +/* Full model: layer a userspace-only fast path on top of the RAW_* + * primitives. SET/RESET/WAIT have exactly the same semantics as + * RAW_SET/RAW_RESET/RAW_WAIT, but try to avoid invoking them. + */ +#define EV_SET 0 +#define EV_FREE 1 +#define EV_BUSY -1 + +int state = EV_FREE; + +int xchg_result; +#define SET if :: state != EV_SET -> \ + atomic { /* xchg_result=xchg(state, EV_SET) */ \ + xchg_result = state; \ + state = EV_SET; \ + } \ + if :: xchg_result == EV_BUSY -> RAW_SET; \ + :: else -> skip; \ + fi; \ + :: else -> skip; \ + fi + +#define RESET if :: state == EV_SET -> atomic { state = state | EV_FREE; } \ + :: else -> skip; \ + fi + +int tmp1, tmp2; +#define WAIT tmp1 = state; \ + if :: tmp1 != EV_SET -> \ + if :: tmp1 == EV_FREE -> \ + RAW_RESET; \ + atomic { /* tmp2=cas(state, EV_FREE, EV_BUSY) */ \ + tmp2 = state; \ + if :: tmp2 == EV_FREE -> state = EV_BUSY; \ + :: else -> skip; \ + fi; \ + } \ + if :: tmp2 == EV_SET -> tmp1 = EV_SET; \ + :: else -> tmp1 = EV_BUSY; \ + fi; \ + :: else -> skip; \ + fi; \ + assert(tmp1 != EV_FREE); \ + if :: tmp1 == EV_BUSY -> RAW_WAIT; \ + :: else -> skip; \ + fi; \ + :: else -> skip; \ + fi +#endif + +active proctype waiter() +{ + if + :: !value -> + RESET; + if + :: !value -> WAIT; + :: else -> skip; + fi; + :: else -> skip; + fi; + assert(value); +} + +active proctype notifier() +{ + value = true; + SET; +} diff --git a/qemu/docs/writing-qmp-commands.txt b/qemu/docs/writing-qmp-commands.txt index ab1fdd36b..59aa77ae2 100644 --- a/qemu/docs/writing-qmp-commands.txt +++ b/qemu/docs/writing-qmp-commands.txt @@ -122,12 +122,12 @@ There are a few things to be noticed: Now a little hack is needed. As we're still using the old QMP server we need to add the new command to its internal dispatch table. This step won't be required in the near future. Open the qmp-commands.hx file and add the -following in the botton: +following at the bottom: { .name = "hello-world", .args_type = "", - .mhandler.cmd_new = qmp_marshal_input_hello_world, + .mhandler.cmd_new = qmp_marshal_hello_world, }, You're done. Now build qemu, run it as suggested in the "Testing" section, @@ -179,7 +179,7 @@ The last step is to update the qmp-commands.hx file: { .name = "hello-world", .args_type = "message:s?", - .mhandler.cmd_new = qmp_marshal_input_hello_world, + .mhandler.cmd_new = qmp_marshal_hello_world, }, Notice that the "args_type" member got our "message" argument. The character @@ -210,7 +210,7 @@ if you don't see these strings, then something went wrong. === Errors === QMP commands should use the error interface exported by the error.h header -file. Basically, errors are set by calling the error_set() function. +file. Basically, most errors are set by calling the error_setg() function. Let's say we don't accept the string "message" to contain the word "love". If it does contain it, we want the "hello-world" command to return an error: @@ -219,8 +219,7 @@ void qmp_hello_world(bool has_message, const char *message, Error **errp) { if (has_message) { if (strstr(message, "love")) { - error_set(errp, ERROR_CLASS_GENERIC_ERROR, - "the word 'love' is not allowed"); + error_setg(errp, "the word 'love' is not allowed"); return; } printf("%s\n", message); @@ -229,10 +228,8 @@ void qmp_hello_world(bool has_message, const char *message, Error **errp) } } -The first argument to the error_set() function is the Error pointer to pointer, -which is passed to all QMP functions. The second argument is a ErrorClass -value, which should be ERROR_CLASS_GENERIC_ERROR most of the time (more -details about error classes are given below). The third argument is a human +The first argument to the error_setg() function is the Error pointer +to pointer, which is passed to all QMP functions. The next argument is a human description of the error, this is a free-form printf-like string. Let's test the example above. Build qemu, run it as defined in the "Testing" @@ -249,8 +246,9 @@ The QMP server's response should be: } } -As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR. There -are two exceptions to this rule: +As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR +(done by default when using error_setg()). There are two exceptions to +this rule: 1. A non-generic ErrorClass value exists* for the failure you want to report (eg. DeviceNotFound) @@ -259,8 +257,8 @@ are two exceptions to this rule: want to report, hence you have to add a new ErrorClass value so that they can check for it -If the failure you want to report doesn't fall in one of the two cases above, -just report ERROR_CLASS_GENERIC_ERROR. +If the failure you want to report falls into one of the two cases above, +use error_set() with a second argument of an ErrorClass value. * All existing ErrorClass values are defined in the qapi-schema.json file @@ -461,7 +459,7 @@ The last step is to add the correspoding entry in the qmp-commands.hx file: { .name = "query-alarm-clock", .args_type = "", - .mhandler.cmd_new = qmp_marshal_input_query_alarm_clock, + .mhandler.cmd_new = qmp_marshal_query_alarm_clock, }, Time to test the new command. Build qemu, run it as described in the "Testing" @@ -607,7 +605,7 @@ To test this you have to add the corresponding qmp-commands.hx entry: { .name = "query-alarm-methods", .args_type = "", - .mhandler.cmd_new = qmp_marshal_input_query_alarm_methods, + .mhandler.cmd_new = qmp_marshal_query_alarm_methods, }, Now Build qemu, run it as explained in the "Testing" section and try our new |