summaryrefslogtreecommitdiffstats
path: root/src/ceph/doc/dev/rbd-layering.rst
diff options
context:
space:
mode:
Diffstat (limited to 'src/ceph/doc/dev/rbd-layering.rst')
-rw-r--r--src/ceph/doc/dev/rbd-layering.rst281
1 files changed, 0 insertions, 281 deletions
diff --git a/src/ceph/doc/dev/rbd-layering.rst b/src/ceph/doc/dev/rbd-layering.rst
deleted file mode 100644
index e6e224c..0000000
--- a/src/ceph/doc/dev/rbd-layering.rst
+++ /dev/null
@@ -1,281 +0,0 @@
-============
-RBD Layering
-============
-
-RBD layering refers to the creation of copy-on-write clones of block
-devices. This allows for fast image creation, for example to clone a
-golden master image of a virtual machine into a new instance. To
-simplify the semantics, you can only create a clone of a snapshot -
-snapshots are always read-only, so the rest of the image is
-unaffected, and there's no possibility of writing to them
-accidentally.
-
-From a user's perspective, a clone is just like any other rbd image.
-You can take snapshots of them, read/write them, resize them, etc.
-There are no restrictions on clones from a user's viewpoint.
-
-Note: the terms `child` and `parent` below mean an rbd image created
-by cloning, and the rbd image snapshot a child was cloned from.
-
-Command line interface
-----------------------
-
-Before cloning a snapshot, you must mark it as protected, to prevent
-it from being deleted while child images refer to it:
-::
-
- $ rbd snap protect pool/image@snap
-
-Then you can perform the clone:
-::
-
- $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1
-
-You can create a clone with different object sizes from the parent:
-::
-
- $ rbd clone --order 25 pool/parent@snap pool2/child2
-
-To delete the parent, you must first mark it unprotected, which checks
-that there are no children left:
-::
-
- $ rbd snap unprotect pool/image@snap
- Cannot unprotect: Still in use by pool2/image2
- $ rbd children pool/image@snap
- pool2/child1
- pool2/child2
- $ rbd flatten pool2/child1
- $ rbd rm pool2/child2
- $ rbd snap rm pool/image@snap
- Cannot remove a protected snapshot: pool/image@snap
- $ rbd snap unprotect pool/image@snap
-
-Then the snapshot can be deleted like normal:
-::
-
- $ rbd snap rm pool/image@snap
-
-Implementation
---------------
-
-Data Flow
-^^^^^^^^^
-
-In the initial implementation, called 'trivial layering', there will
-be no tracking of which objects exist in a clone. A read that hits a
-non-existent object will attempt to read from the parent snapshot, and
-this will continue recursively until an object exists or an image with
-no parent is found. This is done through the normal read path from
-the parent, so differing object sizes between parents and children
-do not matter.
-
-Before a write to an object is performed, the object is checked for
-existence. If it doesn't exist, a copy-up operation is performed,
-which means reading the relevant range of data from the parent
-snapshot and writing it (plus the original write) to the child
-image. To prevent races with multiple writes trying to copy-up the
-same object, this copy-up operation will include an atomic create. If
-the atomic create fails, the original write is done instead. This
-copy-up operation is implemented as a class method so that extra
-metadata can be stored by it in the future. In trivial layering, the
-copy-up operation copies the entire range needed to the child object
-(that is, the full size of the child object). A future optimization
-could make this copy-up more fine-grained.
-
-Another future optimization could be storing a bitmap of which objects
-actually exist in a child. This would obviate the check for existence
-before each write, and let reads go directly to the parent if needed.
-
-These optimizations are discussed in:
-
-http://marc.info/?l=ceph-devel&m=129867273303846
-
-Parent/Child relationships
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Children store a reference to their parent in their header, as a tuple
-of (pool id, image id, snapshot id). This is enough information to
-open the parent and read from it.
-
-In addition to knowing which parent a given image has, we want to be
-able to tell if a protected snapshot still has children. This is
-accomplished with a new per-pool object, `rbd_children`, which maps
-(parent pool id, parent image id, parent snapshot id) to a list of
-child image ids. This is stored in the same pool as the child image
-because the client creating a clone already has read/write access to
-everything in this pool, but may not have write access to the parent's
-pool. This lets a client with read-only access to one pool clone a
-snapshot from that pool into a pool they have full access to. It
-increases the cost of unprotecting an image, since this needs to check
-for children in every pool, but this is a rare operation. It would
-likely only be done before removing old images, which is already much
-more expensive because it involves deleting every data object in the
-image.
-
-Protection
-^^^^^^^^^^
-
-Internally, protection_state is a field in the header object that
-can be in three states. "protected", "unprotected", and
-"unprotecting". The first two are set as the result of "rbd
-protect/unprotect". The "unprotecting" state is set while the "rbd
-unprotect" command checks for any child images. Only snapshots in the
-"protected" state may be cloned, so the "unprotected" state prevents
-a race like:
-
-1. A: walk through all pools, look for clones, find none
-2. B: create a clone
-3. A: unprotect parent
-4. A: rbd snap rm pool/parent@snap
-
-Resizing
-^^^^^^^^
-
-Resizing an rbd image is like truncating a sparse file. New space is
-treated as zeroes, and shrinking an rbd image deletes the contents
-beyond the old bounds. This means that if you have a 10G image full of
-data, and you resize it down to 5G and then up to 10G again, the last
-5G is treated as zeroes (and any objects that held that data were
-removed when the image was shrunk).
-
-Layering complicates this because the absence of an object no longer
-implies it should be treated as zeroes - if the object is part of a
-clone, it may mean that some data needs to be read from the parent.
-
-To preserve the resizing behavior for clones, we need to keep track of
-which objects could be stored in the parent. We can track this as the
-amount of overlap the child has with the parent, since resizing only
-changes the end of an image. When a child is created, its overlap
-is the size of the parent snapshot. On each subsequent resize, the
-overlap is `min(overlap, new_size)`. That is, shrinking the image
-may shrinks the overlap, but increasing the image's size does not
-change the overlap.
-
-Objects that do not exist past the overlap are treated as zeroes.
-Objects that do not exist before that point fall back to reading
-from the parent.
-
-Since this overlap changes over time, we store it as part of the
-metadata for a snapshot as well.
-
-Renaming
-^^^^^^^^
-
-Currently the rbd header object (that stores all the metadata about an
-image) is named after the name of the image. This makes renaming
-disrupt clients who have the image open (such as children reading from
-a parent). To avoid this, we can name the header object by the
-id of the image, which does not change. That is, the name of the
-header object could be `rbd_header.$id`, where $id is a unique id for
-the image in the pool.
-
-When a client opens an image, all it knows is the name. There is
-already a per-pool `rbd_directory` object that maps image names to
-ids, but if we relied on it to get the id, we could not open any
-images in that pool if that single object was unavailable. To avoid
-this dependency, we can store the id of an image in an object called
-`rbd_id.$image_name`, where $image_name is the name of the image. The
-per-pool `rbd_directory` object is still useful for listing all images
-in a pool, however.
-
-Header changes
---------------
-
-The header needs a few new fields:
-
-* int64_t parent_pool_id
-* string parent_image_id
-* uint64_t parent_snap_id
-* uint64_t overlap (how much of the image may be referring to the parent)
-
-These are stored in a "parent" key, which is only present if the image
-has a parent.
-
-cls_rbd
-^^^^^^^
-
-Some new methods are needed:
-::
-
- /***************** methods on the rbd header *********************/
- /**
- * Sets the parent and overlap keys.
- * Fails if any of these keys exist, since the image already
- * had a parent.
- */
- set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)
-
- /**
- * returns the parent pool id, image id, snap id, and overlap, or -ENOENT
- * if parent_pool_id does not exist or is -1
- */
- get_parent(uint64_t snapid)
-
- /**
- * Removes the parent key
- */
- remove_parent() // after all parent data is copied to the child
-
- /*************** methods on the rbd_children object *****************/
-
- add_child(uint64_t parent_pool_id, string parent_image_id,
- uint64_t parent_snap_id, string image_id);
- remove_child(uint64_t parent_pool_id, string parent_image_id,
- uint64_t parent_snap_id, string image_id);
- /**
- * List ids of a given parent
- */
- get_children(uint64_t parent_pool_id, string parent_image_id,
- uint64_t parent_snap_id, uint64_t max_return,
- string start);
- /**
- * list parent
- */
- get_parents(uint64_t max_return, uint64_t start_pool_id,
- string start_image_id, string start_snap_id);
-
-
- /************ methods on the rbd_id.$image_name object **************/
-
- set_id(string id)
- get_id()
-
- /************** methods on the rbd_directory object *****************/
-
- dir_get_id(string name);
- dir_get_name(string id);
- dir_list(string start_after, uint64_t max_return);
- dir_add_image(string name, string id);
- dir_remove_image(string name, string id);
- dir_rename_image(string src, string dest, string id);
-
-Two existing methods will change if the image supports
-layering:
-::
-
- snapshot_add - stores current overlap and has_parent with
- other snapshot metadata (images that don't have
- layering enabled aren't affected)
-
- set_size - will adjust the parent overlap down as needed.
-
-librbd
-^^^^^^
-
-Opening a child image opens its parent (and this will continue
-recursively as needed). This means that an ImageCtx will contain a
-pointer to the parent image context. Differing object sizes won't
-matter, since reading from the parent will go through the parent
-image context.
-
-Discard will need to change for layered images so that it only
-truncates objects, and does not remove them. If we removed objects, we
-could not tell if we needed to read them from the parent.
-
-A new clone method will be added, which takes the same arguments as
-create except size (size of the parent image is used).
-
-Instead of expanding the rbd_info struct, we will break the metadata
-retrieval into several API calls. Right now, the only users of
-rbd_stat() other than 'rbd info' only use it to retrieve image size.