summaryrefslogtreecommitdiffstats
path: root/src/ceph/doc/dev/osd_internals
diff options
context:
space:
mode:
authorQiaowei Ren <qiaowei.ren@intel.com>2018-01-04 13:43:33 +0800
committerQiaowei Ren <qiaowei.ren@intel.com>2018-01-05 11:59:39 +0800
commit812ff6ca9fcd3e629e49d4328905f33eee8ca3f5 (patch)
tree04ece7b4da00d9d2f98093774594f4057ae561d4 /src/ceph/doc/dev/osd_internals
parent15280273faafb77777eab341909a3f495cf248d9 (diff)
initial code repo
This patch creates initial code repo. For ceph, luminous stable release will be used for base code, and next changes and optimization for ceph will be added to it. For opensds, currently any changes can be upstreamed into original opensds repo (https://github.com/opensds/opensds), and so stor4nfv will directly clone opensds code to deploy stor4nfv environment. And the scripts for deployment based on ceph and opensds will be put into 'ci' directory. Change-Id: I46a32218884c75dda2936337604ff03c554648e4 Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Diffstat (limited to 'src/ceph/doc/dev/osd_internals')
-rw-r--r--src/ceph/doc/dev/osd_internals/backfill_reservation.rst38
-rw-r--r--src/ceph/doc/dev/osd_internals/erasure_coding.rst82
-rw-r--r--src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst223
-rw-r--r--src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst207
-rw-r--r--src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst33
-rw-r--r--src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst385
-rw-r--r--src/ceph/doc/dev/osd_internals/index.rst10
-rw-r--r--src/ceph/doc/dev/osd_internals/last_epoch_started.rst60
-rw-r--r--src/ceph/doc/dev/osd_internals/log_based_pg.rst206
-rw-r--r--src/ceph/doc/dev/osd_internals/map_message_handling.rst131
-rw-r--r--src/ceph/doc/dev/osd_internals/osd_overview.rst106
-rw-r--r--src/ceph/doc/dev/osd_internals/osd_throttles.rst93
-rw-r--r--src/ceph/doc/dev/osd_internals/osd_throttles.txt21
-rw-r--r--src/ceph/doc/dev/osd_internals/pg.rst31
-rw-r--r--src/ceph/doc/dev/osd_internals/pg_removal.rst56
-rw-r--r--src/ceph/doc/dev/osd_internals/pgpool.rst22
-rw-r--r--src/ceph/doc/dev/osd_internals/recovery_reservation.rst75
-rw-r--r--src/ceph/doc/dev/osd_internals/scrub.rst30
-rw-r--r--src/ceph/doc/dev/osd_internals/snaps.rst128
-rw-r--r--src/ceph/doc/dev/osd_internals/watch_notify.rst81
-rw-r--r--src/ceph/doc/dev/osd_internals/wbthrottle.rst28
21 files changed, 2046 insertions, 0 deletions
diff --git a/src/ceph/doc/dev/osd_internals/backfill_reservation.rst b/src/ceph/doc/dev/osd_internals/backfill_reservation.rst
new file mode 100644
index 0000000..cf9dab4
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/backfill_reservation.rst
@@ -0,0 +1,38 @@
+====================
+Backfill Reservation
+====================
+
+When a new osd joins a cluster, all pgs containing it must eventually backfill
+to it. If all of these backfills happen simultaneously, it would put excessive
+load on the osd. osd_max_backfills limits the number of outgoing or
+incoming backfills on a single node. The maximum number of outgoing backfills is
+osd_max_backfills. The maximum number of incoming backfills is
+osd_max_backfills. Therefore there can be a maximum of osd_max_backfills * 2
+simultaneous backfills on one osd.
+
+Each OSDService now has two AsyncReserver instances: one for backfills going
+from the osd (local_reserver) and one for backfills going to the osd
+(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a queue
+by priority of waiting items and a set of current reservation holders. When a
+slot frees up, the AsyncReserver queues the Context* associated with the next
+item on the highest priority queue in the finisher provided to the constructor.
+
+For a primary to initiate a backfill, it must first obtain a reservation from
+its own local_reserver. Then, it must obtain a reservation from the backfill
+target's remote_reserver via a MBackfillReserve message. This process is
+managed by substates of Active and ReplicaActive (see the substates of Active
+in PG.h). The reservations are dropped either on the Backfilled event, which
+is sent on the primary before calling recovery_complete and on the replica on
+receipt of the BackfillComplete progress message), or upon leaving Active or
+ReplicaActive.
+
+It's important that we always grab the local reservation before the remote
+reservation in order to prevent a circular dependency.
+
+We want to minimize the risk of data loss by prioritizing the order in
+which PGs are recovered. The highest priority is log based recovery
+(OSD_RECOVERY_PRIORITY_MAX) since this must always complete before
+backfill can start. The next priority is backfill of degraded PGs and
+is a function of the degradation. A backfill for a PG missing two
+replicas will have a priority higher than a backfill for a PG missing
+one replica. The lowest priority is backfill of non-degraded PGs.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding.rst b/src/ceph/doc/dev/osd_internals/erasure_coding.rst
new file mode 100644
index 0000000..7263cc3
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding.rst
@@ -0,0 +1,82 @@
+==============================
+Erasure Coded Placement Groups
+==============================
+
+Glossary
+--------
+
+*chunk*
+ when the encoding function is called, it returns chunks of the same
+ size. Data chunks which can be concatenated to reconstruct the original
+ object and coding chunks which can be used to rebuild a lost chunk.
+
+*chunk rank*
+ the index of a chunk when returned by the encoding function. The
+ rank of the first chunk is 0, the rank of the second chunk is 1
+ etc.
+
+*stripe*
+ when an object is too large to be encoded with a single call,
+ each set of chunks created by a call to the encoding function is
+ called a stripe.
+
+*shard|strip*
+ an ordered sequence of chunks of the same rank from the same
+ object. For a given placement group, each OSD contains shards of
+ the same rank. When dealing with objects that are encoded with a
+ single operation, *chunk* is sometime used instead of *shard*
+ because the shard is made of a single chunk. The *chunks* in a
+ *shard* are ordered according to the rank of the stripe they belong
+ to.
+
+*K*
+ the number of data *chunks*, i.e. the number of *chunks* in which the
+ original object is divided. For instance if *K* = 2 a 10KB object
+ will be divided into *K* objects of 5KB each.
+
+*M*
+ the number of coding *chunks*, i.e. the number of additional *chunks*
+ computed by the encoding functions. If there are 2 coding *chunks*,
+ it means 2 OSDs can be out without losing data.
+
+*N*
+ the number of data *chunks* plus the number of coding *chunks*,
+ i.e. *K+M*.
+
+*rate*
+ the proportion of the *chunks* that contains useful information, i.e. *K/N*.
+ For instance, for *K* = 9 and *M* = 3 (i.e. *K+M* = *N* = 12) the rate is
+ *K* = 9 / *N* = 12 = 0.75, i.e. 75% of the chunks contain useful information.
+
+The definitions are illustrated as follows (PG stands for placement group):
+::
+
+ OSD 40 OSD 33
+ +-------------------------+ +-------------------------+
+ | shard 0 - PG 10 | | shard 1 - PG 10 |
+ |+------ object O -------+| |+------ object O -------+|
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 0 ||| stripe 0 ||| ||| stripe 0 |||
+ ||+---------------------+|| ||+---------------------+||
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 1 ||| stripe 1 ||| ||| stripe 1 |||
+ ||+---------------------+|| ||+---------------------+||
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 2 ||| stripe 2 ||| ||| stripe 2 |||
+ ||+---------------------+|| ||+---------------------+||
+ |+-----------------------+| |+-----------------------+|
+ | ... | | ... |
+ +-------------------------+ +-------------------------+
+
+Table of content
+----------------
+
+.. toctree::
+ :maxdepth: 1
+
+ Developer notes <erasure_coding/developer_notes>
+ Jerasure plugin <erasure_coding/jerasure>
+ High level design document <erasure_coding/ecbackend>
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst
new file mode 100644
index 0000000..770ff4a
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst
@@ -0,0 +1,223 @@
+============================
+Erasure Code developer notes
+============================
+
+Introduction
+------------
+
+Each chapter of this document explains an aspect of the implementation
+of the erasure code within Ceph. It is mostly based on examples being
+explained to demonstrate how things work.
+
+Reading and writing encoded chunks from and to OSDs
+---------------------------------------------------
+
+An erasure coded pool stores each object as K+M chunks. It is divided
+into K data chunks and M coding chunks. The pool is configured to have
+a size of K+M so that each chunk is stored in an OSD in the acting
+set. The rank of the chunk is stored as an attribute of the object.
+
+Let's say an erasure coded pool is created to use five OSDs ( K+M =
+5 ) and sustain the loss of two of them ( M = 2 ).
+
+When the object *NYAN* containing *ABCDEFGHI* is written to it, the
+erasure encoding function splits the content in three data chunks,
+simply by dividing the content in three : the first contains *ABC*,
+the second *DEF* and the last *GHI*. The content will be padded if the
+content length is not a multiple of K. The function also creates two
+coding chunks : the fourth with *YXY* and the fifth with *GQC*. Each
+chunk is stored in an OSD in the acting set. The chunks are stored in
+objects that have the same name ( *NYAN* ) but reside on different
+OSDs. The order in which the chunks were created must be preserved and
+is stored as an attribute of the object ( shard_t ), in addition to its
+name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4*
+contains *YXY* and is stored on *OSD3*.
+
+::
+
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ |
+ |
+ v
+ +------+------+
+ +---------------+ encode(3,2) +-----------+
+ | +--+--+---+---+ |
+ | | | | |
+ | +-------+ | +-----+ |
+ | | | | |
+ +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
+ name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 | | 5 |
+ +------+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY | | QGC |
+ +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
+ | | | | |
+ | | | | |
+ | | +--+---+ | |
+ | | | OSD1 | | |
+ | | +------+ | |
+ | | +------+ | |
+ | +------>| OSD2 | | |
+ | +------+ | |
+ | +------+ | |
+ | | OSD3 |<----+ |
+ | +------+ |
+ | +------+ |
+ | | OSD4 |<--------------+
+ | +------+
+ | +------+
+ +----------------->| OSD5 |
+ +------+
+
+
+
+
+When the object *NYAN* is read from the erasure coded pool, the
+decoding function reads three chunks : chunk *1* containing *ABC*,
+chunk *3* containing *GHI* and chunk *4* containing *YXY* and rebuild
+the original content of the object *ABCDEFGHI*. The decoding function
+is informed that the chunks *2* and *5* are missing ( they are called
+*erasures* ). The chunk *5* could not be read because the *OSD4* is
+*out*.
+
+The decoding function could be called as soon as three chunks are
+read : *OSD2* was the slowest and its chunk does not need to be taken into
+account. This optimization is not implemented in Firefly.
+
+::
+
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ ^
+ |
+ |
+ +------+------+
+ | decode(3,2) |
+ | erasures 2,5|
+ +-------------->| |
+ | +-------------+
+ | ^ ^
+ | | +-----+
+ | | |
+ +--+---+ +------+ +--+---+ +--+---+
+ name | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 |
+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY |
+ +--+---+ +--+---+ +--+---+ +--+---+
+ ^ . ^ ^
+ | TOO . | |
+ | SLOW . +--+---+ |
+ | ^ | OSD1 | |
+ | | +------+ |
+ | | +------+ |
+ | +-------| OSD2 | |
+ | +------+ |
+ | +------+ |
+ | | OSD3 |-----+
+ | +------+
+ | +------+
+ | | OSD4 | OUT
+ | +------+
+ | +------+
+ +------------------| OSD5 |
+ +------+
+
+
+Erasure code library
+--------------------
+
+Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
+with parameters K+M, object O is encoded by dividing it into chunks O1,
+O2, ... OM and computing coding chunks P1, P2, ... PK. Any K chunks
+out of the available K+M chunks can be used to obtain the original
+object. If data chunk O2 or coding chunk P2 are lost, they can be
+repaired using any K chunks out of the K+M chunks. If more than M
+chunks are lost, it is not possible to recover the object.
+
+Reading the original content of object O can be a simple
+concatenation of O1, O2, ... OM, because the plugins are using
+`systematic codes
+<http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks
+must be given to the erasure code library *decode* method to retrieve
+the content of the object.
+
+Performance depend on the parameters to the encoding functions and
+is also influenced by the packet sizes used when calling the encoding
+functions ( for Cauchy or Liberation for instance ): smaller packets
+means more calls and more overhead.
+
+Although Reed-Solomon is provided as a default, Ceph uses it via an
+`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to
+allow each pool to choose the plugin that implements it using
+key=value pairs stored in an `erasure code profile`_.
+
+.. _erasure code profile: ../../../erasure-coded-pool
+
+::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ crush-failure-domain=osd
+ $ ceph osd erasure-code-profile get myprofile
+ directory=/usr/lib/ceph/erasure-code
+ k=2
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+ crush-failure-domain=osd
+ $ ceph osd pool create ecpool 12 12 erasure myprofile
+
+The *plugin* is dynamically loaded from *directory* and expected to
+implement the *int __erasure_code_init(char *plugin_name, char *directory)* function
+which is responsible for registering an object derived from *ErasureCodePlugin*
+in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads:
+
+::
+
+ ErasureCodePluginRegistry &instance =
+ ErasureCodePluginRegistry::instance();
+ instance.add(plugin_name, new ErasureCodePluginExample());
+
+The *ErasureCodePlugin* derived object must provide a factory method
+from which the concrete implementation of the *ErasureCodeInterface*
+object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads:
+
+::
+
+ virtual int factory(const map<std::string,std::string> &parameters,
+ ErasureCodeInterfaceRef *erasure_code) {
+ *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
+ return 0;
+ }
+
+The *parameters* argument is the list of *key=value* pairs that were
+set in the erasure code profile, before the pool was created.
+
+::
+
+ ceph osd erasure-code-profile set myprofile \
+ directory=<dir> \ # mandatory
+ plugin=jerasure \ # mandatory
+ m=10 \ # optional and plugin dependant
+ k=3 \ # optional and plugin dependant
+ technique=reed_sol_van \ # optional and plugin dependant
+
+Notes
+-----
+
+If the objects are large, it may be impractical to encode and decode
+them in memory. However, when using *RBD* a 1TB device is divided in
+many individual 4MB objects and *RGW* does the same.
+
+Encoding and decoding is implemented in the OSD. Although it could be
+implemented client side for read write, the OSD must be able to encode
+and decode on its own when scrubbing.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst
new file mode 100644
index 0000000..624ec21
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst
@@ -0,0 +1,207 @@
+=================================
+ECBackend Implementation Strategy
+=================================
+
+Misc initial design notes
+=========================
+
+The initial (and still true for ec pools without the hacky ec
+overwrites debug flag enabled) design for ec pools restricted
+EC pools to operations which can be easily rolled back:
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+ including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+ requires that we retain the deleted object until all replicas have
+ persisted the deletion event. ErasureCoded backend will therefore
+ need to store objects with the version at which they were created
+ included in the key provided to the filestore. Old versions of an
+ object can be pruned when all replicas have committed up to the log
+ event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+ to be set or removed, we can roll back these operations locally.
+
+Log entries contain a structure explaining how to locally undo the
+operation represented by the operation
+(see osd_types.h:TransactionInfo::LocalRollBack).
+
+PGTemp and Crush
+----------------
+
+Primaries are able to request a temp acting set mapping in order to
+allow an up-to-date OSD to serve requests while a new primary is
+backfilled (and for other reasons). An erasure coded pg needs to be
+able to designate a primary for these reasons without putting it in
+the first position of the acting set. It also needs to be able to
+leave holes in the requested acting set.
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+ cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+ a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+ and that all elements of acting are valid. This needs to be cleaned
+ up since the acting set may contain holes.
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an OSD which happens already to contain an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<shard_t, pg_t> in order to distinguish different pg
+chunks on a single OSD.
+
+Because the mapping of object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the objectstore must
+include the chunk id in the object key.
+
+Core changes:
+
+- The objectstore `ghobject_t needs to also include a chunk id
+ <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
+ tuple<hobject_t, gen_t, shard_t>.
+- coll_t needs to include a shard_t.
+- The OSD pg_map and similar pg mappings need to work in terms of a
+ spg_t (essentially
+ pair<pg_t, shard_t>). Similarly, pg->pg messages need to include
+ a shard_t
+- For client->PG messages, the OSD will need a way to know which PG
+ chunk should get the message since the OSD may contain both a
+ primary and non-primary chunk for the same pg
+
+Object Classes
+--------------
+
+Reads from object classes will return ENOTSUP on ec pools by invoking
+a special SYNC read.
+
+Scrub
+-----
+
+The main catch, however, for ec pools is that sending a crc32 of the
+stored chunk on a replica isn't particularly helpful since the chunks
+on different replicas presumably store different data. Because we
+don't support overwrites except via DELETE, however, we have the
+option of maintaining a crc32 on each chunk through each append.
+Thus, each replica instead simply computes a crc32 of its own stored
+chunk and compares it with the locally stored checksum. The replica
+then reports to the primary whether the checksums match.
+
+With overwrites, all scrubs are disabled for now until we work out
+what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).
+
+Crush
+-----
+
+If crush is unable to generate a replacement for a down member of an
+acting set, the acting set should have a hole at that position rather
+than shifting the other elements of the acting set out of position.
+
+=========
+ECBackend
+=========
+
+MAIN OPERATION OVERVIEW
+=======================
+
+A RADOS put operation can span
+multiple stripes of a single object. There must be code that
+tessellates the application level write into a set of per-stripe write
+operations -- some whole-stripes and up to two partial
+stripes. Without loss of generality, for the remainder of this
+document we will focus exclusively on writing a single stripe (whole
+or partial). We will use the symbol "W" to represent the number of
+blocks within a stripe that are being written, i.e., W <= K.
+
+There are three data flows for handling a write into an EC stripe. The
+choice of which of the three data flows to choose is based on the size
+of the write operation and the arithmetic properties of the selected
+parity-generation algorithm.
+
+(1) whole stripe is written/overwritten
+(2) a read-modify-write operation is performed.
+
+WHOLE STRIPE WRITE
+------------------
+
+This is the simple case, and is already performed in the existing code
+(for appends, that is). The primary receives all of the data for the
+stripe in the RADOS request, computes the appropriate parity blocks
+and send the data and parity blocks to their destination shards which
+write them. This is essentially the current EC code.
+
+READ-MODIFY-WRITE
+-----------------
+
+The primary determines which of the K-W blocks are to be unmodified,
+and reads them from the shards. Once all of the data is received it is
+combined with the received new data and new parity blocks are
+computed. The modified blocks are sent to their respective shards and
+written. The RADOS operation is acknowledged.
+
+OSD Object Write and Consistency
+--------------------------------
+
+Regardless of the algorithm chosen above, writing of the data is a two
+phase process: commit and rollforward. The primary sends the log
+entries with the operation described (see
+osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack).
+In all cases, the "commit" is performed in place, possibly leaving some
+information required for a rollback in a write-aside object. The
+rollforward phase occurs once all acting set replicas have committed
+the commit (sorry, overloaded term) and removes the rollback information.
+
+In the case of overwrites of exsting stripes, the rollback information
+has the form of a sparse object containing the old values of the
+overwritten extents populated using clone_range. This is essentially
+a place-holder implementation, in real life, bluestore will have an
+efficient primitive for this.
+
+The rollforward part can be delayed since we report the operation as
+committed once all replicas have committed. Currently, whenever we
+send a write, we also indicate that all previously committed
+operations should be rolled forward (see
+ECBackend::try_reads_to_commit). If there aren't any in the pipeline
+when we arrive at the waiting_rollforward queue, we start a dummy
+write to move things along (see the Pipeline section later on and
+ECBackend::try_finish_rmw).
+
+ExtentCache
+-----------
+
+It's pretty important to be able to pipeline writes on the same
+object. For this reason, there is a cache of extents written by
+cacheable operations. Each extent remains pinned until the operations
+referring to it are committed. The pipeline prevents rmw operations
+from running until uncacheable transactions (clones, etc) are flushed
+from the pipeline.
+
+See ExtentCache.h for a detailed explanation of how the cache
+states correspond to the higher level invariants about the conditions
+under which cuncurrent operations can refer to the same object.
+
+Pipeline
+--------
+
+Reading src/osd/ExtentCache.h should have given a good idea of how
+operations might overlap. There are several states involved in
+processing a write operation and an important invariant which
+isn't enforced by PrimaryLogPG at a higher level which need to be
+managed by ECBackend. The important invariant is that we can't
+have uncacheable and rmw operations running at the same time
+on the same object. For simplicity, we simply enforce that any
+operation which contains an rmw operation must wait until
+all in-progress uncacheable operations complete.
+
+There are improvements to be made here in the future.
+
+For more details, see ECBackend::waiting_* and
+ECBackend::try_<from>_to_<to>.
+
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst
new file mode 100644
index 0000000..27669a0
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst
@@ -0,0 +1,33 @@
+===============
+jerasure plugin
+===============
+
+Introduction
+------------
+
+The parameters interpreted by the jerasure plugin are:
+
+::
+
+ ceph osd erasure-code-profile set myprofile \
+ directory=<dir> \ # plugin directory absolute path
+ plugin=jerasure \ # plugin name (only jerasure)
+ k=<k> \ # data chunks (default 2)
+ m=<m> \ # coding chunks (default 2)
+ technique=<technique> \ # coding technique
+
+The coding techniques can be chosen among *reed_sol_van*,
+*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*,
+*blaum_roth* and *liber8tion*.
+
+The *src/erasure-code/jerasure* directory contains the
+implementation. It is a wrapper around the code found at
+`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_
+and `https://github.com/ceph/gf-complete
+<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable
+version in *.gitmodules*. These repositories are copies of the
+upstream repositories `http://jerasure.org/jerasure/jerasure
+<http://jerasure.org/jerasure/jerasure>`_ and
+`http://jerasure.org/jerasure/gf-complete
+<http://jerasure.org/jerasure/gf-complete>`_ . The difference
+between the two, if any, should match pull requests against upstream.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst
new file mode 100644
index 0000000..52f98e8
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst
@@ -0,0 +1,385 @@
+:orphan:
+
+=================================
+Proposed Next Steps for ECBackend
+=================================
+
+PARITY-DELTA-WRITE
+------------------
+
+RMW operations current require 4 network hops (2 round trips). In
+principle, for some codes, we can reduce this to 3 by sending the
+update to the replicas holding the data blocks and having them
+compute a delta to forward onto the parity blocks.
+
+The primary reads the current values of the "W" blocks and then uses
+the new values of the "W" blocks to compute parity-deltas for each of
+the parity blocks. The W blocks and the parity delta-blocks are sent
+to their respective shards.
+
+The choice of whether to use a read-modify-write or a
+parity-delta-write is complex policy issue that is TBD in the details
+and is likely to be heavily dependant on the computational costs
+associated with a parity-delta vs. a regular parity-generation
+operation. However, it is believed that the parity-delta scheme is
+likely to be the preferred choice, when available.
+
+The internal interface to the erasure coding library plug-ins needs to
+be extended to support the ability to query if parity-delta
+computation is possible for a selected algorithm as well as an
+interface to the actual parity-delta computation algorithm when
+available.
+
+Stripe Cache
+------------
+
+It may be a good idea to extend the current ExtentCache usage to
+cache some data past when the pinning operation releases it.
+One application pattern that is important to optimize is the small
+block sequential write operation (think of the journal of a journaling
+file system or a database transaction log). Regardless of the chosen
+redundancy algorithm, it is advantageous for the primary to
+retain/buffer recently read/written portions of a stripe in order to
+reduce network traffic. The dynamic contents of this cache may be used
+in the determination of whether a read-modify-write or a
+parity-delta-write is performed. The sizing of this cache is TBD, but
+we should plan on allowing at least a few full stripes per active
+client. Limiting the cache occupancy on a per-client basis will reduce
+the noisy neighbor problem.
+
+Recovery and Rollback Details
+=============================
+
+Implementing a Rollback-able Prepare Operation
+----------------------------------------------
+
+The prepare operation is implemented at each OSD through a simulation
+of a versioning or copy-on-write capability for modifying a portion of
+an object.
+
+When a prepare operation is performed, the new data is written into a
+temporary object. The PG log for the
+operation will contain a reference to the temporary object so that it
+can be located for recovery purposes as well as a record of all of the
+shards which are involved in the operation.
+
+In order to avoid fragmentation (and hence, future read performance),
+creation of the temporary object needs special attention. The name of
+the temporary object affects its location within the KV store. Right
+now its unclear whether it's desirable for the name to locate near the
+base object or whether a separate subset of keyspace should be used
+for temporary objects. Sam believes that colocation with the base
+object is preferred (he suggests using the generation counter of the
+ghobject for temporaries). Whereas Allen believes that using a
+separate subset of keyspace is desirable since these keys are
+ephemeral and we don't want to actually colocate them with the base
+object keys. Perhaps some modeling here can help resolve this
+issue. The data of the temporary object wants to be located as close
+to the data of the base object as possible. This may be best performed
+by adding a new ObjectStore creation primitive that takes the base
+object as an addtional parameter that is a hint to the allocator.
+
+Sam: I think that the short lived thing may be a red herring. We'll
+be updating the donor and primary objects atomically, so it seems like
+we'd want them adjacent in the key space, regardless of the donor's
+lifecycle.
+
+The apply operation moves the data from the temporary object into the
+correct position within the base object and deletes the associated
+temporary object. This operation is done using a specialized
+ObjectStore primitive. In the current ObjectStore interface, this can
+be done using the clonerange function followed by a delete, but can be
+done more efficiently with a specialized move primitive.
+Implementation of the specialized primitive on FileStore can be done
+by copying the data. Some file systems have extensions that might also
+be able to implement this operation (like a defrag API that swaps
+chunks between files). It is expected that NewStore will be able to
+support this efficiently and natively (It has been noted that this
+sequence requires that temporary object allocations, which tend to be
+small, be efficiently converted into blocks for main objects and that
+blocks that were formerly inside of main objects must be reusable with
+minimal overhead)
+
+The prepare and apply operations can be separated arbitrarily in
+time. If a read operation accesses an object that has been altered by
+a prepare operation (but without a corresponding apply operation) it
+must return the data after the prepare operation. This is done by
+creating an in-memory database of objects which have had a prepare
+operation without a corresponding apply operation. All read operations
+must consult this in-memory data structure in order to get the correct
+data. It should explicitly recognized that it is likely that there
+will be multiple prepare operations against a single base object and
+the code must handle this case correctly. This code is implemented as
+a layer between ObjectStore and all existing readers. Annoyingly,
+we'll want to trash this state when the interval changes, so the first
+thing that needs to happen after activation is that the primary and
+replicas apply up to last_update so that the empty cache will be
+correct.
+
+During peering, it is now obvious that an unapplied prepare operation
+can easily be rolled back simply by deleting the associated temporary
+object and removing that entry from the in-memory data structure.
+
+Partial Application Peering/Recovery modifications
+--------------------------------------------------
+
+Some writes will be small enough to not require updating all of the
+shards holding data blocks. For write amplification minization
+reasons, it would be best to avoid writing to those shards at all,
+and delay even sending the log entries until the next write which
+actually hits that shard.
+
+The delaying (buffering) of the transmission of the prepare and apply
+operations for witnessing OSDs creates new situations that peering
+must handle. In particular the logic for determining the authoritative
+last_update value (and hence the selection of the OSD which has the
+authoritative log) must be modified to account for the valid but
+missing (i.e., delayed/buffered) pglog entries to which the
+authoritative OSD was only a witness to.
+
+Because a partial write might complete without persisting a log entry
+on every replica, we have to do a bit more work to determine an
+authoritative last_update. The constraint (as with a replicated PG)
+is that last_update >= the most recent log entry for which a commit
+was sent to the client (call this actual_last_update). Secondarily,
+we want last_update to be as small as possible since any log entry
+past actual_last_update (we do not apply a log entry until we have
+sent the commit to the client) must be able to be rolled back. Thus,
+the smaller a last_update we choose, the less recovery will need to
+happen (we can always roll back, but rolling a replica forward may
+require an object rebuild). Thus, we will set last_update to 1 before
+the oldest log entry we can prove cannot have been committed. In
+current master, this is simply the last_update of the shortest log
+from that interval (because that log did not persist any entry past
+that point -- a precondition for sending a commit to the client). For
+this design, we must consider the possibility that any log is missing
+at its head log entries in which it did not participate. Thus, we
+must determine the most recent interval in which we went active
+(essentially, this is what find_best_info currently does). We then
+pull the log from each live osd from that interval back to the minimum
+last_update among them. Then, we extend all logs from the
+authoritative interval until each hits an entry in which it should
+have participated, but did not record. The shortest of these extended
+logs must therefore contain any log entry for which we sent a commit
+to the client -- and the last entry gives us our last_update.
+
+Deep scrub support
+------------------
+
+The simple answer here is probably our best bet. EC pools can't use
+the omap namespace at all right now. The simplest solution would be
+to take a prefix of the omap space and pack N M byte L bit checksums
+into each key/value. The prefixing seems like a sensible precaution
+against eventually wanting to store something else in the omap space.
+It seems like any write will need to read at least the blocks
+containing the modified range. However, with a code able to compute
+parity deltas, we may not need to read a whole stripe. Even without
+that, we don't want to have to write to blocks not participating in
+the write. Thus, each shard should store checksums only for itself.
+It seems like you'd be able to store checksums for all shards on the
+parity blocks, but there may not be distinguished parity blocks which
+are modified on all writes (LRC or shec provide two examples). L
+should probably have a fixed number of options (16, 32, 64?) and be
+configurable per-pool at pool creation. N, M should be likewise be
+configurable at pool creation with sensible defaults.
+
+We need to handle online upgrade. I think the right answer is that
+the first overwrite to an object with an append only checksum
+removes the append only checksum and writes in whatever stripe
+checksums actually got written. The next deep scrub then writes
+out the full checksum omap entries.
+
+RADOS Client Acknowledgement Generation Optimization
+====================================================
+
+Now that the recovery scheme is understood, we can discuss the
+generation of of the RADOS operation acknowledgement (ACK) by the
+primary ("sufficient" from above). It is NOT required that the primary
+wait for all shards to complete their respective prepare
+operations. Using our example where the RADOS operations writes only
+"W" chunks of the stripe, the primary will generate and send W+M
+prepare operations (possibly including a send-to-self). The primary
+need only wait for enough shards to be written to ensure recovery of
+the data, Thus after writing W + M chunks you can afford the lost of M
+chunks. Hence the primary can generate the RADOS ACK after W+M-M => W
+of those prepare operations are completed.
+
+Inconsistent object_info_t versions
+===================================
+
+A natural consequence of only writing the blocks which actually
+changed is that we don't want to update the object_info_t of the
+objects which didn't. I actually think it would pose a problem to do
+so: pg ghobject namespaces are generally large, and unless the osd is
+seeing a bunch of overwrites on a small set of objects, I'd expect
+each write to be far enough apart in the backing ghobject_t->data
+mapping to each constitute a random metadata update. Thus, we have to
+accept that not every shard will have the current version in its
+object_info_t. We can't even bound how old the version on a
+particular shard will happen to be. In particular, the primary does
+not necessarily have the current version. One could argue that the
+parity shards would always have the current version, but not every
+code necessarily has designated parity shards which see every write
+(certainly LRC, iirc shec, and even with a more pedestrian code, it
+might be desirable to rotate the shards based on object hash). Even
+if you chose to designate a shard as witnessing all writes, the pg
+might be degraded with that particular shard missing. This is a bit
+tricky, currently reads and writes implicitely return the most recent
+version of the object written. On reads, we'd have to read K shards
+to answer that question. We can get around that by adding a "don't
+tell me the current version" flag. Writes are more problematic: we
+need an object_info from the most recent write in order to form the
+new object_info and log_entry.
+
+A truly terrifying option would be to eliminate version and
+prior_version entirely from the object_info_t. There are a few
+specific purposes it serves:
+
+#. On OSD startup, we prime the missing set by scanning backwards
+ from last_update to last_complete comparing the stored object's
+ object_info_t to the version of most recent log entry.
+#. During backfill, we compare versions between primary and target
+ to avoid some pushes. We use it elsewhere as well
+#. While pushing and pulling objects, we verify the version.
+#. We return it on reads and writes and allow the librados user to
+ assert it atomically on writesto allow the user to deal with write
+ races (used extensively by rbd).
+
+Case (3) isn't actually essential, just convenient. Oh well. (4)
+is more annoying. Writes are easy since we know the version. Reads
+are tricky because we may not need to read from all of the replicas.
+Simplest solution is to add a flag to rados operations to just not
+return the user version on read. We can also just not support the
+user version assert on ec for now (I think? Only user is rgw bucket
+indices iirc, and those will always be on replicated because they use
+omap).
+
+We can avoid (1) by maintaining the missing set explicitely. It's
+already possible for there to be a missing object without a
+corresponding log entry (Consider the case where the most recent write
+is to an object which has not been updated in weeks. If that write
+becomes divergent, the written object needs to be marked missing based
+on the prior_version which is not in the log.) THe PGLog already has
+a way of handling those edge cases (see divergent_priors). We'd
+simply expand that to contain the entire missing set and maintain it
+atomically with the log and the objects. This isn't really an
+unreasonable option, the addiitonal keys would be fewer than the
+existing log keys + divergent_priors and aren't updated in the fast
+write path anyway.
+
+The second case is a bit trickier. It's really an optimization for
+the case where a pg became not in the acting set long enough for the
+logs to no longer overlap but not long enough for the PG to have
+healed and removed the old copy. Unfortunately, this describes the
+case where a node was taken down for maintenance with noout set. It's
+probably not acceptable to re-backfill the whole OSD in such a case,
+so we need to be able to quickly determine whether a particular shard
+is up to date given a valid acting set of other shards.
+
+Let ordinary writes which do not change the object size not touch the
+object_info at all. That means that the object_info version won't
+match the pg log entry version. Include in the pg_log_entry_t the
+current object_info version as well as which shards participated (as
+mentioned above). In addition to the object_info_t attr, record on
+each shard s a vector recording for each other shard s' the most
+recent write which spanned both s and s'. Operationally, we maintain
+an attr on each shard containing that vector. A write touching S
+updates the version stamp entry for each shard in S on each shard in
+S's attribute (and leaves the rest alone). If we have a valid acting
+set during backfill, we must have a witness of every write which
+completed -- so taking the max of each entry over all of the acting
+set shards must give us the current version for each shard. During
+recovery, we set the attribute on the recovery target to that max
+vector (Question: with LRC, we may not need to touch much of the
+acting set to recover a particular shard -- can we just use the max of
+the shards we used to recovery, or do we need to grab the version
+vector from the rest of the acting set as well? I'm not sure, not a
+big deal anyway, I think).
+
+The above lets us perform blind writes without knowing the current
+object version (log entry version, that is) while still allowing us to
+avoid backfilling up to date objects. The only catch is that our
+backfill scans will can all replicas, not just the primary and the
+backfill targets.
+
+It would be worth adding into scrub the ability to check the
+consistency of the gathered version vectors -- probably by just
+taking 3 random valid subsets and verifying that they generate
+the same authoritative version vector.
+
+Implementation Strategy
+=======================
+
+It goes without saying that it would be unwise to attempt to do all of
+this in one massive PR. It's also not a good idea to merge code which
+isn't being tested. To that end, it's worth thinking a bit about
+which bits can be tested on their own (perhaps with a bit of temporary
+scaffolding).
+
+We can implement the overwrite friendly checksumming scheme easily
+enough with the current implementation. We'll want to enable it on a
+per-pool basis (probably using a flag which we'll later repurpose for
+actual overwrite support). We can enable it in some of the ec
+thrashing tests in the suite. We can also add a simple test
+validating the behavior of turning it on for an existing ec pool
+(later, we'll want to be able to convert append-only ec pools to
+overwrite ec pools, so that test will simply be expanded as we go).
+The flag should be gated by the experimental feature flag since we
+won't want to support this as a valid configuration -- testing only.
+We need to upgrade append only ones in place during deep scrub.
+
+Similarly, we can implement the unstable extent cache with the current
+implementation, it even lets us cut out the readable ack the replicas
+send to the primary after the commit which lets it release the lock.
+Same deal, implement, gate with experimental flag, add to some of the
+automated tests. I don't really see a reason not to use the same flag
+as above.
+
+We can certainly implement the move-range primitive with unit tests
+before there are any users. Adding coverage to the existing
+objectstore tests would suffice here.
+
+Explicit missing set can be implemented now, same deal as above --
+might as well even use the same feature bit.
+
+The TPC protocol outlined above can actually be implemented an append
+only EC pool. Same deal as above, can even use the same feature bit.
+
+The RADOS flag to suppress the read op user version return can be
+implemented immediately. Mostly just needs unit tests.
+
+The version vector problem is an interesting one. For append only EC
+pools, it would be pointless since all writes increase the size and
+therefore update the object_info. We could do it for replicated pools
+though. It's a bit silly since all "shards" see all writes, but it
+would still let us implement and partially test the augmented backfill
+code as well as the extra pg log entry fields -- this depends on the
+explicit pg log entry branch having already merged. It's not entirely
+clear to me that this one is worth doing seperately. It's enough code
+that I'd really prefer to get it done independently, but it's also a
+fair amount of scaffolding that will be later discarded.
+
+PGLog entries need to be able to record the participants and log
+comparison needs to be modified to extend logs with entries they
+wouldn't have witnessed. This logic should be abstracted behind
+PGLog so it can be unittested -- that would let us test it somewhat
+before the actual ec overwrites code merges.
+
+Whatever needs to happen to the ec plugin interface can probably be
+done independently of the rest of this (pending resolution of
+questions below).
+
+The actual nuts and bolts of performing the ec overwrite it seems to
+me can't be productively tested (and therefore implemented) until the
+above are complete, so best to get all of the supporting code in
+first.
+
+Open Questions
+==============
+
+Is there a code we should be using that would let us compute a parity
+delta without rereading and reencoding the full stripe? If so, is it
+the kind of thing we need to design for now, or can it be reasonably
+put off?
+
+What needs to happen to the EC plugin interface?
diff --git a/src/ceph/doc/dev/osd_internals/index.rst b/src/ceph/doc/dev/osd_internals/index.rst
new file mode 100644
index 0000000..7e82914
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/index.rst
@@ -0,0 +1,10 @@
+==============================
+OSD developer documentation
+==============================
+
+.. rubric:: Contents
+
+.. toctree::
+ :glob:
+
+ *
diff --git a/src/ceph/doc/dev/osd_internals/last_epoch_started.rst b/src/ceph/doc/dev/osd_internals/last_epoch_started.rst
new file mode 100644
index 0000000..9978bd3
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/last_epoch_started.rst
@@ -0,0 +1,60 @@
+======================
+last_epoch_started
+======================
+
+info.last_epoch_started records an activation epoch e for interval i
+such that all writes commited in i or earlier are reflected in the
+local info/log and no writes after i are reflected in the local
+info/log. Since no committed write is ever divergent, even if we
+get an authoritative log/info with an older info.last_epoch_started,
+we can leave our info.last_epoch_started alone since no writes could
+have commited in any intervening interval (See PG::proc_master_log).
+
+info.history.last_epoch_started records a lower bound on the most
+recent interval in which the pg as a whole went active and accepted
+writes. On a particular osd, it is also an upper bound on the
+activation epoch of intervals in which writes in the local pg log
+occurred (we update it before accepting writes). Because all
+committed writes are committed by all acting set osds, any
+non-divergent writes ensure that history.last_epoch_started was
+recorded by all acting set members in the interval. Once peering has
+queried one osd from each interval back to some seen
+history.last_epoch_started, it follows that no interval after the max
+history.last_epoch_started can have reported writes as committed
+(since we record it before recording client writes in an interval).
+Thus, the minimum last_update across all infos with
+info.last_epoch_started >= MAX(history.last_epoch_started) must be an
+upper bound on writes reported as committed to the client.
+
+We update info.last_epoch_started with the intial activation message,
+but we only update history.last_epoch_started after the new
+info.last_epoch_started is persisted (possibly along with the first
+write). This ensures that we do not require an osd with the most
+recent info.last_epoch_started until all acting set osds have recorded
+it.
+
+In find_best_info, we do include info.last_epoch_started values when
+calculating the max_last_epoch_started_found because we want to avoid
+designating a log entry divergent which in a prior interval would have
+been non-divergent since it might have been used to serve a read. In
+activate(), we use the peer's last_epoch_started value as a bound on
+how far back divergent log entries can be found.
+
+However, in a case like
+
+.. code::
+
+ calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
+
+since osd.1 is the only one which recorded info.les=477 while 4,0
+which were the acting set in that interval did not (4 restarted and 0
+did not get the message in time) the pg is marked incomplete when
+either 4 or 0 would have been valid choices. To avoid this, we do not
+consider info.les for incomplete peers when calculating
+min_last_epoch_started_found. It would not have been in the acting
+set, so we must have another osd from that interval anyway (if
+maybe_went_rw). If that osd does not remember that info.les, then we
+cannot have served reads.
diff --git a/src/ceph/doc/dev/osd_internals/log_based_pg.rst b/src/ceph/doc/dev/osd_internals/log_based_pg.rst
new file mode 100644
index 0000000..8b11012
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/log_based_pg.rst
@@ -0,0 +1,206 @@
+============
+Log Based PG
+============
+
+Background
+==========
+
+Why PrimaryLogPG?
+-----------------
+
+Currently, consistency for all ceph pool types is ensured by primary
+log-based replication. This goes for both erasure-coded and
+replicated pools.
+
+Primary log-based replication
+-----------------------------
+
+Reads must return data written by any write which completed (where the
+client could possibly have received a commit message). There are lots
+of ways to handle this, but ceph's architecture makes it easy for
+everyone at any map epoch to know who the primary is. Thus, the easy
+answer is to route all writes for a particular pg through a single
+ordering primary and then out to the replicas. Though we only
+actually need to serialize writes on a single object (and even then,
+the partial ordering only really needs to provide an ordering between
+writes on overlapping regions), we might as well serialize writes on
+the whole PG since it lets us represent the current state of the PG
+using two numbers: the epoch of the map on the primary in which the
+most recent write started (this is a bit stranger than it might seem
+since map distribution itself is asyncronous -- see Peering and the
+concept of interval changes) and an increasing per-pg version number
+-- this is referred to in the code with type eversion_t and stored as
+pg_info_t::last_update. Furthermore, we maintain a log of "recent"
+operations extending back at least far enough to include any
+*unstable* writes (writes which have been started but not committed)
+and objects which aren't uptodate locally (see recovery and
+backfill). In practice, the log will extend much further
+(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not
+clean) because it's handy for quickly performing recovery.
+
+Using this log, as long as we talk to a non-empty subset of the OSDs
+which must have accepted any completed writes from the most recent
+interval in which we accepted writes, we can determine a conservative
+log which must contain any write which has been reported to a client
+as committed. There is some freedom here, we can choose any log entry
+between the oldest head remembered by an element of that set (any
+newer cannot have completed without that log containing it) and the
+newest head remembered (clearly, all writes in the log were started,
+so it's fine for us to remember them) as the new head. This is the
+main point of divergence between replicated pools and ec pools in
+PG/PrimaryLogPG: replicated pools try to choose the newest valid
+option to avoid the client needing to replay those operations and
+instead recover the other copies. EC pools instead try to choose
+the *oldest* option available to them.
+
+The reason for this gets to the heart of the rest of the differences
+in implementation: one copy will not generally be enough to
+reconstruct an ec object. Indeed, there are encodings where some log
+combinations would leave unrecoverable objects (as with a 4+2 encoding
+where 3 of the replicas remember a write, but the other 3 do not -- we
+don't have 3 copies of either version). For this reason, log entries
+representing *unstable* writes (writes not yet committed to the
+client) must be rollbackable using only local information on ec pools.
+Log entries in general may therefore be rollbackable (and in that case,
+via a delayed application or via a set of instructions for rolling
+back an inplace update) or not. Replicated pool log entries are
+never able to be rolled back.
+
+For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
+osd_types.h:pg_log_entry_t, and peering in general.
+
+ReplicatedBackend/ECBackend unification strategy
+================================================
+
+PGBackend
+---------
+
+So, the fundamental difference between replication and erasure coding
+is that replication can do destructive updates while erasure coding
+cannot. It would be really annoying if we needed to have two entire
+implementations of PrimaryLogPG, one for each of the two, if there
+are really only a few fundamental differences:
+
+#. How reads work -- async only, requires remote reads for ec
+#. How writes work -- either restricted to append, or must write aside and do a
+ tpc
+#. Whether we choose the oldest or newest possible head entry during peering
+#. A bit of extra information in the log entry to enable rollback
+
+and so many similarities
+
+#. All of the stats and metadata for objects
+#. The high level locking rules for mixing client IO with recovery and scrub
+#. The high level locking rules for mixing reads and writes without exposing
+ uncommitted state (which might be rolled back or forgotten later)
+#. The process, metadata, and protocol needed to determine the set of osds
+ which partcipated in the most recent interval in which we accepted writes
+#. etc.
+
+Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
+
+#. PGBackend
+#. PGTransaction
+#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
+#. Various bits of the write pipeline disallow some operations based on pool
+ type -- like omap operations, class operation reads, and writes which are
+ not aligned appends (officially, so far) for ec
+#. Misc other kludges here and there
+
+PGBackend and PGTransaction enable abstraction of differences 1, 2,
+and the addition of 4 as needed to the log entries.
+
+The replicated implementation is in ReplicatedBackend.h/cc and doesn't
+require much explanation, I think. More detail on the ECBackend can be
+found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
+
+PGBackend Interface Explanation
+===============================
+
+Note: this is from a design document from before the original firefly
+and is probably out of date w.r.t. some of the method names.
+
+Readable vs Degraded
+--------------------
+
+For a replicated pool, an object is readable iff it is present on
+the primary (at the right version). For an ec pool, we need at least
+M shards present to do a read, and we need it on the primary. For
+this reason, PGBackend needs to include some interfaces for determing
+when recovery is required to serve a read vs a write. This also
+changes the rules for when peering has enough logs to prove that it
+
+Core Changes:
+
+- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
+ | objects to allow the user to make these determinations.
+
+Client Reads
+------------
+
+Reads with the replicated strategy can always be satisfied
+synchronously out of the primary OSD. With an erasure coded strategy,
+the primary will need to request data from some number of replicas in
+order to satisfy a read. PGBackend will therefore need to provide
+seperate objects_read_sync and objects_read_async interfaces where
+the former won't be implemented by the ECBackend.
+
+PGBackend interfaces:
+
+- objects_read_sync
+- objects_read_async
+
+Scrub
+-----
+
+We currently have two scrub modes with different default frequencies:
+
+#. [shallow] scrub: compares the set of objects and metadata, but not
+ the contents
+#. deep scrub: compares the set of objects, metadata, and a crc32 of
+ the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects. The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a crc32 of the contents of
+each object. The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
+
+Most of this can work essentially unchanged with erasure coded PG with
+the caveat that the PGBackend implementation must be in charge of
+actually doing the scan.
+
+
+PGBackend interfaces:
+
+- be_*
+
+Recovery
+--------
+
+The logic for recovering an object depends on the backend. With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in erasure coded pg may be
+unrecoverable without being unfound. The "unfound" concept
+should probably then be renamed to unrecoverable. Also, the
+PGBackend implementation will have to be able to direct the search
+for pg replicas with unrecoverable object chunks and to be able
+to determine whether a particular object is recoverable.
+
+
+Core changes:
+
+- s/unfound/unrecoverable
+
+PGBackend interfaces:
+
+- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
+- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
+- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
+- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
+- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_
diff --git a/src/ceph/doc/dev/osd_internals/map_message_handling.rst b/src/ceph/doc/dev/osd_internals/map_message_handling.rst
new file mode 100644
index 0000000..a5013c2
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/map_message_handling.rst
@@ -0,0 +1,131 @@
+===========================
+Map and PG Message handling
+===========================
+
+Overview
+--------
+The OSD handles routing incoming messages to PGs, creating the PG if necessary
+in some cases.
+
+PG messages generally come in two varieties:
+
+ 1. Peering Messages
+ 2. Ops/SubOps
+
+There are several ways in which a message might be dropped or delayed. It is
+important that the message delaying does not result in a violation of certain
+message ordering requirements on the way to the relevant PG handling logic:
+
+ 1. Ops referring to the same object must not be reordered.
+ 2. Peering messages must not be reordered.
+ 3. Subops must not be reordered.
+
+MOSDMap
+-------
+MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
+OSD must perform several tasks:
+
+ 1. Persist the new maps to the filestore.
+ Several PG operations rely on having access to maps dating back to the last
+ time the PG was clean.
+ 2. Update and persist the superblock.
+ 3. Update OSD state related to the current map.
+ 4. Expose new maps to PG processes via *OSDService*.
+ 5. Remove PGs due to pool removal.
+ 6. Queue dummy events to trigger PG map catchup.
+
+Each PG asynchronously catches up to the currently published map during
+process_peering_events before processing the event. As a result, different
+PGs may have different views as to the "current" map.
+
+One consequence of this design is that messages containing submessages from
+multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
+with the PG's epoch as well as tagging the message as a whole with the OSD's
+current published epoch.
+
+MOSDPGOp/MOSDPGSubOp
+--------------------
+See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
+
+MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
+between OSDs to coordinate most non peering activities including replicating
+MOSDPGOp operations.
+
+OSD::require_same_or_newer map checks that the current OSDMap is at least
+as new as the map epoch indicated on the message. If not, the message is
+queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
+cannot violate the above conditions since any two messages will be queued
+in order of receipt and if a message is received with epoch e0, a later message
+from the same source must be at epoch at least e0. Note that two PGs from
+the same OSD count for these purposes as different sources for single PG
+messages. That is, messages from different PGs may be reordered.
+
+
+MOSDPGOps follow the following process:
+
+ 1. OSD::handle_op: validates permissions and crush mapping.
+ discard the request if they are not connected and the client cannot get the reply ( See OSD::op_is_discardable )
+ See OSDService::handle_misdirected_op
+ See PG::op_has_sufficient_caps
+ See OSD::require_same_or_newer_map
+ 2. OSD::enqueue_op
+
+MOSDSubOps follow the following process:
+
+ 1. OSD::handle_sub_op checks that sender is an OSD
+ 2. OSD::enqueue_op
+
+OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling OpWQ::queue which adds the op to the queue of the PG responsible for handling it.
+
+OSD::dequeue_op is then eventually called, with a lock on the PG. At
+this time, the op is passed to PG::do_request, which checks that:
+
+ 1. the PG map is new enough (PG::must_delay_op)
+ 2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps)
+ 3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill})
+ 4. the PG is active (PG::flushed boolean)
+ 5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY
+
+If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type:
+
+ 1. CEPH_MSG_OSD_OP is handled by PG::do_op
+ 2. MSG_OSD_SUBOP is handled by PG::do_sub_op
+ 3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply
+ 4. MSG_OSD_PG_SCAN is handled by PG::do_scan
+ 5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill
+
+CEPH_MSG_OSD_OP processing
+--------------------------
+
+PrimaryLogPG::do_op handles CEPH_MSG_OSD_OP op and will queue it
+
+ 1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing
+ 2. in waiting_for_active if the op may write but the scrubber is working
+ 3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing
+ 4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by")
+ 5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete
+ 6. in waiting_for_ack if an ack from another OSD is expected
+ 7. in waiting_for_ondisk if the op is waiting for a write to complete
+
+Peering Messages
+----------------
+See OSD::handle_pg_(notify|info|log|query)
+
+Peering messages are tagged with two epochs:
+
+ 1. epoch_sent: map epoch at which the message was sent
+ 2. query_epoch: map epoch at which the message triggering the message was sent
+
+These are the same in cases where there was no triggering message. We discard
+a peering message if the message's query_epoch if the PG in question has entered
+a new epoch (See PG::old_peering_evt, PG::queue_peering_event). Notifies,
+infos, notifies, and logs are all handled as PG::RecoveryMachine events and
+are wrapped by PG::queue_* by PG::CephPeeringEvts, which include the created
+state machine event along with epoch_sent and query_epoch in order to
+generically check PG::old_peering_message upon insertion and removal from the
+queue.
+
+Note, notifies, logs, and infos can trigger the creation of a PG. See
+OSD::get_or_create_pg.
+
+
diff --git a/src/ceph/doc/dev/osd_internals/osd_overview.rst b/src/ceph/doc/dev/osd_internals/osd_overview.rst
new file mode 100644
index 0000000..192ddf8
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/osd_overview.rst
@@ -0,0 +1,106 @@
+===
+OSD
+===
+
+Concepts
+--------
+
+*Messenger*
+ See src/msg/Messenger.h
+
+ Handles sending and receipt of messages on behalf of the OSD. The OSD uses
+ two messengers:
+
+ 1. cluster_messenger - handles traffic to other OSDs, monitors
+ 2. client_messenger - handles client traffic
+
+ This division allows the OSD to be configured with different interfaces for
+ client and cluster traffic.
+
+*Dispatcher*
+ See src/msg/Dispatcher.h
+
+ OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
+ which serves as the entry point for messages received via either the client
+ or cluster messenger. Because there are two messengers, ms_dispatch may be
+ called from at least two threads. The osd_lock is always held during
+ ms_dispatch.
+
+*WorkQueue*
+ See src/common/WorkQueue.h
+
+ The WorkQueue class abstracts the process of queueing independent tasks
+ for asynchronous execution. Each OSD process contains workqueues for
+ distinct tasks:
+
+ 1. OpWQ: handles ops (from clients) and subops (from other OSDs).
+ Runs in the op_tp threadpool.
+ 2. PeeringWQ: handles peering tasks and pg map advancement
+ Runs in the op_tp threadpool.
+ See Peering
+ 3. CommandWQ: handles commands (pg query, etc)
+ Runs in the command_tp threadpool.
+ 4. RecoveryWQ: handles recovery tasks.
+ Runs in the recovery_tp threadpool.
+ 5. SnapTrimWQ: handles snap trimming
+ Runs in the disk_tp threadpool.
+ See SnapTrimmer
+ 6. ScrubWQ: handles primary scrub path
+ Runs in the disk_tp threadpool.
+ See Scrub
+ 7. ScrubFinalizeWQ: handles primary scrub finalize
+ Runs in the disk_tp threadpool.
+ See Scrub
+ 8. RepScrubWQ: handles replica scrub path
+ Runs in the disk_tp threadpool
+ See Scrub
+ 9. RemoveWQ: Asynchronously removes old pg directories
+ Runs in the disk_tp threadpool
+ See PGRemoval
+
+*ThreadPool*
+ See src/common/WorkQueue.h
+ See also above.
+
+ There are 4 OSD threadpools:
+
+ 1. op_tp: handles ops and subops
+ 2. recovery_tp: handles recovery tasks
+ 3. disk_tp: handles disk intensive tasks
+ 4. command_tp: handles commands
+
+*OSDMap*
+ See src/osd/OSDMap.h
+
+ The crush algorithm takes two inputs: a picture of the cluster
+ with status information about which nodes are up/down and in/out,
+ and the pgid to place. The former is encapsulated by the OSDMap.
+ Maps are numbered by *epoch* (epoch_t). These maps are passed around
+ within the OSD as std::tr1::shared_ptr<const OSDMap>.
+
+ See MapHandling
+
+*PG*
+ See src/osd/PG.* src/osd/PrimaryLogPG.*
+
+ Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto
+ OSDs. The PG structure is responsible for handling requests pertaining to
+ a particular *PG* as well as for maintaining relevant metadata and controlling
+ recovery.
+
+*OSDService*
+ See src/osd/OSD.cc OSDService
+
+ The OSDService acts as a broker between PG threads and OSD state which allows
+ PGs to perform actions using OSD services such as workqueues and messengers.
+ This is still a work in progress. Future cleanups will focus on moving such
+ state entirely from the OSD into the OSDService.
+
+Overview
+--------
+ See src/ceph_osd.cc
+
+ The OSD process represents one leaf device in the crush hierarchy. There
+ might be one OSD process per physical machine, or more than one if, for
+ example, the user configures one OSD instance per disk.
+
diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.rst b/src/ceph/doc/dev/osd_internals/osd_throttles.rst
new file mode 100644
index 0000000..6739bd9
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/osd_throttles.rst
@@ -0,0 +1,93 @@
+=============
+OSD Throttles
+=============
+
+There are three significant throttles in the filestore: wbthrottle,
+op_queue_throttle, and a throttle based on journal usage.
+
+WBThrottle
+----------
+The WBThrottle is defined in src/os/filestore/WBThrottle.[h,cc] and
+included in FileStore as FileStore::wbthrottle. The intention is to
+bound the amount of outstanding IO we need to do to flush the journal.
+At the same time, we don't want to necessarily do it inline in case we
+might be able to combine several IOs on the same object close together
+in time. Thus, in FileStore::_write, we queue the fd for asyncronous
+flushing and block in FileStore::_do_op if we have exceeded any hard
+limits until the background flusher catches up.
+
+The relevant config options are filestore_wbthrottle*. There are
+different defaults for xfs and btrfs. Each set has hard and soft
+limits on bytes (total dirty bytes), ios (total dirty ios), and
+inodes (total dirty fds). The WBThrottle will begin flushing
+when any of these hits the soft limit and will block in throttle()
+while any has exceeded the hard limit.
+
+Tighter soft limits will cause writeback to happen more quickly,
+but may cause the OSD to miss oportunities for write coalescing.
+Tighter hard limits may cause a reduction in latency variance by
+reducing time spent flushing the journal, but may reduce writeback
+parallelism.
+
+op_queue_throttle
+-----------------
+The op queue throttle is intended to bound the amount of queued but
+uncompleted work in the filestore by delaying threads calling
+queue_transactions more and more based on how many ops and bytes are
+currently queued. The throttle is taken in queue_transactions and
+released when the op is applied to the filesystem. This period
+includes time spent in the journal queue, time spent writing to the
+journal, time spent in the actual op queue, time spent waiting for the
+wbthrottle to open up (thus, the wbthrottle can push back indirectly
+on the queue_transactions caller), and time spent actually applying
+the op to the filesystem. A BackoffThrottle is used to gradually
+delay the queueing thread after each throttle becomes more than
+filestore_queue_low_threshhold full (a ratio of
+filestore_queue_max_(bytes|ops)). The throttles will block once the
+max value is reached (filestore_queue_max_(bytes|ops)).
+
+The significant config options are:
+filestore_queue_low_threshhold
+filestore_queue_high_threshhold
+filestore_expected_throughput_ops
+filestore_expected_throughput_bytes
+filestore_queue_high_delay_multiple
+filestore_queue_max_delay_multiple
+
+While each throttle is at less than low_threshhold of the max,
+no delay happens. Between low and high, the throttle will
+inject a per-op delay (per op or byte) ramping from 0 at low to
+high_delay_multiple/expected_throughput at high. From high to
+1, the delay will ramp from high_delay_multiple/expected_throughput
+to max_delay_multiple/expected_throughput.
+
+filestore_queue_high_delay_multiple and
+filestore_queue_max_delay_multiple probably do not need to be
+changed.
+
+Setting these properly should help to smooth out op latencies by
+mostly avoiding the hard limit.
+
+See FileStore::throttle_ops and FileSTore::thottle_bytes.
+
+journal usage throttle
+----------------------
+See src/os/filestore/JournalThrottle.h/cc
+
+The intention of the journal usage throttle is to gradually slow
+down queue_transactions callers as the journal fills up in order
+to smooth out hiccup during filestore syncs. JournalThrottle
+wraps a BackoffThrottle and tracks journaled but not flushed
+journal entries so that the throttle can be released when the
+journal is flushed. The configs work very similarly to the
+op_queue_throttle.
+
+The significant config options are:
+journal_throttle_low_threshhold
+journal_throttle_high_threshhold
+filestore_expected_throughput_ops
+filestore_expected_throughput_bytes
+journal_throttle_high_multiple
+journal_throttle_max_multiple
+
+.. literalinclude:: osd_throttles.txt
diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.txt b/src/ceph/doc/dev/osd_internals/osd_throttles.txt
new file mode 100644
index 0000000..0332377
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/osd_throttles.txt
@@ -0,0 +1,21 @@
+ Messenger throttle (number and size)
+ |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ FileStore op_queue throttle (number and size, includes a soft throttle based on filestore_expected_throughput_(ops|bytes))
+ |--------------------------------------------------------|
+ WBThrottle
+ |---------------------------------------------------------------------------------------------------------|
+ Journal (size, includes a soft throttle based on filestore_expected_throughput_bytes)
+ |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ |----------------------------------------------------------------------------------------------------> flushed ----------------> synced
+ |
+Op: Read Header --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> op_applied -------------------------------------------------------------> Complete
+ | |
+SubOp: --Messenger--> ReadHeader --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> sub_op_applied -
+ |
+ |-----------------------------> flushed ----------------> synced
+ |------------------------------------------------------------------------------------------|
+ Journal (size)
+ |---------------------------------|
+ WBThrottle
+ |-----------------------------------------------------|
+ FileStore op_queue throttle (number and size)
diff --git a/src/ceph/doc/dev/osd_internals/pg.rst b/src/ceph/doc/dev/osd_internals/pg.rst
new file mode 100644
index 0000000..4055363
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/pg.rst
@@ -0,0 +1,31 @@
+====
+PG
+====
+
+Concepts
+--------
+
+*Peering Interval*
+ See PG::start_peering_interval.
+ See PG::acting_up_affected
+ See PG::RecoveryState::Reset
+
+ A peering interval is a maximal set of contiguous map epochs in which the
+ up and acting sets did not change. PG::RecoveryMachine represents a
+ transition from one interval to another as passing through
+ RecoveryState::Reset. On PG::RecoveryState::AdvMap PG::acting_up_affected can
+ cause the pg to transition to Reset.
+
+
+Peering Details and Gotchas
+---------------------------
+For an overview of peering, see `Peering <../../peering>`_.
+
+ * PG::flushed defaults to false and is set to false in
+ PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started
+ we send a transaction through the pg op sequencer which, upon complete,
+ sends a FlushedEvt which sets flushed to true. The primary cannot go
+ active until this happens (See PG::RecoveryState::WaitFlushedPeering).
+ Replicas can go active but cannot serve ops (writes or reads).
+ This is necessary because we cannot read our ondisk state until unstable
+ transactions from the previous interval have cleared.
diff --git a/src/ceph/doc/dev/osd_internals/pg_removal.rst b/src/ceph/doc/dev/osd_internals/pg_removal.rst
new file mode 100644
index 0000000..d968ecc
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/pg_removal.rst
@@ -0,0 +1,56 @@
+==========
+PG Removal
+==========
+
+See OSD::_remove_pg, OSD::RemoveWQ
+
+There are two ways for a pg to be removed from an OSD:
+
+ 1. MOSDPGRemove from the primary
+ 2. OSD::advance_map finds that the pool has been removed
+
+In either case, our general strategy for removing the pg is to
+atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to
+backfill and asynronously remove the pg collections. We do not do
+this inline because scanning the collections to remove the objects is
+an expensive operation.
+
+OSDService::deleting_pgs tracks all pgs in the process of being
+deleted. Each DeletingState object in deleting_pgs lives while at
+least one reference to it remains. Each item in RemoveWQ carries a
+reference to the DeletingState for the relevant pg such that
+deleting_pgs.lookup(pgid) will return a null ref only if there are no
+collections currently being deleted for that pg.
+
+The DeletingState for a pg also carries information about the status
+of the current deletion and allows the deletion to be cancelled.
+The possible states are:
+
+ 1. QUEUED: the PG is in the RemoveWQ
+ 2. CLEARING_DIR: the PG's contents are being removed synchronously
+ 3. DELETING_DIR: the PG's directories and metadata being queued for removal
+ 4. DELETED_DIR: the final removal transaction has been queued
+ 5. CANCELED: the deletion has been canceled
+
+In 1 and 2, the deletion can be canceled. Each state transition
+method (and check_canceled) returns false if deletion has been
+canceled and true if the state transition was successful. Similarly,
+try_stop_deletion() returns true if it succeeds in canceling the
+deletion. Additionally, try_stop_deletion() in the event that it
+fails to stop the deletion will not return until the final removal
+transaction is queued. This ensures that any operations queued after
+that point will be ordered after the pg deletion.
+
+OSD::_create_lock_pg must handle two cases:
+
+ 1. Either there is no DeletingStateRef for the pg, or it failed to cancel
+ 2. We succeeded in canceling the deletion.
+
+In case 1., we proceed as if there were no deletion occurring, except that
+we avoid writing to the PG until the deletion finishes. In case 2., we
+proceed as in case 1., except that we first mark the PG as backfilling.
+
+Similarly, OSD::osr_registry ensures that the OpSequencers for those
+pgs can be reused for a new pg if created before the old one is fully
+removed, ensuring that operations on the new pg are sequenced properly
+with respect to operations on the old one.
diff --git a/src/ceph/doc/dev/osd_internals/pgpool.rst b/src/ceph/doc/dev/osd_internals/pgpool.rst
new file mode 100644
index 0000000..45a252b
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/pgpool.rst
@@ -0,0 +1,22 @@
+==================
+PGPool
+==================
+
+PGPool is a structure used to manage and update the status of removed
+snapshots. It does this by maintaining two fields, cached_removed_snaps - the
+current removed snap set and newly_removed_snaps - newly removed snaps in the
+last epoch. In OSD::load_pgs the osd map is recovered from the pg's file store
+and passed down to OSD::_get_pool where a PGPool object is initialised with the
+map.
+
+With each new map we receive we call PGPool::update with the new map. In that
+function we build a list of newly removed snaps
+(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps.
+This function included checks to make sure we only do this update when things
+have changed or there has been a map gap.
+
+When we activate the pg we initialise the snap trim queue from
+cached_removed_snaps and subtract the purged_snaps we have already purged
+leaving us with the list of snaps that need to be trimmed. Trimming is later
+performed asynchronously by the snap_trim_wq.
+
diff --git a/src/ceph/doc/dev/osd_internals/recovery_reservation.rst b/src/ceph/doc/dev/osd_internals/recovery_reservation.rst
new file mode 100644
index 0000000..4ab0319
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/recovery_reservation.rst
@@ -0,0 +1,75 @@
+====================
+Recovery Reservation
+====================
+
+Recovery reservation extends and subsumes backfill reservation. The
+reservation system from backfill recovery is used for local and remote
+reservations.
+
+When a PG goes active, first it determines what type of recovery is
+necessary, if any. It may need log-based recovery, backfill recovery,
+both, or neither.
+
+In log-based recovery, the primary first acquires a local reservation
+from the OSDService's local_reserver. Then a MRemoteReservationRequest
+message is sent to each replica in order of OSD number. These requests
+will always be granted (i.e., cannot be rejected), but they may take
+some time to be granted if the remotes have already granted all their
+remote reservation slots.
+
+After all reservations are acquired, log-based recovery proceeds as it
+would without the reservation system.
+
+After log-based recovery completes, the primary releases all remote
+reservations. The local reservation remains held. The primary then
+determines whether backfill is necessary. If it is not necessary, the
+primary releases its local reservation and waits in the Recovered state
+for all OSDs to indicate that they are clean.
+
+If backfill recovery occurs after log-based recovery, the local
+reservation does not need to be reacquired since it is still held from
+before. If it occurs immediately after activation (log-based recovery
+not possible/necessary), the local reservation is acquired according to
+the typical process.
+
+Once the primary has its local reservation, it requests a remote
+reservation from the backfill target. This reservation CAN be rejected,
+for instance if the OSD is too full (backfillfull_ratio osd setting).
+If the reservation is rejected, the primary drops its local
+reservation, waits (osd_backfill_retry_interval), and then retries. It
+will retry indefinitely.
+
+Once the primary has the local and remote reservations, backfill
+proceeds as usual. After backfill completes the remote reservation is
+dropped.
+
+Finally, after backfill (or log-based recovery if backfill was not
+necessary), the primary drops the local reservation and enters the
+Recovered state. Once all the PGs have reported they are clean, the
+primary enters the Clean state and marks itself active+clean.
+
+
+--------------
+Things to Note
+--------------
+
+We always grab the local reservation first, to prevent a circular
+dependency. We grab remote reservations in order of OSD number for the
+same reason.
+
+The recovery reservation state chart controls the PG state as reported
+to the monitor. The state chart can set:
+
+ - recovery_wait: waiting for local/remote reservations
+ - recovering: recovering
+ - recovery_toofull: recovery stopped, OSD(s) above full ratio
+ - backfill_wait: waiting for remote backfill reservations
+ - backfilling: backfilling
+ - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio
+
+
+--------
+See Also
+--------
+
+The Active substate of the automatically generated OSD state diagram.
diff --git a/src/ceph/doc/dev/osd_internals/scrub.rst b/src/ceph/doc/dev/osd_internals/scrub.rst
new file mode 100644
index 0000000..3343b39
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/scrub.rst
@@ -0,0 +1,30 @@
+
+Scrubbing Behavior Table
+========================
+
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Flags | none | noscrub | nodeep_scrub | noscrub/nodeep_scrub |
++=================================================+==========+===========+===============+======================+
+| Periodic tick | S | X | S | X |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Periodic tick after osd_deep_scrub_interval | D | D | S | X |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated scrub | S | S | S | S |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated scrub after osd_deep_scrub_interval | D | D | S | S |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated deep scrub | D | D | D | D |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+
+- X = Do nothing
+- S = Do regular scrub
+- D = Do deep scrub
+
+State variables
+---------------
+
+- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep
+- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep
+- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep
+- Initiated scrub after osd_deep_scrub_interval state is must scrub && !must_deep_scrub && time_for_deep
+- Initiated deep scrub state is must_scrub && must_deep_scrub
diff --git a/src/ceph/doc/dev/osd_internals/snaps.rst b/src/ceph/doc/dev/osd_internals/snaps.rst
new file mode 100644
index 0000000..e17378f
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/snaps.rst
@@ -0,0 +1,128 @@
+======
+Snaps
+======
+
+Overview
+--------
+Rados supports two related snapshotting mechanisms:
+
+ 1. *pool snaps*: snapshots are implicitely applied to all objects
+ in a pool
+ 2. *self managed snaps*: the user must provide the current *SnapContext*
+ on each write.
+
+These two are mutually exclusive, only one or the other can be used on
+a particular pool.
+
+The *SnapContext* is the set of snapshots currently defined for an object
+as well as the most recent snapshot (the *seq*) requested from the mon for
+sequencing purposes (a *SnapContext* with a newer *seq* is considered to
+be more recent).
+
+The difference between *pool snaps* and *self managed snaps* from the
+OSD's point of view lies in whether the *SnapContext* comes to the OSD
+via the client's MOSDOp or via the most recent OSDMap.
+
+See OSD::make_writeable
+
+Ondisk Structures
+-----------------
+Each object has in the pg collection a *head* object (or *snapdir*, which we
+will come to shortly) and possibly a set of *clone* objects.
+Each hobject_t has a snap field. For the *head* (the only writeable version
+of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
+snap field is set to the *seq* of the *SnapContext* at their creation.
+When the OSD services a write, it first checks whether the most recent
+*clone* is tagged with a snapid prior to the most recent snap represented
+in the *SnapContext*. If so, at least one snapshot has occurred between
+the time of the write and the time of the last clone. Therefore, prior
+to performing the mutation, the OSD creates a new clone for servicing
+reads on snaps between the snapid of the last clone and the most recent
+snapid.
+
+The *head* object contains a *SnapSet* encoded in an attribute, which tracks
+
+ 1. The full set of snaps defined for the object
+ 2. The full set of clones which currently exist
+ 3. Overlapping intervals between clones for tracking space usage
+ 4. Clone size
+
+If the *head* is deleted while there are still clones, a *snapdir* object
+is created instead to house the *SnapSet*.
+
+Additionally, the *object_info_t* on each clone includes a vector of snaps
+for which clone is defined.
+
+Snap Removal
+------------
+To remove a snapshot, a request is made to the *Monitor* cluster to
+add the snapshot id to the list of purged snaps (or to remove it from
+the set of pool snaps in the case of *pool snaps*). In either case,
+the *PG* adds the snap to its *snap_trimq* for trimming.
+
+A clone can be removed when all of its snaps have been removed. In
+order to determine which clones might need to be removed upon snap
+removal, we maintain a mapping from snap to *hobject_t* using the
+*SnapMapper*.
+
+See PrimaryLogPG::SnapTrimmer, SnapMapper
+
+This trimming is performed asynchronously by the snap_trim_wq while the
+pg is clean and not scrubbing.
+
+ #. The next snap in PG::snap_trimq is selected for trimming
+ #. We determine the next object for trimming out of PG::snap_mapper.
+ For each object, we create a log entry and repop updating the
+ object info and the snap set (including adjusting the overlaps).
+ If the object is a clone which no longer belongs to any live snapshots,
+ it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
+ is empty.)
+ #. We also locally update our *SnapMapper* instance with the object's
+ new snaps.
+ #. The log entry containing the modification of the object also
+ contains the new set of snaps, which the replica uses to update
+ its own *SnapMapper* instance.
+ #. The primary shares the info with the replica, which persists
+ the new set of purged_snaps along with the rest of the info.
+
+
+
+Recovery
+--------
+Because the trim operations are implemented using repops and log entries,
+normal pg peering and recovery maintain the snap trimmer operations with
+the caveat that push and removal operations need to update the local
+*SnapMapper* instance. If the purged_snaps update is lost, we merely
+retrim a now empty snap.
+
+SnapMapper
+----------
+*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
+which provides an interface over a backing store such as the filesystem
+with async transactions. While transactions are incomplete, the map_cacher
+instance buffers unstable keys allowing consistent access without having
+to flush the filestore. *SnapMapper* provides two mappings:
+
+ 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
+ object
+ 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot
+ as one of its snaps
+
+Assumption: there are lots of hobjects and relatively few snaps. The
+first encoding has a stringification of the object as the key and an
+encoding of the set of snaps as a value. The second mapping, because there
+might be many hobjects for a single snap, is stored as a collection of keys
+of the form stringify(snap)_stringify(object) such that stringify(snap)
+is constant length. These keys have a bufferlist encoding
+pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
+object does not involve reading all objects for any snap. Additionally,
+upon construction, the *SnapMapper* is provided with a mask for filtering
+the objects in the single SnapMapper keyspace belonging to that pg.
+
+Split
+-----
+The snapid_t -> hobject_t key entries are arranged such that for any pg,
+up to 8 prefixes need to be checked to determine all hobjects in a particular
+snap for a particular pg. Upon split, the prefixes to check on the parent
+are adjusted such that only the objects remaining in the pg will be visible.
+The children will immediately have the correct mapping.
diff --git a/src/ceph/doc/dev/osd_internals/watch_notify.rst b/src/ceph/doc/dev/osd_internals/watch_notify.rst
new file mode 100644
index 0000000..8c2ce09
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/watch_notify.rst
@@ -0,0 +1,81 @@
+============
+Watch Notify
+============
+
+See librados for the watch/notify interface.
+
+Overview
+--------
+The object_info (See osd/osd_types.h) tracks the set of watchers for
+a particular object persistently in the object_info_t::watchers map.
+In order to track notify progress, we also maintain some ephemeral
+structures associated with the ObjectContext.
+
+Each Watch has an associated Watch object (See osd/Watch.h). The
+ObjectContext for a watched object will have a (strong) reference
+to one Watch object per watch, and each Watch object holds a
+reference to the corresponding ObjectContext. This circular reference
+is deliberate and is broken when the Watch state is discarded on
+a new peering interval or removed upon timeout expiration or an
+unwatch operation.
+
+A watch tracks the associated connection via a strong
+ConnectionRef Watch::conn. The associated connection has a
+WatchConState stashed in the OSD::Session for tracking associated
+Watches in order to be able to notify them upon ms_handle_reset()
+(via WatchConState::reset()).
+
+Each Watch object tracks the set of currently un-acked notifies.
+start_notify() on a Watch object adds a reference to a new in-progress
+Notify to the Watch and either:
+
+* if the Watch is *connected*, sends a Notify message to the client
+* if the Watch is *unconnected*, does nothing.
+
+When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects),
+Notifies are resent to all remaining tracked Notify objects.
+
+Each Notify object tracks the set of un-notified Watchers via
+calls to complete_watcher(). Once the remaining set is empty or the
+timeout expires (cb, registered in init()) a notify completion
+is sent to the client.
+
+Watch Lifecycle
+---------------
+A watch may be in one of 5 states:
+
+1. Non existent.
+2. On disk, but not registered with an object context.
+3. Connected
+4. Disconnected, callback registered with timer
+5. Disconnected, callback in queue for scrub or is_degraded
+
+Case 2 occurs between when an OSD goes active and the ObjectContext
+for an object with watchers is loaded into memory due to an access.
+During Case 2, no state is registered for the watch. Case 2
+transitions to Case 4 in PrimaryLogPG::populate_obc_watchers() during
+PrimaryLogPG::find_object_context. Case 1 becomes case 3 via
+OSD::do_osd_op_effects due to a watch operation. Case 4,5 become case
+3 in the same way. Case 3 becomes case 4 when the connection resets
+on a watcher's session.
+
+Cases 4&5 can use some explanation. Normally, when a Watch enters Case
+4, a callback is registered with the OSDService::watch_timer to be
+called at timeout expiration. At the time that the callback is
+called, however, the pg might be in a state where it cannot write
+to the object in order to remove the watch (i.e., during a scrub
+or while the object is degraded). In that case, we use
+Watch::get_delayed_cb() to generate another Context for use from
+the callbacks_for_degraded_object and Scrubber::callbacks lists.
+In either case, Watch::unregister_cb() does the right thing
+(SafeTimer::cancel_event() is harmless for contexts not registered
+with the timer).
+
+Notify Lifecycle
+----------------
+The notify timeout is simpler: a timeout callback is registered when
+the notify is init()'d. If all watchers ack notifies before the
+timeout occurs, the timeout is canceled and the client is notified
+of the notify completion. Otherwise, the timeout fires, the Notify
+object pings each Watch via cancel_notify to remove itself, and
+sends the notify completion to the client early.
diff --git a/src/ceph/doc/dev/osd_internals/wbthrottle.rst b/src/ceph/doc/dev/osd_internals/wbthrottle.rst
new file mode 100644
index 0000000..a3ae00d
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/wbthrottle.rst
@@ -0,0 +1,28 @@
+==================
+Writeback Throttle
+==================
+
+Previously, the filestore had a problem when handling large numbers of
+small ios. We throttle dirty data implicitely via the journal, but
+a large number of inodes can be dirtied without filling the journal
+resulting in a very long sync time when the sync finally does happen.
+The flusher was not an adequate solution to this problem since it
+forced writeback of small writes too eagerly killing performance.
+
+WBThrottle tracks unflushed io per hobject_t and ::fsyncs in lru
+order once the start_flusher threshold is exceeded for any of
+dirty bytes, dirty ios, or dirty inodes. While any of these exceed
+the hard_limit, we block on throttle() in _do_op.
+
+See src/os/WBThrottle.h, src/osd/WBThrottle.cc
+
+To track the open FDs through the writeback process, there is now an
+fdcache to cache open fds. lfn_open now returns a cached FDRef which
+implicitely closes the fd once all references have expired.
+
+Filestore syncs have a sideeffect of flushing all outstanding objects
+in the wbthrottle.
+
+lfn_unlink clears the cached FDRef and wbthrottle entries for the
+unlinked object when the last link is removed and asserts that all
+outstanding FDRefs for that object are dead.